
A weird thing on CR and LF

 
Tracy Tse
Greenhorn
Posts: 22
Dear all, I encountered a weird problem. Let me explain.
First of all, I wrote a program to create a text file called unicode.txt, encoded using the UTF-16LE charset.

The content was only a single '\n'. I verified that the file size was 4 bytes, and I also checked the binary representation in UltraEdit.

Then I wrote another program that tries to read the file byte by byte,

and the output just drove me crazy. It was:
10
0(null)10
0(null)10
End of stream 5 bytes read.

P.S. (null) stands for the character whose ASCII value is zero.
So my question is: why is there no 13 (0x0D) in the output, and where does the last 10 come from?

Please explain.

Thanks in advance !
 
Ulf Dittmer
Rancher
Posts: 42972
73
You need to be careful about encodings; any time you use methods like getBytes() and toString() without specifying an encoding, you risk conversion problems. Rewrite the code to deal only with bytes, not characters and strings, and you should get exactly what's in the file.
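A minimal sketch of the bytes-only approach, for illustration. The class name `RawByteDump` and the output format are assumptions, not the code from this thread, and `Files.readAllBytes` needs Java 7 or later. No Reader, String, or charset is involved, so nothing is converted on the way.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RawByteDump {
    // Formats every byte as an unsigned decimal value, one per line,
    // followed by a summary line with the total count.
    static String dump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(b & 0xFF).append('\n');
        }
        sb.append("End of stream, ").append(data.length).append(" bytes read.");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "unicode.txt" is the file name from the original post.
        Path file = Paths.get(args.length > 0 ? args[0] : "unicode.txt");
        System.out.println(dump(Files.readAllBytes(file)));
    }
}
```

For a file holding CR LF in UTF-16LE (the four bytes 0D 00 0A 00), this prints 13, 0, 10, 0 and reports 4 bytes.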
 
Rob Spoor
Sheriff
Posts: 21092
85
Tracy Tse wrote:

Wait, what? You first read the entire file into a String, then convert that into a byte[], then read from that again? Why not replace it with this:


That can be simplified without creating a new Integer and Character object:
Because in the first two ways the first value is a String, all + operations perform a String concatenation. The third form is actually what the second form does without appending the empty String at the start.
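The snippets this post refers to are not preserved, so the following is only a hypothetical illustration of the concatenation rule described above; the method names are made up. When the leftmost operand is a String, every following + is a concatenation. When it is not, the first + is ordinary integer addition.

```java
class ConcatForms {
    // Leftmost operand is a String, so both + operators concatenate.
    static String stringFirst(int b) {
        return "" + b + (char) b;
    }

    // Leftmost operands are an int and a char, so the first + adds them
    // numerically; only the final + "" turns the sum into a String.
    static String numberFirst(int b) {
        return b + (char) b + "";
    }

    public static void main(String[] args) {
        System.out.println(stringFirst(65)); // prints 65A
        System.out.println(numberFirst(65)); // prints 130
    }
}
```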
 
Tracy Tse
Greenhorn
Posts: 22
Ulf Dittmer wrote:You need to be careful about encodings; any time you use methods like getBytes() and toString() without specifying an encoding you risk conversion problems. Rewrite the code to only deal with bytes -not characters and strings- and you should get exactly what's in the file.

I appreciate your advice, but I just could not figure out what the problem is in essence.

My file was encoded using UTF-16LE, and now I want to read the file byte by byte (i.e. treat it as an ASCII text file), so I use the getBytes method without specifying an
encoding (by default it uses the platform's default charset; my OS is Windows XP SP3).
 
Rob Spoor
Sheriff
Posts: 21092
85
What does BufferedInputFile.read look like? If that code is not using UTF-16LE for reading the contents then that's where the problem lies.
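To make the point concrete, here is a sketch that reproduces output of the shape reported in the first post. It rests on two assumptions that the thread does not confirm at this point: that the helper reads with `BufferedReader.readLine()` and appends '\n' after each line, and that the platform default charset maps the bytes 0x00, 0x0A and 0x0D to the same code points (ISO-8859-1 stands in for it here). The method names `readLines` and `codes` are made up.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Hypothetical stand-in for the helper: decode the bytes with the given
    // charset, read line by line, and append '\n' after every line.
    static String readLines(byte[] fileBytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Lists the numeric value of every char, separated by spaces.
    static String codes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] crlfUtf16le = {0x0D, 0x00, 0x0A, 0x00}; // CR LF in UTF-16LE
        // Single-byte decoding sees four chars: CR, NUL, LF, NUL.
        System.out.println(codes(readLines(crlfUtf16le, StandardCharsets.ISO_8859_1))); // prints 10 0 10 0 10
        // Correct decoding sees one CR LF line terminator.
        System.out.println(codes(readLines(crlfUtf16le, StandardCharsets.UTF_16LE)));   // prints 10
    }
}
```

Under these assumptions, 13 never shows up because readLine() strips every line terminator (CR, LF, or CR LF) and the helper puts back a plain '\n'. With the wrong charset the CR and the LF are separated by a NUL char, so they count as two separate terminators, and the trailing NUL forms a third line that also gets a '\n' appended: five chars in total.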
 
Tracy Tse
Greenhorn
Posts: 22
Rob Prime wrote:
Tracy Tse wrote:

Wait, what? You first read the entire file into a String, then convert that into a byte[], then read from that again? Why not replace it with this:


That can be simplified without creating a new Integer and Character object:
Because in the first two ways the first value is a String, all + operations perform a String concatenation. The third form is actually what the second form does without appending the empty String at the start.

Thanks for the code optimization suggestions. I rewrote the code according to your thoughts, and it works.
So I guess the problem is below

And the source code for the implementation of BufferedInputFile is below

What do you think?
 
Tracy Tse
Greenhorn
Posts: 22
Rob Prime wrote:What does BufferedInputFile.read look like? If that code is not using UTF-16LE for reading the contents then that's where the problem lies.

Please see my previous reply!
 
Rob Spoor
Sheriff
Posts: 21092
85
new FileReader always uses the default encoding. Replace it with "new InputStreamReader(new FileInputStream(filename), "UTF-16LE")" and see if that works.
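A sketch of what that replacement could look like in a small reader. The class name `Utf16LeFileRead` is made up, and the try-with-resources syntax needs Java 7 or later. It reads char by char instead of line by line, so line terminators come through unchanged.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

class Utf16LeFileRead {
    // Decodes the whole file as UTF-16LE and returns its chars unchanged.
    static String read(String filename) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), "UTF-16LE"))) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "unicode.txt" is the file name from the original post.
        String text = read(args.length > 0 ? args[0] : "unicode.txt");
        for (int i = 0; i < text.length(); i++) {
            System.out.println((int) text.charAt(i));
        }
    }
}
```

With this, a file containing the bytes 0D 00 0A 00 yields exactly two chars, 13 and 10.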
 
Tracy Tse
Greenhorn
Posts: 22
Thanks, it works.
 
Rob Spoor
Sheriff
Posts: 21092
85
You're welcome.
If the BufferedInputFile.read method is used in more places with different encodings you may want to consider adding the encoding as a parameter. You can overload the method to use a default encoding:
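The snippet originally posted here is not preserved. One way the overload could look is sketched below; the method body (readLine() plus an appended '\n') is an assumption about the helper, and only the class and method names come from the thread.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class BufferedInputFile {
    // Keeps the old one-argument call working by falling back to the
    // platform default encoding.
    static String read(String filename) throws IOException {
        return read(filename, Charset.defaultCharset().name());
    }

    // Reads the file with the given encoding, line by line. Every line
    // terminator (CR, LF, or CR LF) is normalized to a single '\n'.
    static String read(String filename, String encoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), encoding))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Callers that know the file's encoding would then write `BufferedInputFile.read("unicode.txt", "UTF-16LE")`, while existing callers keep compiling unchanged.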
 