
Problem with writing a String containing both English and non-English chars into a text file

 
Joseph Sweet
Ranch Hand
Posts: 327
Hi All,

I am on Windows XP, with the latest Java SDK and the latest Eclipse.

I am reading a string from an input file that contains both English and non-English characters. (I have installed the required Windows support for the other language, so I can read it in Windows apps such as Notepad.)

While debugging with Eclipse, I see that the string seems to contain the data correctly: both the English and the non-English chars seem to be okay.

Now I am converting the string to a byte array and writing it to the output file. But when I open that file with Notepad, I see that all the non-English chars were converted to question marks.

What might be the problem?

Here is the relevant code:




 
Rob Spoor
Sheriff
Posts: 21048
I suggest you try with a proper editor first. Notepad is barely worthy of the name "text editor". In fact, it is so limited (e.g. encodings, line breaks, file size) you can barely call it a program. Notepad++ or PSPad are both free and are said to be quite good.

If you still see problems, then check the encoding you are using when writing to the file. Check out String.getBytes(String) or String.getBytes(Charset).
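To illustrate the suggestion, here is a minimal sketch of writing a string to a file with an explicitly chosen charset and reading it back with the same one. It is not the original poster's code: the class name, the temporary file, and the choice of UTF-8 are assumptions made for the example, and it relies on java.nio.file and StandardCharsets, which need Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetWrite {

    // Encodes the text with the given charset and writes the raw bytes to the file.
    public static void write(Path file, String text, Charset charset) throws IOException {
        Files.write(file, text.getBytes(charset));
    }

    // Reads the raw bytes back and decodes them with the given charset.
    public static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file, created in the temp directory for the example.
        Path file = Files.createTempFile("mixed-text", ".txt");

        // Mixed English and Polish text; the escapes avoid depending on the
        // encoding of this source file.
        String text = "English and Polish: \u0142\u00f3d\u017a";

        write(file, text, StandardCharsets.UTF_8);
        String roundTrip = read(file, StandardCharsets.UTF_8);

        // The text survives only if the writer and the reader agree on the charset.
        System.out.println(text.equals(roundTrip));

        Files.delete(file);
    }
}
```

The no-argument String.getBytes() uses the platform default charset instead, so the same program can behave differently from one machine to another.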
 
Joseph Sweet
Ranch Hand
Posts: 327
Thanks for the idea of String.getBytes(Charset); it worked this time.
But I do not understand why the Charset is relevant when converting a String into bytes. Why does it matter which Charset the bytes are expected to be read with? After all, am I not just taking every byte in the String and copying it to the next position in the byte array? I can't see what that has to do with the Charset with which I would later want to interpret the byte array.

P.S. I do have the PSPad editor, but for some reason it sometimes crashes on large files.
 
Ireneusz Kordal
Ranch Hand
Posts: 423
Joseph Sweet wrote:
But I do not understand the concept of the pertinence of Charset while converting a String into Bytes....

Each char in the string is a two-byte (16-bit) Unicode character, stored as a UTF-16 code unit.
A Charset provides the rules for mapping char values (two bytes, 65,536 possible values) to byte values (one byte, 256 possible values each).

Look at this example, the conversion of the Polish 'ł' char using different charsets:

The result is:

The Polish character 'ł' is supported by the single-byte encodings windows-1250 and ISO-8859-2, and for these two encodings the conversion works fine.
The others give strange results.
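The code and output from this post did not survive in the thread, so what follows is only a sketch of that kind of comparison, not the poster's original example. The class name, the helper methods, and the list of charsets are assumptions. Java guarantees only a few charsets (US-ASCII, ISO-8859-1, UTF-8, and the UTF-16 variants) on every runtime, so windows-1250 and ISO-8859-2 may be missing from a stripped-down JVM.

```java
import java.nio.charset.Charset;

public class EncodeLetter {

    // Formats bytes as space-separated, two-digit, uppercase hex values.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text with the named charset and returns the bytes as hex.
    public static String encode(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String text = "\u0142"; // the Polish letter 'l with stroke'

        // Charsets chosen for illustration; only some are guaranteed to exist.
        String[] names = {"US-ASCII", "ISO-8859-1", "ISO-8859-2", "windows-1250", "UTF-8"};

        for (String name : names) {
            System.out.println(name + ": " + encode(text, name));
        }
    }
}
```

On a typical JDK this should print 3F for US-ASCII and ISO-8859-1, B3 for ISO-8859-2 and windows-1250, and C5 82 for UTF-8. The byte 3F is the question mark: String.getBytes substitutes the charset's replacement byte for any character it cannot map, which matches the question marks seen in the output file at the start of the thread.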
 