
i/o streams and charsets...help

 
jim merrell
Greenhorn
Posts: 1
Hello,
I was wondering if anyone could help clarify something for me. I don't really know how character sets work, so by all means let me know if my thinking is all wrong. Sorry in advance for the long post.
Say user 1 has a system default charset of ASCII. They write a message in a JTextArea and hit the save button. The program calls JTextArea.write(myFileWriter), which saves the text to a file (using the system default charset). They send the file to another user whose default charset is UTF-16. If the program simply loads the file into the JTextArea using JTextArea.read(myFileReader), wouldn't the text message get jumbled up? The UTF-16 machine would be reading two bytes per character when in fact the file was written out as 1 byte per character. The same is true the other way around. When the ASCII user loads a UTF-16 file, it would treat each byte as 1 character when in fact two bytes represent one character. That is where the confusion is.
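To make the scenario concrete, here is a minimal sketch that encodes a string with one charset and decodes the same bytes with another. The class and method names (CharsetMismatchDemo, roundTrip) are made up for illustration, and the two "machines" are simulated by passing explicit charsets instead of relying on a system default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates writing on one machine and reading on another.
class CharsetMismatchDemo {
    // Encode the text with one charset, then decode the bytes with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        String message = "Hello";

        // Written as ASCII, read back as UTF-16: the bytes are paired up wrongly.
        String garbled = roundTrip(message, StandardCharsets.US_ASCII, StandardCharsets.UTF_16);
        System.out.println(garbled.equals(message)); // false

        // The same text takes a different number of bytes in each charset.
        System.out.println(message.getBytes(StandardCharsets.US_ASCII).length); // 5
        System.out.println(message.getBytes(StandardCharsets.UTF_16).length);   // 12 (2-byte BOM + 2 bytes per char)
    }
}
```

Using the same charset on both sides returns the original text unchanged, which is the whole point of agreeing on an encoding.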
The only way I could see to control this was to have a rule that says the files will always be in a specific format, say ASCII. Then before writing the contents to file, I would call String.getBytes("ASCII") on the text of the JTextArea. When doing this on the UTF-16 machine, I assume that if it encountered a char whose value was > 255 it would simply convert it to some char like "?" whose value was <= 255 so it would fit in 8 bits? Then I would write that byte[] to the output stream.
Then to load the file, instead of using JTextArea.read(), I would have to read the bytes into a byte array then create a new String using String(byte[], "ASCII") and pass that to the JTextArea?
Any dropped information from the UTF-16 file would simply show up as "?" on the ASCII machine. On the UTF-16 machine, everything would look fine? No double spaced characters or such?
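The approach described above can be sketched as follows. The class and method names (AsciiRoundTrip, toAscii, fromAscii) are hypothetical. One detail worth noting: US-ASCII is a 7-bit charset, so in Java any character above 127 (not 255) is replaced with "?" when encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for the "always store ASCII" rule.
class AsciiRoundTrip {
    // Encode text as ASCII bytes; unmappable characters become '?'.
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // Decode ASCII bytes back into a String, one character per byte.
    static String fromAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute and \u20ac is the euro sign; neither exists in ASCII.
        String original = "caf\u00e9 \u20ac5";
        byte[] stored = toAscii(original);

        System.out.println(stored.length);     // 7, one byte per character
        System.out.println(fromAscii(stored)); // caf? ?5
    }
}
```

The result of fromAscii could then be passed to JTextArea.setText. Because the charset is named explicitly on both sides, the outcome does not depend on either machine's default.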
Is there another way?
Jim
 
Steve Deadsea
Ranch Hand
Posts: 125
That is pretty much correct.
Your first option is to use a common character set everywhere. I would suggest UTF-8, since it can encode any Unicode character, and the resulting bytes are identical to ASCII if only ASCII characters are used.
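A minimal sketch of the first option, assuming Java 7 or later. The class and method names (Utf8TextFile, save, load) are made up for illustration. The key idea is to open the reader and writer with an explicit charset instead of using FileReader/FileWriter, which fall back to the platform default on older Java versions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class: always reads and writes text files as UTF-8.
class Utf8TextFile {
    static void save(Path file, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text); // with Swing, textArea.write(out) could go here instead
        }
    }

    static String load(Path file) throws IOException {
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            // with Swing, textArea.read(in, null) could replace this loop
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int count;
            while ((count = in.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("message", ".txt");
        String original = "caf\u00e9 \u20ac5"; // non-ASCII characters survive in UTF-8
        save(file, original);
        System.out.println(load(file).equals(original)); // true
        Files.delete(file);
    }
}
```

Unlike the ASCII rule, nothing is replaced with "?", and a file containing only ASCII characters is byte-for-byte the same as it would be in ASCII.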
Your second option is to record the character set in the file metadata, perhaps in the file name. When you open the file, you would have to specify the correct character set based on the file name.
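One way the second option might look, as a rough sketch. The naming convention used here (the charset as the second-to-last dot-separated part, e.g. message.UTF-16.txt) is purely an assumption for illustration, as are the class and method names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads the charset from names like "message.UTF-16.txt".
class CharsetFromName {
    // Returns the charset named in the file name, or the fallback if the
    // name has no charset part or the part is not a supported charset.
    static Charset charsetFromFileName(String fileName, Charset fallback) {
        String[] parts = fileName.split("\\.");
        if (parts.length < 3) {
            return fallback; // no room for a charset label, e.g. "message.txt"
        }
        String label = parts[parts.length - 2];
        try {
            return Charset.forName(label);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8;
        System.out.println(charsetFromFileName("message.UTF-16.txt", fallback)); // UTF-16
        System.out.println(charsetFromFileName("message.txt", fallback));        // UTF-8
    }
}
```

The returned Charset could then be handed to Files.newBufferedReader or new InputStreamReader(stream, charset) when opening the file. A scheme like this breaks if someone renames the file, which is one reason a single agreed charset tends to be simpler.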
 