Character encoding question

Toby Eggitt
Ranch Hand
Posts: 53
Hi all,

I have a problem with character encoding. Here's my code:
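(The listing itself is pieced together here from the description that follows: an NIO read, a CharsetDecoder built from a named charset, and System.out.write() on each decoded char. The file name and buffer handling are illustrative guesses rather than a verbatim copy.)

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class ShortText {
    public static void main(String[] args) throws IOException {
        // "iso-8859-1" is the setting that "works"; "UTF-8" and
        // Charset.defaultCharset() are the ones that fail as described below.
        CharsetDecoder decoder = Charset.forName("iso-8859-1").newDecoder();

        try (FileChannel channel = new FileInputStream("shorttxt.txt").getChannel()) {
            ByteBuffer bytes = ByteBuffer.allocate((int) channel.size());
            channel.read(bytes);
            bytes.flip();

            CharBuffer cb = decoder.decode(bytes);
            while (cb.hasRemaining()) {
                System.out.write(cb.get());   // note: silently narrows each char to a byte
            }
            System.out.flush();
        }
    }
}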



There's also a text file attached (I hope!). It hasn't got the right name, as JavaRanch's security system forced me to change it (can't attach .txt? Really, why not, Paul?). This file is (or at least started out) in Ubuntu Linux default platform encoding. I think it's UTF-8. It includes a couple of Chinese characters.

So, here's my problem. As written, the program works perfectly. (Yeah, right, you wouldn't think I'd be complaining, would you?) But the thing is, it works with the CharsetDecoder in "iso-8859-1" mode, as above, but _fails_ if I select either "UTF-8" or defaultCharset().

This is puzzling to me. For one thing, I was under the impression that 8859-1 was essentially a fancy name for ASCII, and was a single-byte character set. So, how the heck could it contain the Chinese characters? Second, I was "fairly sure" that Linux was using UTF-8, so why would it fail with that selected? Even more, why would it fail with defaultCharset()?

It seems clear to me that there's something basic that I don't understand, but I don't quite know what it might be.

Any suggestions, Admiral?
Cheers,
Toby.




[Attachment: shorttxt.png (the text file, renamed)]
 
Paul Clapham
Sheriff
Posts: 22185
Presumably "System.out.write" mangles non-ASCII characters in a certain way in your environment, with your console configuration, and that business with ISO-8859-1 is designed to (or happens to) do the exact reverse of that mangling. In other words it's a hack to get around a conflict between the charset used by your console and the default charset used by System.out.

And ISO-8859-1? It's an extension of ASCII -- one of many -- which uses the range from 0 to 255 and fills it in with Latin characters. The range from 0 to 127 is the same as ASCII and the rest is "extended ASCII". Have a look at its Wikipedia page.
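A small illustration of that, using nothing beyond the standard library: decoding with ISO-8859-1 can never fail, because every byte value from 0 to 255 maps straight to the char with the same code point.

import java.nio.charset.Charset;

public class Latin1Mapping {
    public static void main(String[] args) {
        byte[] everyByteValue = new byte[256];
        for (int i = 0; i < 256; i++) {
            everyByteValue[i] = (byte) i;
        }
        // ISO-8859-1 decodes each byte to the char with the identical value,
        // so no input is ever "malformed" -- unlike UTF-8, which rejects
        // byte sequences that don't follow its multi-byte rules.
        String decoded = new String(everyByteValue, Charset.forName("ISO-8859-1"));
        for (int i = 0; i < 256; i++) {
            if (decoded.charAt(i) != (char) i) {
                throw new AssertionError("mapping is not one-to-one at " + i);
            }
        }
        System.out.println("bytes 0-255 decode to chars U+0000 through U+00FF");
    }
}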
 
Toby Eggitt
Ranch Hand
Posts: 53
Well, I wondered about that, but I get the same output, in success and failure modes, both in Eclipse and in the console. Further, I can edit the original file in Eclipse, and it shows up and edits correctly in that editor, and I can edit it in gedit too. That suggests that those tools are recognizing the multi-byte characters as such (otherwise wouldn't I need to cursor-right twice to skip over the character, for example?). Linux reports the file type as "UTF-8 Unicode text", so it looks like everything except Java is behaving both as expected and consistently across multiple different applications.

Does my logic seem sound? If so, I'm still puzzled...

Out of interest, did you try the code and get different results on another platform?

Cheers,
Toby
 
Paul Clapham
Sheriff
Posts: 22185
No, I didn't try the code because I would have had to put together an input file to test it with. And besides I don't know how your console is configured. Not to mention I don't have a Linux machine to try it on.

But I've seen this hack elsewhere. I remember seeing a post from somebody who had configured their Oracle database to use a Latin charset, and they were putting Chinese characters into it with code very much like that (only without NIO); then they tried to access the database a different way and the hack stopped working.

I can see how the input half works: it takes a Chinese character, which requires 3 bytes when encoded in UTF-8, and just converts it straight across to 3 chars via ISO-8859-1. As for the output half: "cb.get()" returns a char, but System.out.write() treats that char as a byte, so that's just a reversal of the input half. So my question is, why all the hacking about with charsets and encoding and decoding? Why not just read the input file as a stream of bytes and use System.out.write to write those bytes to the console? Eliminate the charsets entirely and never use a char variable.
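Something along these lines, in other words: a byte-for-byte copy with no charset involved at all (the file name here is just for illustration).

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawCopy {
    public static void main(String[] args) throws IOException {
        // Copy the file to standard out byte for byte; the console's own
        // charset (UTF-8 on a typical Ubuntu setup) interprets the bytes.
        try (InputStream in = new FileInputStream("shorttxt.txt")) {
            int b;
            while ((b = in.read()) != -1) {
                System.out.write(b);
            }
            System.out.flush();
        }
    }
}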
 
Toby Eggitt
Ranch Hand
Posts: 53
Hmm, I hope that didn't come over as grumpy: I wasn't expecting anyone to run it, though I did attach the input file (just with the name changed to look like a png, because of the curious security thing the forum has that doesn't allow me to attach a text file). No problem though, I was only wondering if you had experienced success on a particular platform.

The reason for doing all that mucking about with CharsetDecoders is to learn how to use them properly so I can do so when I have a real need ;)

However, I think you nailed the problem. It was "working" only because the single-byte charset iso-8859-1 didn't bugger anything up but just passed through three strange characters in a row. This seems to work better:
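(A sketch of one plausible "works better" version, not a verbatim copy of it: decode as UTF-8 and hand the chars to a Writer that re-encodes them for the console, instead of narrowing each char to a byte.)

import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class ShortTextFixed {
    public static void main(String[] args) throws IOException {
        // Decode with the charset the file is actually in...
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();

        try (FileChannel channel = new FileInputStream("shorttxt.txt").getChannel()) {
            ByteBuffer bytes = ByteBuffer.allocate((int) channel.size());
            channel.read(bytes);
            bytes.flip();

            CharBuffer cb = decoder.decode(bytes);

            // ...and write the chars through a Writer that re-encodes them,
            // rather than chopping each char down to a single byte.
            Writer console = new OutputStreamWriter(System.out, "UTF-8");
            console.write(cb.toString());
            console.flush();
        }
    }
}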


Thanks for the help, much appreciated.
Cheers,
Toby.
 