If it were just the command prompt that had a problem, I'd say it's Windows and OSX that are at fault, but since it fails even when I try to display it using the AWT I'm thinking I need to change something within Java. Do I need an international edition? If so, where would I download it (Google didn't turn one up)? Or do I just need to import some package to make Unicode magically work? Or is Scanner not the way to read in Unicode non-Roman characters?
Platform is Windows XP SP3. I'd prefer if it worked on OSX as well, but I'd settle for just Windows. The Cyrillic text is displayed correctly in Notepad, and I have chosen "Unicode" encoding (other options are Unicode big endian and UTF-8 - UTF-8 causes an exception and big endian results in slightly different gobbledygook).
Thanks - never thought such a seemingly simple task would be so difficult!
Terminals and consoles are not good testbeds for Unicode text, since most of them only support the ISO-8859-1 character range.
How are you constructing the Scanner? You need to tell it which encoding the text is in; there are several Scanner constructors that take an encoding as an additional parameter.
Lastly, you can check which characters have been read by iterating through the resulting String, and printing out the Unicode values by calling String.codePointAt(int).
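A minimal sketch of that check, assuming the text has already been read into a String. The class and method names (CodePointDump, codePoints) are made up for illustration; the sample word is written with \u escapes so the encoding of the source file itself doesn't get in the way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the Unicode code points held in a String.
final class CodePointDump {

    /** Returns every code point of text formatted as U+XXXX. */
    static List<String> codePoints(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(String.format("U+%04X", cp));
            // Step by 1 char normally, by 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // The Russian word "privet" spelled with Unicode escapes.
        System.out.println(codePoints("\u041f\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

If the values printed for a Russian word sit in the 04xx range, the String was decoded correctly and any remaining garbage is a display problem rather than a reading problem.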
I suggest you scan a little of the file, split the text into chars with the String#toCharArray method, and print each char using a %x tag so it comes out in hex.
Then compare the values with the Unicode values for Cyrillic letters (I think they will run from 0410 to 044f); if they are correct then you can presume that Java is reading the text correctly.
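Roughly what that comparison could look like, as a sketch only. CyrillicCheck and its methods are hypothetical names; the 0410 to 044f range is the one quoted above. Note that %x does not accept a char directly, so each char is cast to int first.

```java
// Hypothetical helper: hex-dumps a String and checks chars against 0x0410..0x044f.
final class CyrillicCheck {

    /** Space-separated hex value of every char, produced with the %x conversion. */
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // %x rejects a Character argument, hence the cast to int.
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    /** True if c lies in 0x0410..0x044f (the letters yo, U+0401 and U+0451, fall outside). */
    static boolean isBasicRussianLetter(char c) {
        return c >= 0x0410 && c <= 0x044f;
    }

    public static void main(String[] args) {
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(toHex(sample));
        for (char c : sample.toCharArray()) {
            System.out.printf("%04x in range: %b%n", (int) c, isBasicRussianLetter(c));
        }
    }
}
```

Values such as d0 or ff scattered through the dump would suggest the bytes were decoded with the wrong charset.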
The bit about (i & 7) == 0 ? '\n' : '\0' inserts a newline every 8 places. You can see it works nicely on a Linux console.
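The listing being described is not shown here, so the following is only a guess at its general shape, not the original code. It walks the code points 0x0410 to 0x044f and breaks the line after every eighth character; it tests (i & 7) == 7 and pads with a space rather than using the '\0' expression quoted above. The class name RussianPrinterSketch is hypothetical.

```java
// Sketch only: prints the 64 code points 0x0410..0x044f, eight per line.
final class RussianPrinterSketch {

    /** Builds the table as a String so it can be shown on a console or in a Swing component. */
    static String table() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0x0410; i <= 0x044f; i++) {
            sb.append((char) i);
            // (i & 7) == 7 is true for the last of each aligned group of 8 values.
            sb.append((i & 7) == 7 ? '\n' : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Whether this displays as Cyrillic depends on the console's own encoding.
        System.out.print(table());
    }
}
```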
Try with an IDE like Eclipse or NetBeans which are written in Java and ought to support Unicode for their displays.
campbell@linux-pgix:~/java> java RussianPrinter
Change 0x043f to 0x044f
[ September 25, 2008: Message edited by: Campbell Ritchie ]
I'm guessing I need this one:
public Scanner(InputStream source,
but I'm not sure which charset to use. The online documentation lists these:
US-ASCII    Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1  ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8       Eight-bit UCS Transformation Format
UTF-16BE    Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE    Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16      Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
Would I use UTF-16? And create scanner like this:
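A sketch of what such a call could look like, using the Scanner(InputStream, String) constructor with "UTF-16" as the charset name. The ScannerCharsetDemo class and firstLine method are hypothetical, and the input is built in memory here so the example runs without any file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo: reads one line through a Scanner that is told the charset.
final class ScannerCharsetDemo {

    /** Reads the first line of the stream, decoding its bytes with charsetName. */
    static String firstLine(InputStream in, String charsetName) {
        // Throws IllegalArgumentException if the charset name is not known.
        try (Scanner sc = new Scanner(in, charsetName)) {
            return sc.hasNextLine() ? sc.nextLine() : "";
        }
    }

    public static void main(String[] args) {
        // "privet" plus a newline, encoded as UTF-16 with a byte-order mark.
        byte[] bytes = "\u041f\u0440\u0438\u0432\u0435\u0442\n".getBytes(StandardCharsets.UTF_16);
        String line = firstLine(new ByteArrayInputStream(bytes), "UTF-16");
        System.out.println(line.length() + " chars, first is U+"
                + String.format("%04X", (int) line.charAt(0)));
    }
}
```

For a real file the stream would come from something like new FileInputStream("russian.txt"), where the file name is just a placeholder. With "UTF-16" the decoder uses the byte-order mark, if present, to pick the byte order.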
I actually tried NetBeans and got gobbledygook there as well, which rather surprised me.
It turned out there were two problems. One was that the NetBeans console, it seems, really does not support Unicode - javax.swing does, however, so I'll just need to brush up my Swing skills.
The other was the encoding. With %x I was able to print the hex values well enough, but they came out in the ASCII range and were thus far too low for Cyrillic. I couldn't find the byte-order mark, and a variety of charsets were failing, so I eventually decided maybe Notepad's Unicode was subpar. Sure enough, when I saved my file in Microsoft Word, things started working almost right away. Both the UTF-8 and Windows Cyrillic charsets worked perfectly in Java once I was editing the file in Word.
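For anyone else hunting for the byte-order mark, a small sketch that inspects the first raw bytes of a file before any decoding happens. BomSniffer and charsetFromBom are hypothetical names, and only the UTF-8 and UTF-16 marks are handled; a UTF-32 little-endian mark, which also starts with FF FE, would be misreported.

```java
// Hypothetical helper: guesses a charset name from a leading byte-order mark.
final class BomSniffer {

    /** Returns the charset implied by the BOM at the start of head, or null if there is none. */
    static String charsetFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no recognised mark; the encoding has to be known some other way
    }

    public static void main(String[] args) {
        // FF FE followed by U+041F in little-endian byte order.
        byte[] sample = { (byte) 0xFF, (byte) 0xFE, 0x1F, 0x04 };
        System.out.println(charsetFromBom(sample));
    }
}
```

The bytes would normally come from reading the first few bytes of the file through a plain FileInputStream, not through a Reader or Scanner, since those decode the bytes before you get to see them.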
So it looks like as long as I use Word as my text editor I should be okay from here on out - thanks for the help!