getBytes() with UTF-8 and UTF-16

 
Aleksey Movchan
Ranch Hand
Hello.

Could anyone please explain why the length of "A".getBytes("UTF-8") displays "1" (no BOM), but the length of "A".getBytes("UTF-16") displays "4" (BOM FEFF + 0041, I suppose)?
 
Stephan van Hulst
Saloon Keeper
In both UTF-8 and UTF-16, the character 'A' is encoded using 1 code unit. UTF-8 uses 1 byte per code unit, and UTF-16 uses 2 bytes per code unit. The "UTF-16" encoding also adds a two-byte BOM to the start, so you get 2 + 2 = 4 bytes.

If you don't want the BOM, you have to use either "UTF-16LE" or "UTF-16BE".
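
A minimal sketch of the byte counts described above. The class and method names (BomDemo, encodedLength) are made up for illustration; the charsets are the standard ones from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Number of bytes produced when the text is encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String a = "A";
        System.out.println(encodedLength(a, StandardCharsets.UTF_8));    // 1: 41
        System.out.println(encodedLength(a, StandardCharsets.UTF_16));   // 4: FE FF (BOM) + 00 41
        System.out.println(encodedLength(a, StandardCharsets.UTF_16BE)); // 2: 00 41
        System.out.println(encodedLength(a, StandardCharsets.UTF_16LE)); // 2: 41 00
    }
}
```

Passing a Charset instead of a charset name also avoids the checked UnsupportedEncodingException that getBytes(String) declares.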
 
Campbell Ritchie
Marshal
...and welcome to the Ranch

This discussion is too difficult for the “Beginning” forum, so I shall move it.
 
Aleksey Movchan
Ranch Hand
Thank you

But why, then, do some other symbols need so many bytes?

For the Cyrillic letter "Ч" encoded in UTF-8, the result is "5".
 
Stephan van Hulst
Saloon Keeper
It doesn't. The Cyrillic letter Che uses 2 code units in UTF-8.

UTF-8 uses a variable number of bytes to encode characters. If it used only one byte per character, it could map only 256 characters. Instead, it uses a scheme that needs more bytes for higher code points, which tend to be less widely used characters. For instance, ASCII characters take one byte, Cyrillic letters take two bytes, and musical symbols take four bytes.
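
Here is a small sketch that checks those byte counts. The class and method names are hypothetical, and the euro sign is an extra three-byte example that was not mentioned above. The characters are written as Unicode escapes so that the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));                  // 1: ASCII letter A
        System.out.println(utf8Length("\u0427"));             // 2: Cyrillic letter Che
        System.out.println(utf8Length("\u20AC"));             // 3: euro sign
        System.out.println(utf8Length("\uD834\uDD1E"));       // 4: musical symbol G clef (U+1D11E)
        System.out.println(utf8Length("\u0427\u0427\u0427")); // 6: three Cyrillic letters
    }
}
```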
 
Aleksey Movchan
Ranch Hand
Stephan van Hulst wrote: It doesn't. The Cyrillic letter Che uses 2 code units in UTF-8.


Well, it does on my PC.
And the strange thing is that sometimes it uses 5 bytes and sometimes 4 (13 or 12 bytes for "ЧЧЧ"). I'm quite sure the results were different the last time I compiled it.

 
Aleksey Movchan
Ranch Hand
I found the problem: Java doesn't recognize my Cyrillic letters.
Does anyone know how to fix that?

 
Aleksey Movchan
Ranch Hand
Oh god, I fixed it. Kinda proud of myself

It works when I compile with "javac -encoding UTF-8 Test.java", but the main problem was that my Windows 7 didn't recognize Cyrillic symbols in text files.

Control Panel -> Region and Language -> Administrative -> Language for programs that do not support Unicode -> Russian

 
Campbell Ritchie
Marshal
Aleksey Movchan wrote: . . . my Windows 7 didn't recognize cyrillic symbols in text files. . . .
A common problem with the Windows® command line: it only supports a very restricted range of characters, which is not the same as this Unicode page plus ASCII.
Try the following instead of System.out.println:
JOptionPane.showMessageDialog(null, "АБВГДЕ");
You have to import the option pane class: import javax.swing.JOptionPane;
 
Stephan van Hulst
Saloon Keeper
Great work, have a cow!

After fixing it, does the program return 2 bytes for Cyrillic characters in UTF-8?

That it returns 4 is strange and shouldn't happen, but that it returns 5 should be downright impossible because the character ranges that require 5 and 6 bytes in UTF-8 are simply not defined.
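
As a rough check of that claim, the sketch below encodes every Unicode code point on its own and reports the longest UTF-8 result. The class and method names are made up for illustration. Surrogate code points are skipped because they are not characters by themselves.

```java
import java.nio.charset.StandardCharsets;

class Utf8MaxLength {
    // Longest UTF-8 encoding found over all Unicode code points (surrogates skipped).
    static int maxUtf8Length() {
        int max = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates cannot be encoded as characters
            }
            String s = new String(Character.toChars(cp));
            max = Math.max(max, s.getBytes(StandardCharsets.UTF_8).length);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxUtf8Length()); // 4
    }
}
```

So a count of 5 for what looks like one letter means the string actually holds more than one character.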
 
Aleksey Movchan
Ranch Hand
Thanks a lot!

I changed my "Region and Language" settings back to show you how it works on my computer with and without the "-encoding UTF-8" flag.
Somehow it uses 5 bytes for the letter "Ж".
 
Stephan van Hulst
Saloon Keeper
Because while the "Ж" string may appear to contain one character in your IDE, the Java compiler may interpret the source file using a different encoding, and the string may actually appear as a bunch of garbage characters that add up to 5 bytes when you encode it back to UTF-8.

You have 3 different encodings to consider here: The encoding your IDE uses to display the character on your screen, the encoding the compiler uses to interpret the source file, and the encoding that you tell String.getBytes() to use. If any of these don't match, you're going to end up with surprising results.

Try this: print "Ж".length() with and without specifying the source file encoding to your compiler, and see if the program really reports a string length of 1 character.
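
The mismatch can be simulated without touching compiler settings. The sketch below takes the UTF-8 bytes of "Ж" (what the editor saved) and decodes them with a single-byte Windows code page (what the compiler might assume). The choice of windows-1252 is only an assumption for illustration, since the code page on the original poster's machine is not known. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceEncodingMismatch {
    // Simulates a compiler reading UTF-8 source bytes with a different charset.
    static String misread(String literal, Charset assumedByCompiler) {
        byte[] bytesOnDisk = literal.getBytes(StandardCharsets.UTF_8); // what the editor saved
        return new String(bytesOnDisk, assumedByCompiler);             // what the compiler sees
    }

    public static void main(String[] args) {
        String zhe = "\u0416"; // Cyrillic letter Zhe, escaped so this file's own encoding is irrelevant
        Charset assumed = Charset.forName("windows-1252"); // assumption, see the note above
        String garbled = misread(zhe, assumed);

        System.out.println(zhe.length());                                    // 1
        System.out.println(zhe.getBytes(StandardCharsets.UTF_8).length);     // 2: D0 96
        System.out.println(garbled.length());                                // 2: D0 and 96 read as two characters
        System.out.println(garbled.getBytes(StandardCharsets.UTF_8).length); // 5: 2 bytes + 3 bytes
    }
}
```

Under that assumption, byte D0 is read as a Latin letter (2 bytes in UTF-8) and byte 96 as an en dash (3 bytes in UTF-8), which adds up to the 5 bytes reported.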
 
Aleksey Movchan
Ranch Hand
Yeah, without the -encoding flag, the length of any Cyrillic letter is indeed 2 characters.
 
nikos vaggalis
Greenhorn
Aleksey Movchan wrote:Oh god, I fixed it. Kinda proud of myself

It's working when I use "javac -encoding UTF-8 Test.java", but the main problem was that my Windows 7 didn't recognize cyrillic symbols in text files.

Control Panel -> Region and Language -> Administrative -> Language for programs that do not support Unicode -> Russian



The problem is with the cmd console. I've written an extensive article on its quirks; it is targeted at Perl, but the console concepts are the same.
See in particular the 'Console Input and Output' section at the bottom of the linked page:
http://www.i-programmer.info/programming/other-languages/1973-unicode-issues-in-perl.html?start=1

"...we have to set the console to the correct codepage as well by using Win32::Console::OutputCP( 65001 ) and enable Unicode support by switching Win32::OLE to the UTF8 codepage (CP => Win32::OLE::CP_UTF8())."