getBytes() with UTF-8 and UTF-16

 
Ranch Hand
Hello.

Could anyone explain please why



displays "1" (not using BOM), but



displays "4" (using BOM FEFF + 0041  I suppose).
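The two snippets referred to in this post did not survive in this copy of the thread. A plausible reconstruction, assuming the string was "A" and the calls were getBytes() with the two charset names from the title (the class name ByteCount and the helper byteCount are made up for illustration):

```java
import java.nio.charset.Charset;

public class ByteCount {
    // Number of bytes produced when text is encoded with the named charset.
    static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        System.out.println(byteCount("A", "UTF-8"));  // 1
        System.out.println(byteCount("A", "UTF-16")); // 4
    }
}
```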
 
Saloon Keeper
In both UTF-8 and UTF-16, the character 'A' is encoded using 1 code unit. UTF-8 uses 1 byte per code unit, and UTF-16 uses 2 bytes per code unit. Java's "UTF-16" charset also prepends a two-byte BOM (byte order mark) when encoding.

If you don't want the BOM, you have to use either "UTF-16LE" or "UTF-16BE".
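To see where the extra two bytes come from, it may help to dump the encoded bytes themselves. This is only a sketch; the class name Utf16Bom and the hex helper are hypothetical names introduced here:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Bom {
    // Formats bytes as space-separated upper-case hex, e.g. "FE FF 00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" writes a big-endian BOM (FE FF) before the code unit for 'A'.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // The explicit-endianness variants write no BOM.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```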
 
Marshal
...and welcome to the Ranch

This discussion is too difficult for the “Beginning” forum, so I shall move it.
 
Oleksii Movchan
Ranch Hand
Thank you

But why do some other symbols need so many bytes?


The result is "5" in this case.
 
Stephan van Hulst
Saloon Keeper
It doesn't. The Cyrillic letter Che uses 2 code units in UTF-8.

UTF-8 uses a variable number of bytes to encode characters. If it only used one byte per character, it would only be able to map 256 characters. Instead, it uses a scheme that spends more bytes on characters with higher code points, which tend to be less widely used. For instance, ASCII characters are represented by one byte, Cyrillic letters require two bytes, and musical symbols require four bytes.
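The byte counts above can be checked with a short program. This is a sketch; the class name Utf8Widths and the helper utf8Length are hypothetical, and the non-ASCII characters are built from code points so the result does not depend on how the source file is encoded:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    // Number of bytes text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String latinA = "A";                                   // U+0041
        String che = new String(Character.toChars(0x0427));    // Cyrillic capital letter Che
        String gClef = new String(Character.toChars(0x1D11E)); // musical symbol G clef

        System.out.println(utf8Length(latinA)); // 1
        System.out.println(utf8Length(che));    // 2
        System.out.println(utf8Length(gClef));  // 4
    }
}
```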
 
Oleksii Movchan
Ranch Hand

Stephan van Hulst wrote: It doesn't. The Cyrillic letter Che uses 2 code units in UTF-8.



Well, it does on my PC.
And the strange thing is that sometimes it uses 5 bytes and sometimes 4 (13 or 12 bytes for "ЧЧЧ"). I'm quite sure the results were different the last time I compiled it.

 
Oleksii Movchan
Ranch Hand
I found the problem: Java doesn't recognize my Cyrillic letters.
Does anyone know how to fix that?

 
Oleksii Movchan
Ranch Hand
Oh god, I fixed it. Kinda proud of myself

It works when I compile with "javac -encoding UTF-8 Test.java", but the main problem was that my Windows 7 didn't recognize Cyrillic symbols in text files. This setting fixed that:

Control Panel -> Region and Language -> Administrative -> Language for programs that do not support Unicode -> Russian

 
Campbell Ritchie
Marshal

Aleksey Movchan wrote:. . . my Windows 7 didn't recognize cyrillic symbols in text files. . . .

A common problem with the Windows® command line; it only supports a very restricted range of characters, not the same as this Unicode page plus ASCII.
Try the following instead of System.out.println:-
JOptionPane.showMessageDialog(null, "АБВГДЕ");
You have to import JOptionPane: import javax.swing.JOptionPane;
 
Stephan van Hulst
Saloon Keeper
Great work, have a cow!

After fixing it, does the program return 2 bytes for Cyrillic characters in UTF-8?

That it returns 4 is strange and shouldn't happen, but that it returns 5 should be downright impossible because the character ranges that require 5 and 6 bytes in UTF-8 are simply not defined.
 
Oleksii Movchan
Ranch Hand
Thanks a lot!

I changed my "Region and Language" settings back to show you how it works on my computer with and without the "-encoding UTF-8" flag.
Somehow it uses 5 bytes for the letter "Ж":

 
Stephan van Hulst
Saloon Keeper
Because while the "Ж" string may appear to contain one character in your IDE, the Java compiler may interpret the source file using a different encoding, and the string may actually appear as a bunch of garbage characters that add up to 5 bytes when you encode it back to UTF-8.

You have 3 different encodings to consider here: The encoding your IDE uses to display the character on your screen, the encoding the compiler uses to interpret the source file, and the encoding that you tell String.getBytes() to use. If any of these don't match, you're going to end up with surprising results.

Try this: print "Ж".length() with and without specifying the source file encoding to your compiler, and see if the program really reports a string length of 1 character.
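The mismatch can also be simulated without touching any compiler settings. The sketch below assumes the source file was saved as UTF-8 and that the compiler's default encoding was windows-1252; that charset is only a guess at the poster's setup, and the class and method names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingMismatch {
    // Encodes text as UTF-8 (what the editor saved) and decodes those bytes with
    // wrongCharset (what a compiler using another default encoding would read).
    static String misread(String text, Charset wrongCharset) {
        byte[] bytesOnDisk = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytesOnDisk, wrongCharset);
    }

    public static void main(String[] args) {
        // The letter Zhe, built from its code point so this file's own encoding is irrelevant.
        String zhe = new String(Character.toChars(0x0416));
        // Assumption: the compiler fell back to windows-1252.
        Charset assumedDefault = Charset.forName("windows-1252");

        String garbled = misread(zhe, assumedDefault);
        System.out.println(zhe.length());     // 1
        System.out.println(garbled.length()); // 2
        System.out.println(garbled.getBytes(StandardCharsets.UTF_8).length); // 5
    }
}
```

Under that assumption, the two UTF-8 bytes of the letter (D0 96) are read as two separate characters, which need 2 and 3 bytes respectively when encoded back to UTF-8, giving the 5 bytes reported above.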
 
Oleksii Movchan
Ranch Hand
Yeah, without the -encoding flag the length of any Cyrillic letter is indeed 2 characters.
 
Greenhorn

Aleksey Movchan wrote: Oh god, I fixed it. Kinda proud of myself

It's working when I use "javac -encoding UTF-8 Test.java", but the main problem was that my Windows 7 didn't recognize Cyrillic symbols in text files.

Control Panel -> Region and Language -> Administrative -> Language for programs that do not support Unicode -> Russian



The problem is with the cmd console. I've written an extensive article on its quirks. It is aimed at Perl, but the console concepts are the same.
Look in particular at the 'Console Input and Output' section at the bottom of the linked page:
http://www.i-programmer.info/programming/other-languages/1973-unicode-issues-in-perl.html?start=1

"...we have to set the console to the correct codepage as well by using Win32::Console::OutputCP( 65001 ) and enable Unicode support by switching Win32::OLE to the UTF8 codepage (CP => Win32::OLE::CP_UTF8())."
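A rough Java counterpart, offered only as a sketch, is to wrap System.out in a PrintStream with an explicit encoding and to switch the console itself to code page 65001 (chcp 65001) before running the program. The class name Utf8Console and the helper utf8Stream are hypothetical, and whether cmd then renders the letters also depends on the console font and Windows version, which this does not address:

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wraps a stream so printed text is encoded as UTF-8 rather than the platform default.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 must be supported by every Java implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream console = utf8Stream(System.out);
        // The first three Cyrillic capital letters, built from code points.
        console.println(new String(new int[] {0x0410, 0x0411, 0x0412}, 0, 3));
    }
}
```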
 