
Unable to correctly read UTF-8

 
Richard Hayward
Ranch Hand
To experiment with UTF-8, I have a file 'testfile.utf8' consisting of the hex bytes:
41 C2 A3 E0 A4 85 F0 90 84 B7



The file contains four characters, encoded in 1 byte, 2 bytes, 3 bytes, and 4 bytes respectively.

hex          code point  character
41           U+0041      LATIN CAPITAL LETTER A
C2 A3        U+00A3      POUND SIGN
E0 A4 85     U+0905      DEVANAGARI LETTER A
F0 90 84 B7  U+10137     AEGEAN WEIGHT BASE UNIT


I found this UTF-8 table page handy for checking these.

To read the file and display its characters, I wrote the following:
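Roughly along these lines (a sketch; the original snippet may have differed, but any loop over Reader.read() shows the same behaviour):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8Read {
    public static void main(String[] args) throws Exception {
        try (Reader in = new InputStreamReader(
                new FileInputStream("testfile.utf8"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                // read() returns one value per call
                System.out.printf("%c  %04x%n", (char) c, c);
            }
        }
    }
}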


The first three characters are displayed correctly, but the fourth character, which should occupy 4 bytes, isn't being read as a single character; it seems to be mistakenly read as two.

Could anyone tell me what I'm doing wrong?
 
Carey Brown
Saloon Keeper
Your hex print could have been simplified to
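Presumably something like this one-liner (a sketch; 'c' stands for the char value just read, and %04x is the formatting conversion referred to in the reply below):

System.out.printf("%04x%n", (int) c); // prints the char as 4 hex digits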


Have you tried reading the 10 bytes of your file as bytes and printing them out to see if they are what you think? Or using a hexdump utility?
 
Richard Hayward
Ranch Hand
Carey Brown wrote:Your hex print could have been simplified to


Thanks, that makes the code simpler. I was unaware of that formatting conversion.
Carey Brown wrote:
Have you tried reading the 10 bytes of your file as bytes and printing them out to see if they are what you think? Or using a hexdump utility?

Yes, my first screenshot showed the output from the Linux xxd command.
Plus, I was working with the file in a hex editor.

 
Carey Brown
Saloon Keeper
A partial clue: the four hex values you got back (d800 dd37) are the same as the character's UTF-16 encoding. Curious: have you tried reading the file in as UTF-16, just to see?
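That pairing is easy to check with the standard Character class (a sketch):

// split U+10137 into its UTF-16 code units
char[] units = Character.toChars(0x10137);
for (char u : units) {
    System.out.printf("%04x%n", (int) u); // prints d800, then dd37
}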
 
Norm Radder
Rancher
How would the last character, held in 4 bytes and mapping to 21 bits, fit in a single Unicode char?

Did you map the bits for the first three characters? Did they map correctly?
I used the mapping from: https://en.wikipedia.org/wiki/UTF-8
 
Richard Hayward
Ranch Hand
Norm Radder wrote:How would the last character, held in 4 bytes and mapping to 21 bits, fit in a single Unicode char?

From Wikipedia, Unicode characters with code points in the range U+10000 to U+10FFFF are held in 4 bytes:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The last 4 bytes in my file are, in hex and in binary, aligned against that pattern:

F0       90       84       B7
11110000 10010000 10000100 10110111
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So, the 21 bits marked x correspond to
000 010000 000100 110111 = 10137 (hex)

It's U+10137 that I was expecting to read from the file, in those 4 bytes.

Or is that not what you were asking?

Actually, the last code point for 4-byte UTF-8 characters is given at Wikipedia as U+10FFFF. A tutorial on YouTube gives the last code point as U+1FFFFF. Not sure yet which is correct, but I don't think that matter has a bearing on my problem.
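For what it's worth, the same extraction done in code (a sketch following the bit layout above):

// decode the 4-byte UTF-8 sequence F0 90 84 B7 by hand
int b0 = 0xF0, b1 = 0x90, b2 = 0x84, b3 = 0xB7;
int cp = ((b0 & 0x07) << 18)   // 3 payload bits from the lead byte
       | ((b1 & 0x3F) << 12)   // 6 payload bits from each continuation byte
       | ((b2 & 0x3F) << 6)
       |  (b3 & 0x3F);
System.out.printf("U+%04X%n", cp); // U+10137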
 
Norm Radder
Rancher
A Unicode character holds 16 bits. How would the 21 bits from the 4 bytes be placed in the 16-bit char?

00001 0000 0001 0011 0111 = 1 0137 (hex)

How is the leading 1 held? How does Unicode specify that two char values are needed to hold the one character that came from the 4 bytes of UTF-8?
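For reference, the arithmetic UTF-16 uses to split a supplementary code point into two char values (a sketch; the constants come from the UTF-16 definition):

int cp = 0x10137;
int v = cp - 0x10000;                      // supplementary offset, fits in 20 bits
char high = (char) (0xD800 + (v >> 10));   // top 10 bits    -> d800
char low  = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits -> dd37
System.out.printf("%04x %04x%n", (int) high, (int) low);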
 
Richard Hayward
Ranch Hand
Norm Radder wrote:A Unicode character holds 16 bits.

I don't think that's true in the case of UTF-8, which is a variable-length encoding.
The letter A (code point U+0041), for example, needs only a single byte.

 
Norm Radder
Rancher
I think Unicode chars use 16 bits (2 bytes). What happens when the character requires more bits, like the 4-byte UTF-8 character?

Read the API doc for the Character class.
 
Richard Hayward
Ranch Hand
Norm Radder wrote:What happens when the character requires more bits, like the 4-byte UTF-8 character?

The leading 11110 bits of the first byte indicate that the character is going to use 4 bytes.
The leading 10 bits of each of the following 3 bytes indicate that it's a continuation byte.

Hence the notation
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

I think such a scheme can continue up to a length of 6 bytes.
youtube tutorial
 
Richard Hayward
Ranch Hand
Norm Radder wrote:I think Unicode chars use 16 bits (2 bytes).


Ah, the Java char datatype is 16 bits!

I get it.
Norm & Carey, thanks to both of you for your help!
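For the record, one way to get whole code points instead of 16-bit chars (a sketch, assuming Java 8 or later):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8CodePoints {
    public static void main(String[] args) throws Exception {
        String s = new String(
                Files.readAllBytes(Paths.get("testfile.utf8")),
                StandardCharsets.UTF_8);
        // codePoints() combines surrogate pairs into single int values
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}

This prints U+0041, U+00A3, U+0905, and U+10137, one code point per line.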
 
Campbell Ritchie
Marshal
Richard Hayward wrote:. . . the Java char datatype is 16 bits! . . .
I believe that Java® Strings default to an encoding called UTF-16. Not certain, however.
 
Stephan van Hulst
Saloon Keeper
Yes, chars in Java are ALWAYS UTF-16 Big Endian.

To output Java Strings with a specific encoding, you need to use a writer that's configured to use that encoding.
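For example (a sketch; the output file name here is made up):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8Write {
    public static void main(String[] args) throws Exception {
        // wrap the stream in a writer that encodes chars as UTF-8
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("out.utf8"), StandardCharsets.UTF_8)) {
            out.write("A\u00A3\u0905\uD800\uDD37"); // the four test characters
        }
    }
}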