Confused over Unicode

 
Peter Chase
Ranch Hand
Posts: 1970
I have got confused about Unicode, UTF-16 encoding, Java Strings and Java chars. I thought my code was wrong, then I thought not, then I was not sure.

The following code is supposed to convert any Java String into UTF-16, where (for reasons specific to my project) each 16-bit value is a Java short, not the more usual Java char.

 
John Dell'Oso
Ranch Hand
Posts: 130
Peter,

I have changed your method to the following (sorry, the code is a little messy and I got a little lazy with catching the UnsupportedEncodingException that can be thrown by the getBytes method, so I'm just throwing the Exception - but hopefully you get the picture):



For example, if you pass the string "ABCD", the method will return the following array of shorts:
-2
-1
0
65
0
66
0
67
0
68

The -2 and -1 values are the bytes 0xFE and 0xFF (Java bytes are signed) of the 0xFEFF byte order mark, which indicates big-endian UTF-16. If you don't want the byte order mark, then change the encoding in the getBytes method to "UTF-16BE".
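As a rough illustration of the byte-wise behaviour described above, here is a minimal sketch. The class and method names (`ByteWiseUtf16`, `bytesAsShorts`) are hypothetical, and it uses `StandardCharsets` (Java 7+) rather than the encoding-name form of `getBytes`, so no `UnsupportedEncodingException` needs handling. It is not the method from this thread, only an example of the same idea.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteWiseUtf16 {
    // Hypothetical helper: encodes the string, then widens each byte to a short.
    static short[] bytesAsShorts(String s, Charset charset) {
        byte[] bytes = s.getBytes(charset);
        short[] out = new short[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i]; // sign-extends, so 0xFE becomes -2 and 0xFF becomes -1
        }
        return out;
    }

    public static void main(String[] args) {
        // UTF-16 encodes big-endian and prepends the byte order mark.
        System.out.println(Arrays.toString(bytesAsShorts("ABCD", StandardCharsets.UTF_16)));
        // prints [-2, -1, 0, 65, 0, 66, 0, 67, 0, 68]

        // UTF-16BE encodes big-endian without a byte order mark.
        System.out.println(Arrays.toString(bytesAsShorts("ABCD", StandardCharsets.UTF_16BE)));
        // prints [0, 65, 0, 66, 0, 67, 0, 68]
    }
}
```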

Is this the sort of thing you were looking for?

Regards,
JD
 
Peter Chase
Ranch Hand
Posts: 1970
Thanks for replying, but I'm fairly sure that's not what I need.

Your method converts to bytes then puts each byte into a short, doesn't it? So you have two shorts (one per byte) for each 16-bit UTF-16 code unit.

What I want is an array of shorts, where each 16-bit short represents one 16-bit UTF-16 code unit.

My code achieves that in all the cases I've actually seen, but I am wondering about cases where UTF-16 does not translate a single Unicode character into exactly one Java char (that is, characters outside the Basic Multilingual Plane, which need a surrogate pair).
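To make the worrying case concrete, here is a small sketch with a character that needs two Java chars. The code point U+1D11E (MUSICAL SYMBOL G CLEF) is just an example choice, and the class and method names (`SurrogateCheck`, `hasSurrogates`) are hypothetical.

```java
final class SurrogateCheck {
    // Hypothetical helper: true if any char in the string is half of a surrogate pair,
    // i.e. the string holds at least one character that does not fit in a single char.
    static boolean hasSurrogates(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Build a one-character string from a code point outside the BMP.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                             // 2
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
        System.out.println(hasSurrogates("ABCD"));                     // false
    }
}
```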
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24217
I believe that what you've done is correct. Since Java 5, String.length() is defined as the number of 16-bit char values it takes to represent the String in UTF-16; in the cases you're worried about, this number is larger than the number of code points ("characters") in the string, which you can get from codePointCount(0, length()). charAt() returns the 16-bit value at the given index, which might be one of a pair of surrogates. Your code is doing the right thing: each of the two members of a surrogate pair will be stored in a separate adjacent short.
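A minimal sketch of a conversion along these lines, one short per `char`, might look as follows. The names `Utf16Shorts`, `toShorts` and `fromShorts` are hypothetical and this is not the code posted in the thread; it only illustrates how `length()`, `charAt()` and `codePointCount()` behave with a surrogate pair.

```java
import java.util.Arrays;

final class Utf16Shorts {
    // One short per UTF-16 code unit; a surrogate pair occupies two adjacent shorts.
    static short[] toShorts(String s) {
        short[] out = new short[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (short) s.charAt(i); // same 16 bits, reinterpreted as signed
        }
        return out;
    }

    // Reverse direction: reinterpret each short as an unsigned 16-bit char.
    static String fromShorts(short[] units) {
        char[] chars = new char[units.length];
        for (int i = 0; i < units.length; i++) {
            chars[i] = (char) units[i];
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // "A" followed by U+1D11E, which is encoded as the pair 0xD834 0xDD1E.
        String s = "A" + new String(Character.toChars(0x1D11E));

        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(Arrays.toString(toShorts(s)));
        System.out.println(fromShorts(toShorts(s)).equals(s)); // true
    }
}
```

Values of 0x8000 and above, including all surrogates, come out as negative shorts, but the bit pattern is unchanged, so the round trip back to a String is lossless.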
 
Ulf Dittmer
Rancher
Posts: 42972
The Java I/O FAQ at http://faq.javaranch.com/java/JavaIoFaq points to two blog entries about Unicode characters that are not in the BMP (Basic Multilingual Plane); they also include some code. Maybe that's what you're looking for?
 