
java String UTF8

 
Edward Chen
Ranch Hand
Posts: 798


My goal is to take a String and convert it to UTF-8.
1. The above way is wrong. See the comment.
2. I can't set my own default locale.
3. Before we convert it to UTF-8, we should know the string's original encoding. But how can I find that out?

Thanks
 
Francis Shillitoe
Greenhorn
Posts: 22
Strings in Java are always stored as Unicode, in UTF-16 (UCS-2 is its older fixed-width predecessor, which cannot represent characters outside the Basic Multilingual Plane). When you ask how you can determine the encoding of a String, I assume you mean some series of bytes in a file. Unfortunately, there is no reliable way to determine this from the bytes alone; you have to know the character encoding that was used to encode the characters into bytes. To get non-ASCII characters into a String in a Java source file you can use \u escapes. Character sets are simply mappings between a number and a character (e.g. Unicode). Character encodings are mappings between this number and a sequence of bytes (e.g. UTF-8, UTF-16).

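import java.io.UnsupportedEncodingException;

public class Utf8Bytes {
    public static void main(String[] args) {
        // \u escapes spell out "Hello" using only ASCII in the source file
        String myString = "\u0048\u0065\u006C\u006C\u006F World";
        System.out.println(myString);
        byte[] myBytes = null;

        try {
            // Encode the String's characters as UTF-8 bytes
            myBytes = myString.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            System.exit(-1);
        }

        // Print each byte as a signed decimal value
        for (int i = 0; i < myBytes.length; i++) {
            System.out.println(myBytes[i]);
        }
    }
}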
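
To illustrate why the bytes alone do not reveal their encoding, here is a minimal sketch that decodes the same two bytes under two different charsets. The class name DecodeDemo and the helper decode are made up for this example; it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the thread.
class DecodeDemo {
    // Interpret the same raw bytes under a caller-supplied charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encodes to the two bytes 0xC3 0xA9 in UTF-8.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());         // 1 character
        System.out.println(asLatin1.length());       // 2 characters
        System.out.println(asUtf8.equals(asLatin1)); // false
    }
}
```

Both decodings succeed without any error, which is why you have to be told the encoding rather than guess it from the data.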


Francis
 
Edward Chen
Ranch Hand
Posts: 798
Thanks.

1. If I have a string like

String aa = new String(" \u67e5\u770b\u5168\u90e8");
System.out.println(aa);

Sometimes the system prints the UTF codes, and sometimes it prints the actual Chinese characters. That looks odd. Why?

2. Is the UTF coding the same on every system? No matter what the OS or locale, a Chinese character should always have the same UTF code. Is this correct?

3. Are Unicode and UTF-8 different concepts? In my understanding, UTF-8 is a kind of Unicode. Is that right?

Thanks.
 
Francis Shillitoe
Greenhorn
Posts: 22
UTF-8 is not Unicode; it is a way of encoding Unicode. See:

http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#unicode

for a good explanation of the differences.
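
As a rough sketch of that distinction, the example below takes one character from the string above, U+67E5, and shows that its Unicode code point is fixed while the bytes depend on the encoding chosen. The class name EncodingDemo and the helper toHex are made up for this example; it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the thread.
class EncodingDemo {
    // Format bytes as space-separated two-digit uppercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u67e5";

        // The code point is the same on every OS and in every locale.
        System.out.println(Integer.toHexString(s.codePointAt(0)));        // 67e5

        // The byte sequence differs from one encoding to another.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));    // E6 9F A5
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 67 E5
    }
}
```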

If your program prints Chinese characters correctly on one system but not on another, the symptom tells you where to look. Empty squares almost certainly mean a font issue: you need a Unicode font installed (such as the Microsoft Arial Unicode font available on an MS Office CD) to see the full range of characters in a UTF-8 encoded file. Question marks usually mean that the console or output encoding on that system cannot represent the characters.
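
Besides fonts, the output encoding alone can turn characters into question marks. The sketch below, under the assumption that a PrintStream stands in for System.out, writes the same text through a UTF-8 stream and a US-ASCII stream and compares the bytes produced. The class name ConsoleEncodingDemo and the helper printWith are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; the names are illustrative, not from the thread.
class ConsoleEncodingDemo {
    // Print text through a PrintStream that uses the named encoding
    // and return the bytes that were actually written.
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u67e5\u770b";

        // UTF-8 can represent both characters: three bytes each.
        System.out.println(printWith(text, "UTF-8").length);    // 6

        // US-ASCII cannot, so each character is replaced by '?'.
        byte[] ascii = printWith(text, "US-ASCII");
        System.out.println(ascii.length);                       // 2
        System.out.println((char) ascii[0]);                    // ?
    }
}
```

If the real console turns out to be the problem, wrapping System.out in a PrintStream with an explicit encoding that the terminal understands is one option to try.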

All these sorts of issues are covered under the subject of Internationalization (I18N). This is a good site on the subject:

http://www.joconner.com/javai18n/

regards,

Francis
 