
Frustration of encoding issues in Java

 
Taka Chan
Ranch Hand
Posts: 33
I have seen some technical articles say that Java uses UTF-16 as its internal encoding. I am getting frustrated with the usage of getBytes(String charset) and the String constructor String(byte[], charset), so I would like to get this clear.

For getBytes(String charset), the Javadoc says that it returns a byte array encoded using the specified charset. Does that mean that if I have a string in Big5 encoding and I execute str.getBytes("BIG5"), I am telling the JVM that the string is in Big5, and it will convert the string from Big5 to UTF-16 and store it in memory as UTF-16? Or does it mean that the resulting byte[] is in Big5 format?

Furthermore, if I have another string in Big5 and a database whose encoding is UTF-8, is the following statement correct for storing the string in the database properly as UTF-8?

new String(str.getBytes("Big5"), "UTF-8");

I really feel frustrated, and maybe I am not asking my question clearly; I apologise for any inconvenience caused. I hope someone can clear up my confusion. Thanks a lot.
[ July 04, 2006: Message edited by: Taka Chan ]
 
Peter Chase
Ranch Hand
Posts: 1970
Java used to store Strings in fixed-size 16-bit Unicode. However, Unicode itself has advanced and now supports far more characters than 16 bits can fit. Therefore, Java has switched to UTF-16 encoding internally.
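
To illustrate what UTF-16 storage means in practice, here is a minimal sketch. The class name Utf16Demo and the choice of code point U+1F600 are assumptions made purely for illustration. A character outside the 16-bit range occupies two char values (a surrogate pair), so length() counts UTF-16 code units rather than characters.

```java
class Utf16Demo {
    // U+1F600 lies outside the 16-bit range, so Java stores it as two chars.
    static final String SUPPLEMENTARY = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        String s = "A" + SUPPLEMENTARY;

        // Number of UTF-16 code units (char values).
        System.out.println(s.length());                             // 3
        // Number of Unicode code points (what a reader would call characters).
        System.out.println(s.codePointCount(0, s.length()));        // 2
        // The char at index 1 is the first half of the surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```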

However, most of the time, one does not need to care about how Java stores its Strings internally. The important thing is to use the right encoding when transferring Strings between Java applications and other applications.

You have correctly identified the String.getBytes(String encoding) method and the String(byte[] bytes, String encoding) constructor as being important in converting Java Strings to and from other encodings.

If you have some bytes containing characters in, say, UTF-8 encoding, you use the String(bytes, "UTF-8") constructor to make a Java String from them. If you have a Java String and you want it in, say, UTF-8 encoding, you use the getBytes("UTF-8") method to return the bytes.
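
The two directions described above can be sketched roughly as follows. The class name EncodingDemo and its helper methods are hypothetical, and the main method assumes the JRE ships the Big5 charset (most full JDK installations do, but it is not guaranteed). The point is that getBytes produces bytes in the named encoding, while the constructor needs to be told the encoding the bytes are already in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Decoding: bytes in a known external encoding -> Java String.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Encoding: Java String -> bytes in the requested encoding.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Transcoding external data: decode with the source charset,
    // then encode with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return encode(decode(input, from), to);
    }

    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5"); // assumption: Big5 is available in this JRE
        String text = "\u4E2D\u6587";           // two Chinese characters

        byte[] big5Bytes = encode(text, big5);
        byte[] utf8Bytes = transcode(big5Bytes, big5, StandardCharsets.UTF_8);
        System.out.println(big5Bytes.length + " Big5 bytes, " + utf8Bytes.length + " UTF-8 bytes");

        // Decoding with the wrong charset converts nothing; it corrupts the text.
        String wrong = decode(big5Bytes, StandardCharsets.UTF_8);
        System.out.println(wrong.equals(text)); // false
    }
}
```

By the same reasoning, new String(str.getBytes("Big5"), "UTF-8") reads Big5 bytes as if they were UTF-8 and so damages the text. For a database, the Java String is typically handed to the JDBC driver as it is (for example through PreparedStatement.setString), and the driver takes care of the database encoding.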
 
Taka Chan
Ranch Hand
Posts: 33
Another question: if I have a string but do not know its encoding, and I want to convert it to UTF-8, can I achieve this with the following statement?

String finalStr = new String(originalStr.getBytes("UTF-8"));

Or

String finalStr = new String(originalStr.getBytes("UTF-8"), "UTF-8");

The default system encoding is ISO8859-1. Assume that my original string can be converted to UTF-8 properly; say its encoding is perhaps Big5 or GBK.

Am I correct? Am I guaranteed to get a resulting string in UTF-8 from the statements above? Thank you very much.
 
Anju sethi
Ranch Hand
Posts: 91
Yes, you are correct. We convert a non-Unicode charset to Unicode by passing the bytes to the String constructor, and Unicode to a non-Unicode charset by using the getBytes method.

Refer to this link:
http://java.sun.com/docs/books/tutorial/i18n/text/string.html
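
One caveat on the two statements asked about above, offered as a hedged sketch rather than a definitive answer. Once a String exists it has no external encoding of its own, so encoding it to UTF-8 bytes and decoding those bytes as UTF-8 again simply gives back an equal String (for well-formed text). Omitting the charset in the constructor decodes with the platform default, which is simulated below with an explicit ISO-8859-1. The class name RoundTripDemo and its methods are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encoding and then decoding with the same charset returns an equal String.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    // Repairs text whose bytes were wrongly decoded as ISO-8859-1.
    // This works because ISO-8859-1 maps every byte value to exactly one char and back.
    static String repair(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587"; // two Chinese characters

        // Same charset in both directions: nothing changes.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // true

        // Simulates new String(bytes) on a system whose default charset is ISO-8859-1:
        // each of the six UTF-8 bytes becomes one unrelated char.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 6

        // If the real encoding of the bytes is known, the text can be recovered.
        System.out.println(repair(garbled, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

So if a String already looks wrong, the fix belongs at the point where the bytes were first decoded: that is where the real encoding (Big5, GBK, and so on) has to be supplied.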