Encoding Norwegian characters as UTF-8

 
Vijaishanker bala
Ranch Hand
Posts: 82
Hi all,

I am trying to encode a String with the content "Gratulerer! Du har n \u00e5" so that the \u00e5 is displayed as the Norwegian character "å". I have tried the following two versions:

public static String doEncode(String text) throws IOException{
return new String(text.getBytes(),"ISO-8859-1");
}


public static String doEncode(String text) throws IOException{
CharsetEncoder charSetEncode = Charset.forName("ISO-8859-1").newEncoder();
charSetEncode.reset();
ByteBuffer buffer = ByteBuffer.allocate(text.length());
charSetEncode.encode(CharBuffer.wrap(text.toCharArray()), buffer, true);
return new String(buffer.array());
}

Both of the above methods return the following: "Gratulerer! Du har n Ã¥"

If I replace ISO-8859-1 with UTF8, I get "Gratulerer! Du har n å"

I run the program with the JVM option -Dfile.encoding=UTF8 to emulate the default encoding of GlassFish. If I set the same option to ISO8859_1, I get the expected behavior, but since UTF8 is a superset of ISO8859_1, I was hoping to get the same result. I am not permitted to change the encoding of GlassFish, since certain other things in the UI would then be displayed incorrectly.

thanks,

vijai
 
James Sabre
Ranch Hand
Posts: 781
Strings in Java are encoded as UTF-16. Always always always UTF-16. You cannot convert a String to a different encoding since they are always encoded as UTF-16.

Your code

return new String(text.getBytes(),"ISO-8859-1")

says: take the String referenced by text and convert it to bytes using your default encoding. Then assume that those bytes are ISO-8859-1 and convert them back to a String. If your default encoding is ISO-8859-1 and every character in your string can be represented in ISO-8859-1, then your new string will be exactly the same as the original, i.e. you have a null operation. If your default character encoding is not ISO-8859-1, then you will possibly (in your case certainly) corrupt the string.
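To make that concrete, here is a minimal sketch of the same round trip with the default encoding pinned to UTF-8, which is what -Dfile.encoding=UTF8 would give you. The class and method names (RoundTripDemo, misdecode) are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encode as UTF-8 (standing in for a UTF-8 default encoding),
    // then wrongly decode those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String corrupted = misdecode("n\u00e5");
        // The two UTF-8 bytes of U+00E5 (C3 A5) are each decoded as a
        // separate ISO-8859-1 character, so the string grows by one char.
        System.out.println(corrupted.length());                // 3
        System.out.println(corrupted.equals("n\u00c3\u00a5")); // true
    }
}
```

The two extra characters, U+00C3 and U+00A5, are exactly the "Ã¥" seen in the reported output.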

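Your second approach has a similar problem: it encodes the text to ISO-8859-1 bytes and then, in new String(buffer.array()), decodes those bytes using your default encoding.

If you just want the bytes of the UTF-8 encoding, then use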

byte[] utfBytesOfMyString = "my string".getBytes("utf-8");
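As a slightly fuller sketch, assuming Java 7 or later, the same idea can be written with StandardCharsets.UTF_8, which avoids the checked UnsupportedEncodingException and does not depend on file.encoding at all. The class and helper names (ExplicitCharsetDemo, toUtf8, fromUtf8) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Encode a String to UTF-8 bytes, independent of the JVM default encoding.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back to a String using the same explicit charset.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Du har n\u00e5";
        byte[] bytes = toUtf8(original);
        // 8 ASCII characters take one byte each; U+00E5 takes two.
        System.out.println(bytes.length);                       // 10
        System.out.println(fromUtf8(bytes).equals(original));   // true
    }
}
```

As long as the same charset is named on both the encoding and the decoding side, the string survives the round trip whatever the default encoding is.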
 
Alan Moore
Ranch Hand
Posts: 262
Also, UTF-8 is not a superset of ISO-8859-1.
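A quick way to see this at the byte level is to encode the same character in both charsets and compare. This is only an illustrative sketch (the SupersetCheck class and hex helper are made up, and StandardCharsets assumes Java 7 or later): the two encodings agree for ASCII, but "å" is one byte in ISO-8859-1 and two bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class SupersetCheck {
    // Render a byte array as space-separated uppercase hex, e.g. "C3 A5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aRing = "\u00e5";
        System.out.println(hex(aRing.getBytes(StandardCharsets.ISO_8859_1))); // E5
        System.out.println(hex(aRing.getBytes(StandardCharsets.UTF_8)));      // C3 A5
        // ASCII characters are encoded identically by both charsets.
        System.out.println(hex("n".getBytes(StandardCharsets.ISO_8859_1)));   // 6E
        System.out.println(hex("n".getBytes(StandardCharsets.UTF_8)));        // 6E
    }
}
```

So ISO-8859-1 bytes above 0x7F cannot simply be read as UTF-8, and vice versa.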