
Convert 3-byte UTF-8 Japanese character into HTML entity

 
Evan Scharfer
Greenhorn
Posts: 3
Hi,

I have been looking all over the place and haven't found a way to convert a UTF-8 String that contains Japanese characters into HTML entities.

For example I have the character: あ

Which corresponds to a 3 Byte String with int values (227,129,130).

How do I convert these three integers into the one large integer that is used as the Unicode value for the HTML entity?
The correct HTML entity for this character is &#12354;.

This seems easy in JavaScript but not in Java, which is what I need it for.




 
Paul Clapham
Sheriff
Posts: 21875
36
Eclipse IDE Firefox Browser MySQL Database
Well, in Java there isn't any such thing as a "UTF-8 string". There are only strings of Unicode characters. In particular the character you have there is a Unicode character, as are all the other characters you used.

The UTF-8 aspect comes into play when you convert this string to an array of bytes using the UTF-8 charset. When you do that, the character you refer to will be converted to three bytes, true. But that has nothing to do with representing it as an HTML entity. So don't do that. Keep the data as a String.

As for converting that character to an HTML entity, why do you believe you have to do that? If you create an HTML document using the UTF-8 charset, you can put that character into the document directly. It doesn't have to be an entity.
 
Greg Charles
Sheriff
Posts: 3010
12
Firefox Browser IntelliJ IDE Java Mac Ruby
Well, assuming you are converting the UTF-8 into Unicode for your Java Strings, you can just use the Unicode values to build the entities. For your character, Hiragana Letter A, the Unicode hex value is 3042.

So &#x3042; = あ

(&#12354; also works, because 0x3042 is 12354 in base 10, so you can take your pick.)
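To make that concrete, here is a minimal sketch of building both forms of the entity from a code point in Java. The class and method names (`EntityFromCodePoint`, `hexEntity`, `decimalEntity`) are made up for illustration, and the sketch assumes the String already holds the correct Unicode character.

```java
public class EntityFromCodePoint {
    // Hexadecimal numeric character reference, e.g. &#x3042;
    public static String hexEntity(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    // Decimal numeric character reference, e.g. &#12354;
    public static String decimalEntity(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        // "\u3042" is Hiragana Letter A; the escape avoids source-file encoding issues.
        int codePoint = "\u3042".codePointAt(0);
        System.out.println(hexEntity(codePoint));     // prints &#x3042;
        System.out.println(decimalEntity(codePoint)); // prints &#12354;
    }
}
```

Both forms refer to the same character, so which one to emit is a matter of taste.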

 
Evan Scharfer
Greenhorn
Posts: 3
For more details:

The Japanese characters are submitted via an HTML form to a Java servlet, which encodes the values as UTF-8. I need to store these values in the database, and the easiest idea I had for doing this without changing the database encoding was to store each value as its HTML entity, because all I need it for is to display it to the user on the web page.

Greg, how did you get the 3042 from my 3 bytes? Can you give me a quick algorithm if there is one?

Say.. // String HLetterA contains the UTF-8 representation of あ

byte[] theBytes = HLetterA.getBytes();

Is it possible to take those bytes and get that Hex number or base 10 number with a simple algorithm?

Thanks,
Evan

 
Paul Clapham
Sheriff
Posts: 21875
36
Eclipse IDE Firefox Browser MySQL Database
The Wikipedia article on UTF-8 ought to have the algorithm. But that algorithm is already built into the Java code which converts an array of bytes to a String using UTF-8, so why don't you just convert your array of bytes to a String?
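For the curious, here is a rough sketch of what that algorithm does for the 3-byte case, next to the built-in route. The class and method names are hypothetical, and the hand-written decoder only checks the bit patterns `1110xxxx 10xxxxxx 10xxxxxx`; it does not reject overlong encodings or surrogates the way a real decoder should.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ThreeByteDecode {
    // Decodes one 3-byte UTF-8 sequence by hand. Arguments are unsigned byte values (0..255).
    public static int decodeThreeBytes(int b0, int b1, int b2) {
        if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a 3-byte UTF-8 sequence");
        }
        // 4 payload bits from the lead byte, 6 from each continuation byte.
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    // The same result using the decoder that ships with the JDK.
    public static int decodeWithJdk(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(decodeThreeBytes(227, 129, 130)); // prints 12354

        byte[] bytes = {(byte) 227, (byte) 129, (byte) 130};
        System.out.println(Integer.toHexString(decodeWithJdk(bytes))); // prints 3042
    }
}
```

Worked through for (227, 129, 130): 227 is 0xE3, whose low four bits are 3; 129 and 130 contribute 1 and 2. So the value is (3 << 12) + (1 << 6) + 2 = 12288 + 64 + 2 = 12354, which is 0x3042.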

I'm still not persuaded that you need an entity. You certainly don't need one to display that character in an HTML document.

For amusement, have a look at this blog post to see the havoc that can be wreaked with HTML entities.
 
Paul Clapham
Sheriff
Posts: 21875
36
Eclipse IDE Firefox Browser MySQL Database
Evan Scharfer wrote: Say.. // String HLetterA contains the UTF-8 representation of あ

byte[] theBytes = HLetterA.getBytes();

Is it possible to take those bytes and get that Hex number or base 10 number with a simple algorithm?


As I said, the String will not contain the UTF-8 representation of that character. Or if it does, you did something wrong to make that happen, in which case you should stop this train of thought and go back to fix what you did wrong. But let's assume you have the 1-character string containing that character. In which case you don't need to convert it to bytes. You especially don't need to convert it to bytes using your system's default encoding, which might not be UTF-8. You just need to look at the Unicode value of the character in it. Like this:
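A minimal sketch of reading the Unicode value straight from the String, with no bytes involved (this is an illustration, not necessarily the snippet originally posted here). The class name and the `escapeNonAscii` helper are hypothetical, and the helper assumes only non-ASCII characters need escaping; it does not touch `&`, `<` or `>`.

```java
public class UnicodeValueOfString {
    // Replaces every non-ASCII code point with a decimal numeric character reference.
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);                 // plain ASCII stays as-is
            } else {
                out.append("&#").append(cp).append(';'); // e.g. &#12354;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String hLetterA = "\u3042"; // the 1-character string containing Hiragana Letter A

        // codePointAt also copes with characters outside the Basic Multilingual Plane,
        // which a plain (int) charAt(0) would split into surrogates.
        int value = hLetterA.codePointAt(0);
        System.out.println(value);                        // prints 12354
        System.out.println(Integer.toHexString(value));   // prints 3042
        System.out.println(escapeNonAscii("a\u3042b"));   // prints a&#12354;b
    }
}
```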
 
Evan Scharfer
Greenhorn
Posts: 3
Thanks for the help. I figured my String was already in UTF-8 because I changed a lot of settings to do that.

However, it was still in ISO-8859-1, and my default system encoding is different again, so my getBytes call was never working correctly.

String utfHLetterA = new String(HLetterA.getBytes("8859_1"),"UTF8");

Did the trick.

Thanks again,
Evan
 
James Sabre
Ranch Hand
Posts: 781
Java Netbeans IDE Ubuntu
That solution is flawed. For that solution to be necessary implies that when you created the String referenced by HLetterA, you used ISO-8859-1 character decoding instead of UTF-8. You need to correct the string generation at the source, not add a bodge to paper over the original problem.
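A small sketch of the difference, using the three bytes from the start of the thread. The class and method names are hypothetical. In a servlet the decoding happens inside the container, so the usual way to fix it at the source is to call `request.setCharacterEncoding("UTF-8")` before reading any parameters; that part is an assumption about the setup in question and is not shown in the code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeAtSource {
    // Turns raw bytes into a String using the given charset.
    public static String decode(byte[] wire, Charset charset) {
        return new String(wire, charset);
    }

    // The workaround from the thread: undo an ISO-8859-1 mis-decoding after the fact.
    public static String patch(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0xE3, (byte) 0x81, (byte) 0x82}; // UTF-8 bytes of Hiragana Letter A

        String wrong = decode(wire, StandardCharsets.ISO_8859_1); // three unrelated characters
        String right = decode(wire, StandardCharsets.UTF_8);      // the one intended character

        System.out.println(wrong.length());              // prints 3
        System.out.println(right.length());              // prints 1
        System.out.println(patch(wrong).equals(right));  // prints true
    }
}
```

The patch only works because ISO-8859-1 maps every byte value to a character and back; it depends on knowing exactly which wrong charset was applied, which is why decoding correctly in the first place is the more robust fix.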
 