multiple language support in one XML

 
Yan Zhou
Hi,

Is it possible to support multiple languages in one XML document, e.g., Japanese and Chinese text in the same XML? If so, how would I specify the encoding in that XML? Simply UTF-8?

The reason for this question is that I have run into issues dealing with Chinese text in my Java program. I use the latest JAXB as the XML parser and store the text in a Unicode (UTF-8) PostgreSQL database (latest version).

When I type in Chinese text, JAXB has no problem marshalling it into an XML string with the encoding set to UTF-8, and my code successfully saves the XML text into the database. But when reading it back out, the JAXB Unmarshaller reports an error on the XML string I just read from the DB: invalid byte 2 of 3 byte UTF-8 character.

The first question is: if both my XML and my DB specify UTF-8 as the encoding, why am I still having problems parsing the XML text?

Someone mentioned that I have to tell the parser the character set I used, which is "GB2312": just because JAXB supports UTF-8 does not mean it knows how to convert Chinese text into UTF-8. Once I changed the encoding to GB2312, the program worked, and reading the XML text back out caused no problem.

However, my question continues: if I need to support both Japanese and Chinese text in the same XML, how do I specify the encoding, since I would now have two different encodings? Do I have to convert my text into UTF-8 myself and set the XML encoding to UTF-8?
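To make the scenario concrete, here is a minimal sketch of one UTF-8 XML document carrying both Chinese (中文) and Japanese (日本語) text. It uses the JDK's built-in DOM parser rather than JAXB, and the class name, element names, and sample strings are all made up for illustration. The one assumption that matters is that the string is turned into bytes with an explicit UTF-8 charset, so the bytes agree with the encoding declared in the XML header.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class MixedLanguageXml {
    // Hypothetical sample text, written as Unicode escapes so the source
    // file's own encoding cannot interfere.
    public static final String CHINESE = "\u4e2d\u6587";
    public static final String JAPANESE = "\u65e5\u672c\u8a9e";

    // One document, one declared encoding, two languages.
    public static String buildXml() {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
             + "<doc><zh>" + CHINESE + "</zh><ja>" + JAPANESE + "</ja></doc>";
    }

    // Encode the XML explicitly as UTF-8, parse the bytes, and return the
    // text content of the first element with the given tag name.
    public static String roundTrip(String tag) {
        try {
            byte[] bytes = buildXml().getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getElementsByTagName(tag).item(0).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(CHINESE.equals(roundTrip("zh")));
        System.out.println(JAPANESE.equals(roundTrip("ja")));
    }
}
```

If the same string were converted with a no-argument `getBytes()` on a machine whose default charset is not UTF-8, the bytes would no longer match the declared encoding; whether that is what happens in my program is only a guess.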

Another question is: what is the relationship between UTF-8 and all the character sets (GB2312, Big5, etc.)?
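As a small illustration of that relationship, the sketch below compares the bytes of a single character, 中 (U+4E2D), under two encodings. The class and method names are hypothetical, and the GB2312 bytes are hard-coded (0xD6 0xD0) on the assumption that the GB2312 charset may not be installed in every JRE. It suggests that UTF-8 and GB2312 are alternative byte representations of the same character, and that bytes written in one cannot be read as the other.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // The character U+4E2D as a Java string (Java strings are Unicode).
    public static final String ZHONG = "\u4e2d";

    // The same character encoded as GB2312, hard-coded for portability.
    public static final byte[] ZHONG_GB2312 = {(byte) 0xD6, (byte) 0xD0};

    // Number of bytes the character occupies in UTF-8.
    public static int utf8Length() {
        return ZHONG.getBytes(StandardCharsets.UTF_8).length;
    }

    // Strictly decode the bytes as UTF-8; false if they are malformed.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(utf8Length());
        System.out.println(isValidUtf8(ZHONG.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(ZHONG_GB2312));
    }
}
```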

Since an XML file must be in one of the languages, it must use one of the character sets, and in turn the encoding attribute in the XML must name that character set, NOT "UTF-8" (since the parser does not know how to convert characters into UTF-8 without knowing the character set in use). If so, when would we ever use "UTF-8" as the encoding in our XML?

Thanks.
Yan
 
Yan Zhou
Another issue I do not understand: if UTF-8 should not be used when I am inputting Chinese text (and GB2312 should be used instead), why does JAXB not report an error when marshalling the text, but only when unmarshalling it?

Thanks.
Yan
 