multiple language support in one XML

 
Ranch Hand
Posts: 137
Hi,

Is it possible to support multiple languages in one XML document, e.g., Japanese and Chinese text in the same file? If so, how would I specify the encoding in that XML: simply UTF-8?

The reason for this question is that I ran into issues dealing with Chinese text in my Java program. I use the latest JAXB as the XML parser and store the text in a Unicode (UTF-8) PostgreSQL database (latest version).

When I type in Chinese text, JAXB has no problem marshalling it into an XML string with the encoding set to UTF-8, and my code successfully saves the XML text into the database. But when I read it back, the JAXB Unmarshaller gives an error on the XML string I just read from the DB: invalid byte 2 of 3 byte UTF-8 character.

My first question is: if both my XML and my DB specify UTF-8 as the encoding, why am I still having problems parsing the XML text?
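One possible explanation, offered as an assumption because the code that writes and reads the XML is not shown here: the XML declaration only labels the bytes, it does not convert them. If the string is turned into bytes with some other charset somewhere along the way (for example a platform-default `getBytes()` call), the declaration still says UTF-8 and a strict parser rejects the input. The sketch below reproduces that situation with the JDK's built-in DOM parser rather than JAXB; the class and method names are made up for illustration, and it assumes the GB2312 charset is available in the running JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class; not taken from the poster's program.
class EncodingMismatchDemo {
    // "\u4e2d\u6587" is the Chinese word for "Chinese (language)".
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><note>\u4e2d\u6587</note>";

    /** Returns true if the bytes parse as XML, false if the parser rejects them. */
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException e) {
            // A malformed byte sequence surfaces here; the exact wording varies by parser.
            System.out.println("Parser rejected input: " + e.getMessage());
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes really are UTF-8, matching the declaration: parses.
        System.out.println(parses(XML.getBytes(StandardCharsets.UTF_8)));

        // Declaration still says UTF-8, but the bytes are GB2312: rejected.
        System.out.println(parses(XML.getBytes(Charset.forName("GB2312"))));
    }
}
```

If this is what is happening, the place to look is every point where the XML changes between a `String` and bytes (writing to the DB, reading from it, wrapping it in a stream), making sure each one names UTF-8 explicitly.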

Someone mentioned that I have to tell the parser the character set I used, which is "GB2312": just because JAXB supports UTF-8 does not mean it knows how to convert Chinese text into UTF-8. Once I changed the encoding to GB2312, the program worked, and reading the XML text back was no problem.

However, my question continues: if I need to support both Japanese and Chinese text in the same XML, how do I specify the encoding, since I now have two different encodings? Do I have to convert my text into UTF-8 myself and set the XML encoding to UTF-8?
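For what it is worth, a Java `String` is already Unicode, so Chinese and Japanese characters can sit in the same string, and UTF-8 can encode all of them. Here is a minimal sketch of one UTF-8 document carrying both, again using the JDK's DOM parser instead of JAXB; the element names `zh` and `ja` and the class name are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; element names are illustrative only.
class MixedLanguageXmlDemo {
    static final String CHINESE = "\u4f60\u597d";                      // "hello" in Chinese
    static final String JAPANESE = "\u3053\u3093\u306b\u3061\u306f";  // "hello" in Japanese (hiragana)

    /** Builds one XML document containing both texts and encodes it as UTF-8. */
    static byte[] buildXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<greetings><zh>" + CHINESE + "</zh><ja>" + JAPANESE + "</ja></greetings>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the bytes and returns the text of the first element with the given tag. */
    static String readElement(byte[] xmlBytes, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getElementsByTagName(tag).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = buildXml();
        // Both texts survive the round trip through a single UTF-8 document.
        System.out.println(CHINESE.equals(readElement(xml, "zh")));
        System.out.println(JAPANESE.equals(readElement(xml, "ja")));
    }
}
```

The only manual step is naming UTF-8 when the string becomes bytes; no per-language conversion is involved.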

Another question: what is the relationship between UTF-8 and the various character sets (GB2312, Big5, etc.)?
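One way to picture the relationship: UTF-8, GB2312 and Big5 are each a different rule for turning characters into bytes, and the `encoding` attribute tells the parser which rule was used. The small sketch below encodes the same two characters both ways and decodes them back; it assumes the running JDK ships the GB2312 charset, and the class name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing one string under two encodings.
class CharsetRelationshipDemo {
    /** Formats bytes as lowercase hex so the two encodings can be compared. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        Charset gb2312 = Charset.forName("GB2312");

        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] gbBytes = text.getBytes(gb2312);

        System.out.println(hex(utf8Bytes)); // e4b8ade69687 (3 bytes per character)
        System.out.println(hex(gbBytes));   // d6d0cec4 (2 bytes per character)

        // Decoding each byte array with its own charset gives back the same string.
        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        String fromGb = new String(gbBytes, gb2312);
        System.out.println(fromUtf8.equals(fromGb)); // true
    }
}
```

The bytes differ, the characters do not, which is why decoding with the wrong rule produces errors or garbage.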

Since an XML file must be in one of the languages, it must use one of the character sets, and in turn the encoding attribute in the XML must name that character set, NOT "UTF-8" (since the parser does not know how to convert characters into UTF-8 without knowing the character set in use). If so, when would we ever use "UTF-8" as the encoding in our XML?

Thanks.
Yan
 
Yan Zhou
Ranch Hand
Posts: 137
Another issue I do not understand: if UTF-8 should not be used when I am inputting Chinese text (and GB2312 should be used instead), why does JAXB not report an error when marshalling the text, but only when unmarshalling it?

Thanks.
Yan
 