
Character encoding question

 
William Alfred
Greenhorn
Posts: 10
This is (hopefully) a really simple question, but there is such a plethora of information on this topic (a lot of it seemingly suspect) that it's hard to separate the good from the bad.

In a nutshell, I need to read text encoded in ISO-8859-1 and save it in a database as UTF-8.

Specifically, I have an XML file that begins with:

<?xml version="1.0" encoding="ISO-8859-1" ?>

I am parsing it like so:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse ( "test.xml" );

I am writing to a MySQL database, which I am opening like so:

conn = DriverManager.getConnection (
"jdbc:" + "mysql://" + host + "/" + db
+ "?useUnicode=yes&characterEncoding=UTF-8"
+ "&user=" + user + "&password=" + pass );

which should take care of the database end of things (I think).

What happens in the middle is what concerns me -- how do I convert what I am reading from the ISO-8859-1 encoded XML into strings that can be correctly inserted into my tables?

From what I understand, such a conversion should be possible and perhaps simple -- what I'm looking for is a good idiomatic way of getting the job done.

Thanks in advance for any advice!!
 
Paul Clapham
Sheriff
Posts: 22507
"Stephen Dedalus", please check your private messages regarding an important administrative matter.

Thank you.
 
Paul Clapham
Sheriff
Posts: 22507
That particular question is simple: It's the responsibility of the XML parser to convert from bytes to chars, using the encoding specified in the XML document, if it can.

And since you have just passed the name of the file to the parser, it can read the document, discover the encoding, and then continue to read the file using that encoding. However you could have interfered with the process by passing, for example, a FileReader to the parser. If that FileReader happened to use the wrong encoding, then problems might arise.
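
To make that concrete, here is a minimal sketch of handing the parser raw bytes and letting it pick up the encoding from the XML declaration. The class name XmlEncodingDemo, the helper rootText, and the sample document are made up for illustration; an in-memory byte array stands in for test.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; the names are illustrative, not from the original post.
public class XmlEncodingDemo {

    // Parses XML from raw bytes. No charset is passed in: the parser reads
    // the encoding from the XML declaration and decodes the bytes itself.
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption), stored as ISO-8859-1 bytes.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?><name>caf\u00e9</name>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = rootText(latin1);
        System.out.println(text.equals("caf\u00e9")); // prints true

        // A Java String has no byte encoding of its own. UTF-8 only comes into
        // play when the String is turned back into bytes, for example by the
        // JDBC driver: 3 ASCII bytes plus 2 bytes for the accented letter.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // prints 5
    }
}
```

So once the parser has produced Strings, there is nothing left to convert in the middle; the encoding on the database side is applied when the Strings are written out.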

In general, apart from the XML context, you use an InputStreamReader to convert an InputStream from bytes to chars, and you provide that InputStreamReader with the desired encoding. There are commonly-used ways to avoid that decision and to just use the system's default encoding, such as the FileReader I mentioned above. That isn't always a good thing, particularly with XML documents whose encoding doesn't match that default.
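
As a rough illustration of the non-XML case, the sketch below decodes the same bytes with an InputStreamReader twice, once with the matching charset and once with a mismatched one. ReaderCharsetDemo and decode are hypothetical names, and the byte array stands in for a real file or socket stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the original post.
public class ReaderCharsetDemo {

    // Decodes bytes to a String through an InputStreamReader that is given
    // an explicit charset instead of relying on the platform default.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Four bytes in ISO-8859-1; the accented letter is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Matching charset: the text round-trips.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).equals("caf\u00e9")); // prints true

        // Mismatched charset: 0xE9 alone is not valid UTF-8, so the decoded
        // text no longer matches the original.
        System.out.println(decode(latin1, StandardCharsets.UTF_8).equals("caf\u00e9")); // prints false
    }
}
```

A FileReader created without a charset behaves like the second call whenever the platform default differs from the file's real encoding.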

You should really read Oracle's I/O tutorial, particularly the introductory sections about byte streams and character streams.
 
William Alfred
Greenhorn
Posts: 10
Perfect -- that's what I was hoping. As a matter of fact, yes, I'm just passing the name of the file to the parser. Since the encoding is explicitly specified in the XML declaration (and since it's a well-known one), it looks as if I don't need to do anything.

And thanks for the links -- they are quite helpful.

Cheers!
 