
Fun (well, not really..) with character encoding

 
Adam Schweitzer
Greenhorn
Posts: 17
Here's the scenario:
I'm modifying (really just doing some find and replace) and saving XML files. At this stage they're just text files for all intents and purposes, since I'm not using any XML APIs. The files are then used as input for XSLT transformation with the Saxon 6.5.5 processor.

Here's the problem:
If I use the following code to write the file:
FileWriter writer = new FileWriter(new File(outDir, inFile.getName()));
writer.write(tempString);
writer.flush();
writer.close();

when I later go to do the XSLT transformation on one particular file (it doesn't show up on any others.. reason unknown) I get the following exception:
Error at byte 122254 of file:/C:/testboardno/test_in/c14404000mn024/c14404000mn024/WP130700.xml:
Error reported by XML parser: bad continuation of multi-byte UTF-8 sequence (code: 0x3f)
javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: bad continuation of multi-byte UTF-8 sequence (code: 0x3f)
at com.icl.saxon.om.Builder.build(Builder.java:273)
at com.icl.saxon.Controller.transform(Controller.java:977)


I guessed that the problem is that the FileWriter wasn't outputting with the correct character encoding, as all of the XML files are declared as using UTF-8:
<?xml version="1.0" encoding="UTF-8"?>


So I tried replacing the FileWriter with the following code:
OutputStreamWriter writer = new OutputStreamWriter(new BufferedOutputStream( new FileOutputStream( new File(outDir, inFile.getName()) )), "UTF8");

And, in fact, this does solve the problem: the XSLT processor no longer throws an exception. However, it creates a different problem. Whenever there is a "degree" character in the original file, e.g.:
°C
in the output I get the following:
Â°C

Any ideas what's going on here, and how I can make this work?
BTW: Using Java 1.4.2 on Windows XP.
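
A possible explanation for the 0x3f in that error, offered as a sketch rather than a diagnosis: if the file is both read and written with the platform default charset (assumed here to be windows-1252, the usual default on a Western Windows XP install), any UTF-8 byte that windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) is decoded to U+FFFD and written back as '?' (0x3f). The class name, the helper names, and the choice of U+201D as the sample character are hypothetical; nothing in the thread says which character sat at byte 122254.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementByteDemo {
    // Assumption: the platform default charset on the poster's machine was windows-1252.
    public static final Charset CP1252 = Charset.forName("windows-1252");

    /** Simulates decoding UTF-8 bytes with windows-1252 and encoding them back the same way. */
    public static byte[] roundTripThroughCp1252(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, CP1252); // undefined bytes become U+FFFD
        return decoded.getBytes(CP1252);                // U+FFFD is unmappable, so it becomes '?'
    }

    /** Formats bytes as lowercase hex pairs separated by spaces. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: RIGHT DOUBLE QUOTATION MARK, whose UTF-8 form ends in 0x9D.
        byte[] quote = "\u201D".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(quote));                          // e2 80 9d
        System.out.println(hex(roundTripThroughCp1252(quote)));  // e2 80 3f

        // The degree sign survives, because 0xC2 and 0xB0 are both defined in windows-1252.
        byte[] degree = "\u00B0".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(roundTripThroughCp1252(degree))); // c2 b0
    }
}
```

A sequence such as e2 80 3f is exactly what a UTF-8 parser would reject as a bad continuation byte, and the fact that the degree sign passes through unharmed would fit the error showing up in only one file.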
 
Ulf Dittmer
Rancher
Posts: 42972
The degree Celsius character is U+2103 in Unicode, so it can't be represented in 8 bits. That means it will occupy more than one byte in the file (three, in UTF-8). This is how UTF-8 works, and shouldn't cause a problem.
 
Adam Schweitzer
Greenhorn
Posts: 17
Originally posted by Ulf Dittmer:
The degree Celsius character is U+2103 in Unicode, so it can't be represented in 8 bits. That means it will occupy more than one byte in the file (three, in UTF-8). This is how UTF-8 works, and shouldn't cause a problem.

It's actually just the degree character - I guess my example was a bit unintentionally misleading.

The original XML file (generated by Adobe FrameMaker) is encoded in UTF-8 (or at least is declared to be; maybe that's the problem), and this character shows up fine in it.

[ UD: fixed URL ]
[ March 06, 2008: Message edited by: Ulf Dittmer ]
 
Ulf Dittmer
Rancher
Posts: 42972
While that character has a code below 256, it nonetheless has a two-byte UTF-8 representation: 0xC2 0xB0. And 0xC2 is the Â character that you're seeing. So I'd say everything is still good, and you're looking at this with software that's not up to snuff Unicode-wise.
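
To make the byte-level picture concrete, here is a minimal sketch that prints the UTF-8 bytes of the degree sign (U+00B0) and the degree Celsius sign (U+2103), and shows what comes out when those UTF-8 bytes are interpreted as ISO-8859-1. The class and method names are made up for illustration, and ISO-8859-1 is only an assumed stand-in for whatever single-byte encoding the viewing software uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DegreeBytesDemo {
    /** Returns the UTF-8 encoding of s as lowercase hex pairs separated by spaces. */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes s as UTF-8, then decodes the bytes with viewerCharset, as a mismatched viewer would. */
    public static String viewAs(String s, Charset viewerCharset) {
        return new String(s.getBytes(StandardCharsets.UTF_8), viewerCharset);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00B0")); // c2 b0    (degree sign, two bytes)
        System.out.println(utf8Hex("\u2103")); // e2 84 83 (degree Celsius sign, three bytes)

        // The returned string is U+00C2 U+00B0 'C': an A-circumflex in front of the degree sign.
        String mojibake = viewAs("\u00B0C", StandardCharsets.ISO_8859_1);
        System.out.println(mojibake.length()); // 3
    }
}
```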
 
Adam Schweitzer
Greenhorn
Posts: 17
Figured out what was going wrong - I need to read the file in using UTF-8 as well.

I had:
BufferedReader reader = new BufferedReader(new FileReader(inFile));
which appears to be what was mangling the UTF-8 characters.

Replaced with:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF8"));

It all makes so much sense now
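
The fix described above can be sketched as a round trip in which both the reader and the writer name their charset explicitly. This is an illustration under assumptions, not the poster's actual code: the class, the helper names, the temp files, and the "old"/"new" replacement are hypothetical, and it uses try-with-resources and StandardCharsets, which need Java 7 or later rather than the 1.4.2 mentioned in the thread.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8FileRoundTrip {
    /** Reads the whole file, decoding its bytes with the given charset. */
    public static String readAll(File file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), cs))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /** Writes text to the file, encoding it with the given charset. */
    public static void writeAll(File file, String text, Charset cs) throws IOException {
        try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), cs))) {
            writer.write(text);
        }
    }

    /**
     * Stores original as UTF-8, reads it back with readCharset, applies a find-and-replace,
     * writes the result as UTF-8, and returns what a UTF-8 consumer would then see.
     */
    public static String process(String original, Charset readCharset) {
        try {
            File in = File.createTempFile("encoding-demo-in", ".xml");
            File out = File.createTempFile("encoding-demo-out", ".xml");
            try {
                writeAll(in, original, StandardCharsets.UTF_8);
                String text = readAll(in, readCharset).replace("old", "new");
                writeAll(out, text, StandardCharsets.UTF_8);
                return readAll(out, StandardCharsets.UTF_8);
            } finally {
                in.delete();
                out.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<t>old 20\u00B0C</t>";
        // Matching charsets on both sides: the degree sign is preserved.
        System.out.println(process(xml, StandardCharsets.UTF_8).equals("<t>new 20\u00B0C</t>"));
        // Reading with a single-byte charset: the degree sign comes back as U+00C2 U+00B0.
        System.out.println(process(xml, StandardCharsets.ISO_8859_1).equals("<t>new 20\u00C2\u00B0C</t>"));
    }
}
```

Passing the charset as a parameter keeps the mismatch visible in one place: the only difference between the working and the broken run is the charset handed to the reader.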
 
Don't get me started about those stupid light bulbs.