
Fun (well, not really..) with character encoding

+Pie Number of slices to send: Send
Here's the scenario:
I'm modifying (really just doing some find and replace) and saving XML files (at this stage, they're just text files for all intents and purposes - I'm not using any XML APIs), which are then used as input for XSLT transformation using the Saxon 6.5.5 processor.

Here's the problem:
If I use the following code to write the file:
FileWriter writer = new FileWriter(new File(outDir, inFile.getName()));
writer.write(tempString);
writer.flush();
writer.close();

when I later go to do the XSLT transformation on one particular file (it doesn't show up on any others.. reason unknown) I get the following exception:
Error at byte 122254 of file:/C:/testboardno/test_in/c14404000mn024/c14404000mn024/WP130700.xml:
Error reported by XML parser: bad continuation of multi-byte UTF-8 sequence (code: 0x3f)
javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: bad continuation of multi-byte UTF-8 sequence (code: 0x3f)
at com.icl.saxon.om.Builder.build(Builder.java:273)
at com.icl.saxon.Controller.transform(Controller.java:977)


I guessed that the problem is that the FileWriter wasn't outputting with the correct character encoding, as all of the XML files are declared as using UTF-8:
<?xml version="1.0" encoding="UTF-8"?>


So I tried replacing the FileWriter with the following code:
OutputStreamWriter writer = new OutputStreamWriter(new BufferedOutputStream( new FileOutputStream( new File(outDir, inFile.getName()) )), "UTF8");

And, in fact, this does solve the problem. The XSLT processor does not throw an exception. However, this creates a different problem. For some reason, whenever there is a "degree" character in the original file, ex:
°C
in the output, I get the following:
Â°C

Any ideas what's going on here, and how I can make this work?
BTW: Using java 1.4.2 on windows XP.
+Pie Number of slices to send: Send
The degree Celsius character is U+2103 in Unicode, so it can't be represented in 8 bits. That means it will occupy more than one byte in the file (three, in UTF-8). This is how UTF-8 works, and shouldn't cause a problem.
+Pie Number of slices to send: Send
 

Originally posted by Ulf Dittmer:
The degree Celsius character is U+2103 in Unicode, so it can't be represented in 8 bits. That means it will occupy more than one byte in the file (three, in UTF-8). This is how UTF-8 works, and shouldn't cause a problem.


It's actually just the degree character - I guess my example was a bit unintentionally misleading.

The original XML file (generated by Adobe FrameMaker) is encoded (or, at least, is declared to be encoded.. maybe that's the problem..) in UTF-8, and this character shows up fine in it..

[ UD: fixed URL ]
[ March 06, 2008: Message edited by: Ulf Dittmer ]
+Pie Number of slices to send: Send
While that character has a code below 256, it nonetheless has a two-byte UTF-8 representation: 0xC2 0xB0. And C2 is the Â character that you're seeing. So I'd say everything is still good, and you're looking at this using software that's not up to snuff Unicode-wise.
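
For anyone who wants to check the byte values, here is a minimal sketch. It assumes a modern JDK (java.nio.charset.StandardCharsets, Java 7 or later, so not the 1.4.2 mentioned above), and the class and method names are made up for illustration. ISO-8859-1 stands in for "a viewer that does not decode UTF-8".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original thread.
class DegreeBytes {
    // Formats bytes as space-separated upper-case hex, e.g. "C2 B0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // UTF-8 bytes of the given text, as hex.
    static String utf8Hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes as UTF-8, then decodes the bytes one-per-character as ISO-8859-1,
    // which is roughly what a non-UTF-8-aware viewer would display.
    static String misreadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00B0"));   // degree sign: C2 B0
        System.out.println(utf8Hex("\u2103"));   // degree Celsius: E2 84 83
        // The misread degree sign becomes two characters: U+00C2 (Â) followed by U+00B0 (°).
        String misread = misreadAsLatin1("\u00B0");
        for (char c : misread.toCharArray()) {
            System.out.println(String.format("U+%04X", (int) c));
        }
    }
}
```

The code points are printed as numbers rather than as characters so the result does not depend on the console's own encoding.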
+Pie Number of slices to send: Send
Figured out what was going wrong - I need to read the file in using UTF-8 as well.

I had:
BufferedReader reader = new BufferedReader(new FileReader(inFile));
which appears to be what was mangling the UTF-8 characters.

replaced with:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF8"));

It all makes so much sense now
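
In case it helps someone else, here is a rough in-memory sketch of what seems to have happened. It rests on a few assumptions: that the platform default encoding on that Windows XP machine was windows-1252 (which is what FileReader and FileWriter would then use), that a modern JDK is available (Java 7 or later), and that "Á" is only a hypothetical example of a character whose UTF-8 form contains a byte windows-1252 does not define. The class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class EncodingRoundTrip {
    // Decodes the input bytes with one charset and re-encodes them with another,
    // the same way a Reader/Writer pair copies a text file.
    static byte[] copy(byte[] input, Charset readAs, Charset writeAs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), readAs);
             Writer writer = new OutputStreamWriter(out, writeAs)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset cp1252 = Charset.forName("windows-1252"); // assumed platform default

        byte[] degreeC = "\u00B0C".getBytes(utf8); // C2 B0 43
        byte[] aAcute = "\u00C1".getBytes(utf8);   // C3 81 (hypothetical problem character)

        // Default reader + default writer: most bytes survive by accident...
        System.out.println(hex(copy(degreeC, cp1252, cp1252))); // C2 B0 43
        // ...but 0x81 is undefined in windows-1252, so it comes back as '?' (0x3F).
        System.out.println(hex(copy(aAcute, cp1252, cp1252)));  // C3 3F
        // Default reader + UTF-8 writer: each misread byte is encoded again.
        System.out.println(hex(copy(degreeC, cp1252, utf8)));   // C3 82 C2 B0 43
        // UTF-8 on both sides: bytes are preserved.
        System.out.println(hex(copy(degreeC, utf8, utf8)));     // C2 B0 43
        System.out.println(hex(copy(aAcute, utf8, utf8)));      // C3 81
    }
}
```

If the assumptions hold, the second case would explain why only one file tripped the parser (a 0xC3 lead byte followed by 0x3F is a bad continuation), and the third case would explain the extra character in front of the degree sign.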