NLS characters lost when storing XML from Java to filesystem

Jun 14, 2007 07:03:00

Hi,

I am storing an XML document that contains some non-ASCII characters from Java to the filesystem. I specify the XML document's encoding as UTF-8 and then save it to the filesystem.
But when I read the document back into Java, I find all the non-ASCII characters lost and represented as [???].

What can be done in such scenarios? My guess is that the machine where the XML document is stored needs to have UTF-8 as its default encoding. Please comment and guide.

If the above is true, it would be a pain to configure this on every machine where the application runs.

I expect there should be an easier solution for this. Any ideas or help would be appreciated.

Just to mention, this all applies to the case where the input to my APIs is a 'Reader' and I read the XML from that Reader. I do not have control over this.
[ June 14, 2007: Message edited by: Anupam Bhatt ]

Jun 14, 2007 08:40:00

The Reader needs to know the encoding if UTF-8 isn't the platform default; it won't learn this from the XML file itself. For example, you might use

Reader rdr = new InputStreamReader(new FileInputStream("filename"), "UTF-8");

There's no way to tell FileReader the encoding, alas.
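To make the idea concrete, here is a minimal sketch of a round trip that names the encoding explicitly on both the write and the read side, so the platform default never comes into play. The class name, method names, and temp-file name are made up for illustration, and the try-with-resources and StandardCharsets usage assumes Java 7 or later. (As far as I know, JDK 11 and later also added FileReader constructors that take a Charset, which did not exist when this thread was written.)

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original poster's API.
public class Utf8RoundTrip {

    // Encode the text as UTF-8 bytes, whatever the platform default is.
    public static void write(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Decode the file's bytes as UTF-8 and return the whole content.
    public static String read(File file) throws IOException {
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("nls-demo", ".xml");
        file.deleteOnExit();
        // Non-ASCII content written with escapes so the source file's own encoding does not matter.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Gr\u00fc\u00dfe \u65e5\u672c\u8a9e</name>";
        write(file, xml);
        System.out.println(xml.equals(read(file))); // prints true
    }
}
```

The point of the sketch is only that both ends agree on UTF-8; if either side falls back to the platform default, the round trip can silently corrupt non-ASCII characters.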

Jun 14, 2007 15:20:00

[EFH]: There's no way to tell FileReader the encoding, alas.

True. Though since JDK 5, it's been possible to use a Scanner instead, which allows you to specify an encoding quite easily.
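For anyone curious, a minimal sketch of the Scanner approach might look like the following. The class and method names are hypothetical; the `\\A` delimiter is just a common trick to make Scanner return the whole input as one token, and the try-with-resources syntax assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class showing Scanner with an explicit charset name.
public class ScannerEncodingDemo {

    // Read everything from the stream, decoding with the named charset.
    public static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            // "\\A" matches only at the start of input, so the whole
            // content comes back as a single token.
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        String original = "<name>caf\u00e9</name>";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(utf8), "UTF-8");
        System.out.println(original.equals(decoded)); // prints true
    }
}
```

Scanner also has a constructor taking a File and a charset name, which is the closer replacement for FileReader.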

However, since the goal is to read an XML file, I think it would probably be more useful to use an existing parser, such as Xerces or JDOM. XML parsers are responsible for reading the encoding specified within the document and using it, as well as for handling many other tasks that are probably more trouble than they're worth to handle yourself. There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it.
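As a rough illustration, here is a sketch that hands raw bytes to a parser and lets it pick the encoding from the XML declaration. It uses the JAXP DocumentBuilder that ships with the JDK rather than Xerces or JDOM directly, purely to keep the example dependency-free; the class and method names are invented for the example. The sample document is deliberately declared and encoded as ISO-8859-1 to show that nothing in the calling code assumes UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; JAXP stands in for the parsers named above.
public class ParserEncodingDemo {

    // Parse from bytes and return the text of the root element.
    // The parser reads the encoding from the XML declaration itself.
    public static String rootText(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        // The bytes really are ISO-8859-1, matching the declaration.
        byte[] bytes = xml.getBytes("ISO-8859-1");
        String text = rootText(new ByteArrayInputStream(bytes));
        System.out.println("caf\u00e9".equals(text)); // prints true
    }
}
```

Note that the caller never mentions a charset when parsing; that only works because the parser receives an InputStream rather than an already-decoded Reader.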

Jun 14, 2007 15:39:00

Originally posted by Jim Yingst:
There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it.

That's the ideal way to do it; give the parser a stream of bytes and let it deal with it in the standard XML way. But Anupam said:

Just to mention, this all applies to the case where the input to my APIs is a 'Reader' and I read the XML from that Reader. I do not have control over this.

And the problem occurs when the Reader has been created with an encoding that conflicts with the document's real encoding. So I would suggest that the design is the problem. It should be fixed by changing the design to accept an InputStream.
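A small sketch of that conflict, under the assumption that the document's bytes are UTF-8 while whoever built the Reader chose ISO-8859-1 (the class and method names are made up for the demonstration). By the time the API sees the characters, the damage is already done and the XML declaration can no longer help.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of a Reader built with the wrong charset.
public class ReaderMismatchDemo {

    // Decode the bytes through a Reader that uses the given charset.
    public static String decode(byte[] bytes, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "<name>caf\u00e9</name>";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Reader charset matches the real encoding: text survives.
        System.out.println(original.equals(decode(utf8, StandardCharsets.UTF_8)));      // prints true

        // Reader charset conflicts with the real encoding: each byte of the
        // two-byte UTF-8 sequence is decoded as a separate character.
        System.out.println(original.equals(decode(utf8, StandardCharsets.ISO_8859_1))); // prints false
    }
}
```

An API that accepted the InputStream instead would leave this decision to the parser, which is the design change being argued for here.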

I realize that it's common for lower-level staff to be made to work with "fait accompli" designs like this. Often this results in their producing convoluted work-arounds that were not foreseen by the designers. But I would prefer well-written software to software that places more importance on the office power structure. So: get the design changed.
