
NLS characters lost when storing XML from Java to the filesystem

Hi,

I am storing an XML document that contains some non-ASCII characters from Java to the filesystem. I specify the XML document's encoding as UTF-8 and then save it to the filesystem.
But when I read the document back into Java, all the non-ASCII characters are lost and appear as [???].

What can be done in such scenarios? My guess is that the machine hosting the XML document should have UTF-8 as its default encoding. Please comment and guide.

If that is true, it would be a pain to configure this on every machine where the application runs.

I expect there is an easier solution. Any ideas or help are appreciated.

Just to mention, this all applies to the case where the input to my APIs is a 'reader' and I read the XML from the reader. I do not have control over this.
[ June 14, 2007: Message edited by: Anupam Bhatt ]
The Reader needs to know the encoding if UTF-8 isn't the platform default; it won't learn this from the XML file itself. For example, you might use

Reader rdr = new InputStreamReader(new FileInputStream("filename"), "UTF-8");

There's no way to tell FileReader the encoding, alas.
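The same rule applies on the writing side: a Writer created without an explicit encoding uses the platform default, and characters that encoding cannot represent are typically written as '?'. Here is a minimal round-trip sketch that names UTF-8 on both sides. The class name Utf8RoundTrip, its methods and the sample text are made up for illustration. (Since Java 11, FileReader and FileWriter also have constructors that take a Charset, so the limitation above applies to older JDKs.)

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writes and reads text with an explicit encoding,
// so the platform default never comes into play.
class Utf8RoundTrip {

    // Encode the characters as UTF-8 bytes while writing.
    static void write(File file, String xml) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(xml);
        }
    }

    // Decode the bytes as UTF-8 while reading.
    static String read(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("nls", ".xml");
        // \u00e9 is e-acute, \u00d8 is O-slash; escapes keep this source file pure ASCII.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Jos\u00e9 \u00d8stergaard</name>";
        write(file, xml);
        System.out.println(xml.equals(read(file))); // prints true
        file.delete();
    }
}
```

The encoding named in the XML declaration has to match the one given to the OutputStreamWriter; writing encoding="UTF-8" in the declaration does not by itself change how the bytes are produced.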
[EFH]: There's no way to tell FileReader the encoding, alas.

True. Though since JDK 5, it's been possible to use a Scanner instead, which allows you to specify an encoding quite easily.
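For what it's worth, a rough sketch of the Scanner route might look like this. It relies on the Scanner(File, String charsetName) constructor; the class name ScannerRead and the "\\A" delimiter trick for slurping the whole file are just my choices for the example.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Hypothetical helper: reads a whole file as text using a named encoding.
class ScannerRead {

    static String readAll(File file, String charsetName) throws FileNotFoundException {
        try (Scanner sc = new Scanner(file, charsetName)) {
            // "\\A" matches only at the start of input, so the first token is the entire file.
            sc.useDelimiter("\\A");
            return sc.hasNext() ? sc.next() : "";
        }
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Expects the path of a UTF-8 encoded file as the first argument.
        System.out.println(readAll(new File(args[0]), "UTF-8"));
    }
}
```

Note that Scanner swallows IOExceptions during reading; you have to check ioException() yourself if you care about them.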

However, since the goal is to read an XML file, I think it would probably be more useful to use an existing parser, such as Xerces or JDOM. XML parsers are responsible for reading the encoding specified within the document and using it. They also handle many other tasks that are probably more trouble than they're worth to implement yourself. There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it.
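To illustrate letting the parser pick up the encoding, here is a small sketch using the JAXP DocumentBuilder that ships with the JDK, rather than a separately downloaded Xerces or JDOM. The document below is deliberately encoded as ISO-8859-1 and says so in its declaration; the code never names an encoding. The class name ParserDetectsEncoding and the sample content are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example: the parser gets raw bytes and works out the encoding itself.
class ParserDetectsEncoding {

    // Parse a byte stream and return the text content of the root element.
    static String rootText(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Jos\u00e9</name>";
        // The bytes really are ISO-8859-1, matching the declaration.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Jos\u00e9".equals(rootText(new ByteArrayInputStream(bytes)))); // prints true
    }
}
```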
 

Originally posted by Jim Yingst:
There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it.

That's the ideal way to do it; give the parser a stream of bytes and let it deal with it in the standard XML way. But Anupam said

Just to mention, this all applies to the case where the input to my APIs is a 'reader' and I read the XML from the reader. I do not have control over this.

And the problem occurs when the reader has been created with an encoding that conflicts with the document's real encoding. So I would suggest that the design is the problem. It should be fixed by changing the design to accept an InputStream.
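A sketch of why the Reader-based signature is fragile, again using the JDK's built-in JAXP parser as a stand-in for whatever parser the application really uses. The bytes are UTF-8, but the caller wraps them in a Reader that decodes as ISO-8859-1; once the parser is handed characters it has no way to undo that. All names here (ReaderVersusStream, textFromReader, textFromStream) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical comparison of a Reader-based API and an InputStream-based API.
class ReaderVersusStream {

    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Jos\u00e9</name>";

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Reader-based API: the bytes were already decoded by whoever built the Reader.
    static String textFromReader(Reader reader) throws Exception {
        return newBuilder().parse(new InputSource(reader)).getDocumentElement().getTextContent();
    }

    // InputStream-based API: the parser sees the bytes and applies the declared encoding.
    static String textFromStream(InputStream in) throws Exception {
        return newBuilder().parse(in).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = XML.getBytes(StandardCharsets.UTF_8);

        // Caller guessed the wrong encoding when creating the Reader.
        Reader wrong = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println("via Reader intact: " + "Jos\u00e9".equals(textFromReader(wrong)));
        System.out.println("via stream intact: " + "Jos\u00e9".equals(textFromStream(new ByteArrayInputStream(utf8))));
    }
}
```

With a Reader, correctness depends entirely on the caller having picked the right encoding; with an InputStream, the document's own declaration decides.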

I realize that it's common for lower-level staff to be made to work with "fait accompli" designs like this. Often this results in their producing convoluted work-arounds that were not foreseen by the designers. But I would prefer well-written software to software that places more importance on the office power structure. So: get the design changed.