
How to remove special/invalid characters before reading the XML file

 
Aditya Bhanose
Greenhorn
Posts: 5
I have functionality that reads an XML file, gets data from the nodes, and takes the appropriate action to execute the business logic. I am using the code below:

SAXBuilder builder = new SAXBuilder();
Document doc = builder.build(fileName);

The XML file I receive as input is modified by the end user. He adds data to it, and my functionality then loads that data into the system. In this scenario I am getting a JDOM parser exception: "Document root element is missing.."
The reason is that the editor used to modify the XML adds some characters at the beginning of the file, and the parser fails when it reads them.
Is there any way I can handle this problem, either by removing these characters or by ignoring them? The characters are not predictable; they might also be garbage values.



 
Joe Gilvary
Ranch Hand
Posts: 152
It's a brittle solution, but you can write something else to preprocess the file to strip away the "garbage" and produce a valid XML output. The XML classes in Java pretty much expect to find valid XML in their inputs.

If you can't change the process to avoid the prepended characters but you can detect some rhyme or reason to them, you can make this approach more robust.
 
Paul Clapham
Sheriff
Posts: 22472
The proper way to solve this problem would be to not have the user modifying XML files. It's easy for people to create malformed XML, even if they do know the rules for well-formed XML. Which most people don't.

I don't believe your users are keying in those few non-printable characters which are not allowed in XML documents. I think this "garbage" you're talking about is more likely to be missing ">" characters at the end of tags or unescaped ampersands or just plain fat-finger mistyping. There's no way to fix that sort of thing via a preprocessor.

But if you're stuck with people modifying XML by hand, then give them an XML editor to do that with. Don't let them use Notepad or something like that. Make sure they send you well-formed XML.
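Typos such as a missing ">" or an unescaped ampersand can't be repaired automatically, but they can at least be reported back to the user with a position before the business logic runs. A minimal sketch, assuming the JDK's built-in DOM parser is acceptable as a pre-check; the class and method names here are hypothetical and not part of JDOM:

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical pre-check; the class and method names are illustrative. */
final class WellFormedCheck {
    /** Returns null if the stream is well-formed XML, else a description of the first error. */
    static String firstError(InputStream in) throws IOException {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(in); // throws on the first well-formedness error
            return null;
        } catch (SAXParseException e) {
            // SAXParseException carries the position of the problem in the input
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException e) {
            return e.toString();
        }
    }
}
```

The returned message could be shown to whoever edited the file, so they can correct it and resubmit.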
 
Vinod Borole
Greenhorn
Posts: 26
I guess the special characters you are describing are the Byte Order Mark (BOM). These are characters inserted by the editor at the start of the file when you edit the XML. The BOM is basically used to indicate the encoding of the XML file.

Maybe the XML parser you are using does not understand the BOM and throws an exception.

Does anybody have any thoughts about this?
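One way to test the BOM theory is to look at the first few bytes of a failing file. A minimal sketch, assuming only the UTF-8 and UTF-16 marks matter (UTF-32 is ignored); BomSniffer and detectBom are hypothetical names, not part of JDOM or the JDK:

```java
/** Hypothetical diagnostic helper; not part of JDOM or the JDK. */
final class BomSniffer {
    /** Returns the name of the BOM at the start of data, or "none". */
    static String detectBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    /** Compares the leading bytes of data, as unsigned values, with prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing it the result of Files.readAllBytes on a file that fails to parse would show whether a BOM, rather than arbitrary garbage, is what precedes the root element.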
 
Paul Clapham
Sheriff
Posts: 22472
Vinod Borole wrote: I guess the special characters you are describing are the Byte Order Mark (BOM). These are characters inserted by the editor at the start of the file when you edit the XML.

This is definitely a possibility. Notepad has a bad habit of doing that. So if you gave people a proper XML editor to edit their XML with, that wouldn't happen.
 
Aditya Bhanose
Greenhorn
Posts: 5
Yes, my problem is only the byte order mark. I think there is no way we can handle this in code.
I need to provide a proper XML editor to end users.
 
Paul Clapham
Sheriff
Posts: 22472
Aditya Bhanose wrote: Yes, my problem is only the byte order mark. I think there is no way we can handle this in code.
Sure you can handle it in code. You just wrap the InputStream that contains the XML (with possible byte order mark) in another InputStream which skips over anything preceding the first "<" character when it's created. PushbackInputStream is a useful basis for that.
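The wrapping idea could look roughly like the sketch below. It assumes the file is in an encoding where "<" is the single byte 0x3C (UTF-8 or ISO-8859-1, for example); it would corrupt UTF-16 input. XmlPrefixSkipper and skipToFirstTag are hypothetical names chosen for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper; the class and method names are illustrative. */
final class XmlPrefixSkipper {
    /**
     * Discards bytes until the first '<' and returns a stream that starts at it.
     * Assumes '<' is encoded as the single byte 0x3C (e.g. UTF-8, ISO-8859-1).
     */
    static InputStream skipToFirstTag(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 1);
        int b;
        while ((b = pin.read()) != -1) {
            if (b == '<') {
                pin.unread(b); // put the '<' back so the parser still sees it
                break;
            }
        }
        return pin; // at end of stream if no '<' was found
    }
}
```

With the original code, the wrapped stream would be passed to the builder instead of the file name, along the lines of builder.build(XmlPrefixSkipper.skipToFirstTag(new FileInputStream(fileName))). This only removes what comes before the first tag; it does nothing for errors further into the document.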

But don't underestimate the ability of end-users to botch up XML documents in other ways. I did it myself yesterday -- I made a little change to a configuration file which caused my application to fail to start up correctly.
 