How to avoid special characters while reading xml through java

 
balamurugan velliambalam
Greenhorn
Posts: 20
In my project we read a lot of XML files, and in some places those files contain special characters. I tried to catch them with the code below:
catch (SAXParseException e)
{
    // Report where the parser stopped, then rethrow
    System.out.println("Public ID:" + e.getPublicId());
    System.out.println("System ID:" + e.getSystemId());
    System.out.println("Line NO:" + e.getLineNumber());
    System.out.println("Column NO:" + e.getColumnNumber());
    System.out.println("Error MSG:" + e.getMessage());
    e.printStackTrace();
    throw e;
}
But this only throws errors for unbalanced < and > and related XML errors, not for special characters. Please help me remove the special characters from the XML using Java.
 
Lester Burnham
Rancher
Posts: 1337
What is a "special character" according to your definition? That code handles exceptions, not any particular content that gets parsed. Where did you put it in your parsing code?
 
Paul Clapham
Marshal
Posts: 28289
If you mean to say you have XML which is not well-formed because whoever created it didn't escape "<" and ">" in text nodes correctly, the answer is you can't fix that using Java. It's the responsibility of whoever creates an XML document to ensure that it is well-formed, and it's one of the specific design features of XML that parsers are not required to fix up bad data.
 
balamurugan velliambalam
Greenhorn
Posts: 20
Hello Lester Burnham,
The code above only throws exceptions for errors in the XML structure, not for special characters such as the inverted question mark (¿¿¿¿), the inverted exclamation mark, etc. I put the catch clause after the one for IOException. If anything above is unclear, please let me know.
 
balamurugan velliambalam
Greenhorn
Posts: 20
Hello Paul Clapham,

The XML is well-formed, but the documents come from an external source and contain special characters like the ones on the marked line below:

<?xml version="1.0"?>
<note>
<to>Tove</to>¿¿¿¿¿¿¿ ---> SAXParseException is not thrown for this line, so how do I find and remove these characters from my XML?
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
 
Lester Burnham
Rancher
Posts: 1337
Why do you expect to get exceptions for those characters? XML can contain just about any character - that's not cause for any exceptions. If those characters are not supposed to be there, talk to the producer of the XML to fix that.
 
balamurugan velliambalam
Greenhorn
Posts: 20

Lester Burnham wrote:Why do you expect to get exceptions for those characters? XML can contain just about any character - that's not cause for any exceptions. If those characters are not supposed to be there, talk to the producer of the XML to fix that.



But the producer of the XML is our client, so we can't expect them to fix it. Please suggest an alternative way to find those special characters in the XML file.
 
Peter Taucher
Ranch Hand
Posts: 174

Lester Burnham wrote:XML can contain just about any character - that's not cause for any exceptions.


I don't fully agree with that. The parser usually is designed to read a defined encoding. In the provided example no encoding is specified. If I try to parse it as 'UTF-8' I'll get an exception and cannot read the document. If I try to parse it as 'ISO-8859-1' (because it is ANSI encoded) there's no problem at all reading the file. So specifying an xml encoding is always a good idea.


And if the document needs to be 'UTF-8' decoded, but it isn't, then you'll have to talk to the document author and tell him to provide documents with the correct encoding...
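The encoding point can be checked with a small experiment. This is only a sketch under assumptions: the class name `EncodingCheck` and the sample document are made up for illustration, and the bytes are assumed to be ISO-8859-1 (where ¿ is the single byte 0xBF). It parses the same bytes once with the parser's default guess and once with the encoding set explicitly through `InputSource.setEncoding`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original thread.
class EncodingCheck {

    /** Returns true if the bytes parse cleanly; pass null to let the parser pick the encoding. */
    static boolean parses(byte[] xml, String encoding) throws ParserConfigurationException {
        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        if (encoding != null) {
            source.setEncoding(encoding); // tell the parser how the byte stream is encoded
        }
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler());
            return true;
        } catch (SAXException | IOException e) {
            // Depending on the parser, a bad byte sequence surfaces as either type.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String doc = "<?xml version=\"1.0\"?><note><to>Tove</to>\u00BF\u00BF</note>";
        byte[] latin1 = doc.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("no encoding given: " + parses(latin1, null));
        System.out.println("ISO-8859-1:        " + parses(latin1, "ISO-8859-1"));
    }
}
```

With the JDK's bundled parser, the first call should fail, because a lone 0xBF byte is not valid UTF-8, and the second should succeed. Note that the failure may arrive as an IOException rather than a SAXParseException, which would explain a catch block for SAXParseException never seeing it.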
 
Lester Burnham
Rancher
Posts: 1337

balamurugan velliambalam wrote: But the producer of the XML is our client, so we can't expect them to fix it. Please suggest an alternative way to find those special characters in the XML file.


Even a client has to adhere to the predefined rules on how data is to be delivered. Either the data is in the format it's supposed to be in, in which case you'll have to deal with it, or it isn't, in which case the producer needs to fix it. But you still haven't told us what a special character is according to your definition: any non-ASCII character? That would be easy to detect and remove in the SAX characters method.
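A minimal sketch of that idea, assuming "special" really does mean "any non-ASCII character" (the thread never settles this). The handler name `AsciiOnlyHandler` and its helper methods are hypothetical; only `DefaultHandler.characters` is the real SAX callback.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text content and drops every char above 127.
class AsciiOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int dropped = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser hands over a slice of its buffer; only [start, start + length) is ours.
        for (int i = start; i < start + length; i++) {
            if (ch[i] < 128) {
                text.append(ch[i]);
            } else {
                dropped++; // counts UTF-16 chars, so a surrogate pair counts twice
            }
        }
    }

    String text() { return text.toString(); }

    int dropped() { return dropped; }

    static AsciiOnlyHandler parse(String xml) throws Exception {
        AsciiOnlyHandler handler = new AsciiOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        AsciiOnlyHandler h =
                parse("<note><to>Tove</to>\u00BF\u00BF\u00BF<from>Jani</from></note>");
        System.out.println(h.text() + " (dropped " + h.dropped() + ")");
        // prints: ToveJani (dropped 3)
    }
}
```

This only works once the document can be decoded at all, so the encoding question has to be answered first. It also throws away legitimate accented letters, which may or may not be acceptable for the data at hand.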

Peter Taucher wrote:

Lester Burnham wrote:XML can contain just about any character - that's not cause for any exceptions.


I don't fully agree with that. The parser usually is designed to read a defined encoding. In the provided example no encoding is specified. If I try to parse it as 'UTF-8' I'll get an exception and cannot read the document. If I try to parse it as 'ISO-8859-1' (because it is ANSI encoded) there's no problem at all reading the file. So specifying an xml encoding is always a good idea.


Agreed. I was assuming that the document is valid according to its stated encoding (or UTF-8 in the case of no encoding and no BOM). If it's not then that's something to be fixed by the producer. But we don't know what those characters really are (that information likely got lost somewhere along the way from the original XML file to this web page), or what makes them "special", so this is mostly conjecture.
 
David Newton
Author
Posts: 12617
We get lots of external files; we have a process that removes garbage from them before the XML processing itself. Sometimes this causes its own set of issues, but we can't rely on file producers to do the right thing, so we assume the risk of causing a different sort of issue ourselves. Just another option.
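What such a cleaning step looks like depends on what counts as garbage, which the post does not say. As one possible reading, the sketch below removes only characters that XML 1.0 itself forbids (anything outside the "Char" production of the specification). The class and method names are hypothetical, and this is not a description of the process mentioned above.

```java
// Hypothetical pre-processing helper, applied to the raw text before parsing.
class XmlCharCleaner {

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of raw with every disallowed code point removed. */
    static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        raw.codePoints()
                .filter(XmlCharCleaner::isXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL and backspace are not legal in XML 1.0; the inverted question mark is.
        String raw = "<to>Tove\u0000\u0008\u00BF</to>";
        System.out.println(clean(raw));
    }
}
```

Characters such as ¿ survive this filter because they are perfectly legal XML. Removing them would need an additional, application-specific rule.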
 
Author and all-around good cowpoke
Posts: 13078
A frequent bane of XML documents is MS Word "smart punctuation": characters that typically arrive as Windows-1252 bytes, which are not valid UTF-8 and map to control codes in ISO-8859-1.

Bill
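If the stray bytes do turn out to be Word punctuation, one option is to decode the input as Windows-1252 and then replace the curly characters with plain ASCII. This is a sketch under that assumption: the class and method names are hypothetical, and the replacement table covers only the most common cases.

```java
import java.nio.charset.Charset;

// Hypothetical helper for Word-style punctuation.
class SmartPunctuation {

    /** Decodes bytes as Windows-1252, where 0x91-0x94 are the curly quotes. */
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    /** Replaces common "smart" punctuation with ASCII look-alikes. */
    static String toAscii(String s) {
        return s.replace('\u2018', '\'')   // left single quote
                .replace('\u2019', '\'')   // right single quote / apostrophe
                .replace('\u201C', '"')    // left double quote
                .replace('\u201D', '"')    // right double quote
                .replace("\u2013", "-")    // en dash
                .replace("\u2014", "--")   // em dash
                .replace("\u2026", "..."); // ellipsis
    }

    public static void main(String[] args) {
        byte[] fromWord = {(byte) 0x93, 'H', 'i', (byte) 0x94};
        System.out.println(toAscii(decodeCp1252(fromWord)));
    }
}
```

Decoding the same bytes as ISO-8859-1 would instead yield the control characters U+0093 and U+0094, which is one way such input ends up looking like garbage.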
 
David Newton
Author
Posts: 12617
No doubt :(

We also have onsite QA that cut-and-paste XML payloads from spec docs (Word) into a test page then complain that it doesn't pass validation :/
 
Ranch Hand
Posts: 2187

But the producer of the XML is our client, so we can't expect them to fix it. Please suggest an alternative way to find those special characters in the XML file.



1. Write a Korn Shell or Perl script that will read file, remove unwanted character(s), and create new clean file.

2. Pass new clean file to Java-based data processing application.
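The same two steps can also stay inside Java. The sketch below is one way to do step 1, assuming the target encoding is UTF-8 and that "unwanted" means bytes that are not valid UTF-8. The class name `CleanFileStep` and the file names in the usage comment are hypothetical.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for the shell or Perl script.
class CleanFileStep {

    /** Step 1: copy in to out, silently dropping bytes that are not valid UTF-8. */
    static void clean(Path in, Path out) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try (Reader reader = new InputStreamReader(Files.newInputStream(in), decoder);
             Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java CleanFileStep dirty.xml clean.xml
        clean(Paths.get(args[0]), Paths.get(args[1]));
        // Step 2: hand the clean file to the existing XML processing.
    }
}
```

Dropping bytes silently is a deliberate trade-off here. Switching the decoder to `CodingErrorAction.REPORT` would instead fail on the first bad byte, which is useful for finding out which files are affected.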
 