Unescaped Ampersand

 
Drew Lane
Ranch Hand
Posts: 296
Hi,

I'm trying to parse some RSS feeds using the DocumentBuilder class.

Some of the feeds have unescaped ampersands in the elements and this is causing problems.

For example, the title element might have the value "Bob & His Dog" which gets parsed as "Bob " because of the ampersand.

I have no control over the feeds.

What is the best way to handle this?

Thanks,

Drew
 
Paul Clapham
Sheriff
Posts: 24594
Well, unfortunately, RSS is supposed to be an XML dialect but in real life there is RSS-creation software that isn't as good as it should be. Or maybe people are using text editors to write RSS. Anyway, the end result is malformed XML.

You could just reject it (that would be The XML Way™). The alternative is to stop using an XML parser and use something else. I've heard about ROME for parsing RSS in Java but haven't ever used it myself.
 
Drew Lane
Ranch Hand
Posts: 296
How would I reject it? It's not throwing an error.

Could I just reject the one element with the &, or do you mean leaving out the whole feed (not a good solution)?

-Drew
 
Paul Clapham
Sheriff
Posts: 24594

Originally posted by Drew Lane:
How would I reject it? It's not throwing an error.

It's not throwing an exception? Then you don't have malformed XML and therefore you don't have an unescaped ampersand.
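
To check this for yourself, here is a minimal sketch using DocumentBuilder. The class and method names (AmpersandCheck, isWellFormed) are made up for illustration, and it assumes the default JDK parser with no special configuration. A bare & in element text is a well-formedness error, so the parser should report a fatal error instead of returning a truncated title:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class AmpersandCheck {

    // Returns true if the XML parses, false if the parser reports a fatal
    // (well-formedness) error such as an unescaped ampersand.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what we are testing for.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<title>Bob & His Dog</title>"));     // false
        System.out.println(isWellFormed("<title>Bob &amp; His Dog</title>")); // true
    }
}
```

If the feed parses without an exception, the ampersand in it is escaped (for example as &amp;amp;), and the truncated title has a different cause.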

My suspicion is that you're using a SAX parser and making the common error of assuming that the characters() method in your ContentHandler will always be passed the complete contents of a text node. That isn't the case. The parser is allowed to break text into pieces and call characters() once for each piece.
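
A small sketch for seeing the pieces for yourself, assuming the SAX parser that ships with the JDK. ChunkDemo and chunks are names invented for this example. The handler records every piece passed to characters(), so you can print them and see where the parser split the text. How many pieces you get is up to the parser; the only thing you can rely on is that they add up to the full text:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class ChunkDemo {

    // Parses the XML with SAX and returns each piece of text that was
    // passed to characters(), in the order the parser delivered them.
    static List<String> chunks(String xml) {
        List<String> pieces = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                pieces.add(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<String> pieces = chunks("<title>Bob &amp; His Dog</title>");
        // The number of pieces depends on the parser implementation.
        System.out.println(pieces.size() + " piece(s): " + pieces);
        // Joined together they always give the complete text.
        System.out.println(String.join("", pieces));
    }
}
```

A handler that keeps only the text from one characters() call would end up with a single piece, which would look exactly like a title that was cut off at the ampersand.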
 
Drew Lane
Ranch Hand
Posts: 296
Holy Cow!

You were right!

You rawk dude. :-)

Thanks,

-Drew
 
Greenhorn
Posts: 5
I'm having a similar problem with my XML parsing; it just stops whenever an ampersand is hit. How can I modify the characters() method to ignore that case? Any sample code would be greatly appreciated.
 
Paul Clapham
Sheriff
Posts: 24594
You can't change how the parser calls the characters() method; that is up to the parser. If you want all of the character data inside an element collected together, your own code has to do the collecting. See the first question in the XmlFaq.
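
Since sample code was asked for, here is one common way to do the collecting, as a sketch and not the only approach. The names TitleCollector and titlesOf are invented for this example, and it assumes you only care about title elements and are using a non-namespace-aware parser (the SAXParserFactory default), so the element name arrives in qName. The idea is to start a buffer in startElement(), append in every characters() call, and read the buffer only in endElement():

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of any library.
class TitleCollector extends DefaultHandler {

    private final List<String> titles = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <title> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            text = new StringBuilder(); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length); // may be called several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && text != null) {
            titles.add(text.toString()); // the element's text is complete only now
            text = null;
        }
    }

    // Convenience wrapper: parse a string and return the text of every <title>.
    static List<String> titlesOf(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(titlesOf("<rss><title>Bob &amp; His Dog</title></rss>"));
        // prints [Bob & His Dog]
    }
}
```

Note that this only helps when the ampersand in the feed is escaped. A bare & is still malformed XML and the parser will still stop with an exception.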
 