kXml - Preventing expanding entity references in attribute values

 
Ranch Hand
Posts: 350
Hi

I am using kXml to parse a well-formed HTML page, and I am running into a problem because this parser expands entity references in attribute values.
Since the page I am parsing is an HTML page, it contains something along these lines:

...href="http://foobar.com/FooToos.aspx?ito=4912&itc=0"...

As you can see, the parser reads the attribute value, reaches &itc=0, and assumes it is the beginning of an entity reference. It then falls over because it never finds the terminating ; and complains that it could not resolve &itc.

But, as you can see, that is not an entity reference; it is a parameter passed to the page FooToos.aspx.
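
To see the failure in isolation, here is a minimal sketch that feeds the same attribute to the JDK's built-in XML parser. That parser is only a stand-in for kXml (an assumption, chosen so the example runs without extra libraries), and the names `EntityDemo` and `parses` are hypothetical. A conforming XML parser is expected to reject a bare & in an attribute value in much the same way.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class EntityDemo {

    // Returns true if the JDK parser accepts the markup, false if it
    // reports a well-formedness error. (Stand-in for kXml, see above.)
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // markup is not well formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bare = "<a href=\"http://foobar.com/FooToos.aspx?ito=4912&itc=0\"/>";
        String escaped = "<a href=\"http://foobar.com/FooToos.aspx?ito=4912&amp;itc=0\"/>";
        System.out.println("bare ampersand parses:    " + parses(bare));
        System.out.println("escaped ampersand parses: " + parses(escaped));
    }
}
```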

So, coming back to my question:

Has anyone modified the kXml source code so that it stops trying to be too smart and no longer expands every entity reference it encounters in attribute values?
 
Greenhorn
Posts: 16
I had a similar problem with my XHTML page, and it turned out to be a bug on my side. I would say that your file is not well formed. Every literal "&" must be escaped as "&amp;", even within attribute values. The W3C validator will not like your page either.

If you want to use a literal ampersand in your document, you must encode it as "&amp;" (even inside URLs!).
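
When the markup comes from someone else, one possible workaround is to escape bare ampersands before handing the text to the parser. The sketch below is a heuristic, not part of kXml: the class `AmpersandEscaper` and its method are hypothetical names, and the pattern assumes that anything shaped like `&name;`, `&#123;` or `&#x1F;` is an intentional reference and should be left alone.

```java
import java.util.regex.Pattern;

final class AmpersandEscaper {

    // Matches an '&' that does NOT begin a complete reference
    // (named, decimal, or hexadecimal) terminated by ';'.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
        "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Replaces every bare '&' with "&amp;" and leaves existing references untouched.
    static String escapeBareAmpersands(String markup) {
        return BARE_AMPERSAND.matcher(markup).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String href = "href=\"http://foobar.com/FooToos.aspx?ito=4912&itc=0\"";
        System.out.println(escapeBareAmpersands(href));
    }
}
```

Note that HTML-only named entities in the page are left as they are, so an XML parser may still need to be told about them separately. A query string that happens to look like a reference (a parameter name followed directly by a semicolon) would also slip through unescaped.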


[ July 11, 2004: Message edited by: Alexander Traud ]
 
Vivek Viswanathan
Ranch Hand
Posts: 350
Hi

Thanks for the reply. The only problem in my case is that the HTML page I am parsing was not developed by me. It belongs to another web site, so I have no control over the HTML they generate.

Vivek
 
wrangler
Posts: 30
It ought to be pretty straightforward to write your own HTML-to-XHTML converter (a servlet, CGI page, etc.) using something off-the-shelf like Tidy or JTidy. O'Reilly has something along those lines here: http://www.oreillynet.com/network/2000/04/28/feature/index.csp.

Off the top of my head, it doesn't sound like a great deal of work.
For an HTTP GET, it is probably relatively straightforward and perhaps
with some googling one could find lots of helper packages and APIs
for converting HTML to XHTML.

Theoretically, a MIDlet could do the conversion to XHTML too,
but MIDlet size and memory (e.g. for large HTML pages) might be
problematic.

Your mileage may vary :-).
james
 
Vivek Viswanathan
Ranch Hand
Posts: 350
Cheers Mate.

But I had a talk with the site that provides the HTML page and asked whether they could give me a web service instead of having me parse HTML pages. They told me they can provide an XML reply (rather than HTML).
So I am back in the game now, parsing the XML document, though I had to ditch all the old code I had written to parse the HTML page.
 