
HTML Parser without Correcting HTML

 
Muhammad Zaheer Ahmad
Greenhorn
Posts: 7
Hello everyone,
I have a dump of HTML files, and I want to identify the files that contain incorrect attributes. In one case, for example, a div has the class attribute twice:
<div class="test1" .... class="test2">......</div>
The HTML is generated by a CMS, so I have no control over it. I just want to parse it without correcting it. The normal DOM and SAX parsers fail because of the incorrect HTML. jsoup parses it but also corrects the errors, so after parsing the errors are gone and I cannot tell which files were incorrect.
Can anybody suggest an API or an approach for doing this?
Thanks,
Zaheer
 
Winston Gutkowski
Bartender
Posts: 10575
Muhammad Zaheer Ahmad wrote: I have a dump of HTML files, and I want to identify the files that contain incorrect attributes. In one case, for example, a div has the class attribute twice:
...
The HTML is generated by a CMS, so I have no control over it. I just want to parse it without correcting it. The normal DOM and SAX parsers fail because of the incorrect HTML. jsoup parses it but also corrects the errors, so after parsing the errors are gone and I cannot tell which files were incorrect.

So, what do you want to do? Parse, or find out which files are incorrect? You can't do both.

JTidy might be the quickest solution if you simply want to scan for errors, and it will also produce output that can be parsed. But you can't do both in one step, because pretty much any parser will require your HTML to be well-formed and correct.
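
If the goal is only to flag the files that have the duplicated-attribute problem, one option is to scan the raw text and never build a DOM at all. What follows is a rough sketch of that idea using nothing but java.util.regex. It is not JTidy or any other library mentioned in this thread, and the class name DuplicateAttributeScanner and the method findDuplicates are made up for illustration. It assumes reasonably ordinary markup: a "<" inside comments, scripts, or attribute values can confuse it, so treat the result as a hint, not a validation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports attributes that occur more than once in a start tag.
class DuplicateAttributeScanner {

    // Attribute name: anything up to whitespace, quotes, angle brackets, "/" or "=".
    private static final String NAME = "[^\\s\"'<>/=]+";
    // Attribute value: double-quoted, single-quoted, or unquoted.
    private static final String VALUE = "(?:\"[^\"]*\"|'[^']*'|[^\\s\"'<>]+)";
    private static final String ATTR = NAME + "(?:\\s*=\\s*" + VALUE + ")?";

    // Group 1 is the tag name, group 2 is the raw attribute text of that tag.
    private static final Pattern START_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)((?:\\s+" + ATTR + ")*)\\s*/?>");
    // Group 1 is a single attribute name; the value, if any, is consumed but ignored.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("(" + NAME + ")(?:\\s*=\\s*" + VALUE + ")?");

    // Returns one "tag: attribute" entry for each repeated occurrence of an attribute.
    static List<String> findDuplicates(String html) {
        List<String> duplicates = new ArrayList<>();
        Matcher tag = START_TAG.matcher(html);
        while (tag.find()) {
            Set<String> seen = new HashSet<>();
            Matcher attr = ATTRIBUTE.matcher(tag.group(2));
            while (attr.find()) {
                // HTML attribute names are case-insensitive, so compare in lower case.
                String name = attr.group(1).toLowerCase(Locale.ROOT);
                if (!seen.add(name)) {
                    duplicates.add(tag.group(1).toLowerCase(Locale.ROOT) + ": " + name);
                }
            }
        }
        return duplicates;
    }

    // Usage: java DuplicateAttributeScanner file1.html file2.html ...
    // Assumes the files are UTF-8; adjust the charset for your dump.
    public static void main(String[] args) throws IOException {
        for (String file : args) {
            String html = new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.UTF_8);
            List<String> found = findDuplicates(html);
            if (!found.isEmpty()) {
                System.out.println(file + " -> " + found);
            }
        }
    }
}
```

Files that print nothing had no repeated attributes that this pattern could see. The files it does list can then be looked at by hand or passed to a real validator.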

This page also contains some validators that you might find useful.

Winston
 
Ahsan Bagwan
Ranch Hand
Posts: 254
Winston's advice is pretty good.

Apart from that, I have used HtmlCleaner and found it easy to grok.
 