Creating a DOM from html

 
Gautam Velpula
Greenhorn
Posts: 13
I am trying to create a DOM from an HTML source.
That doesn't work because many of the tags are unbalanced.
I ran the source through a tag balancer, which closed all the tags, but I am a little skeptical about the correctness of the resulting HTML: will it display the same page in the browser, or will the added closing tags garble the layout?
Anyway, now that I have this balanced XHTML document, I want to convert it to a DOM structure that I can use for traversal, search, etc.
Any suggestions?
Thanks in advance
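
For the last step (balanced XHTML to a DOM you can traverse and search), here is a minimal sketch using only the JDK's built-in XML parser and XPath. The class name XhtmlDom and its methods are made up for this example, and it assumes the input has no DOCTYPE, no xmlns declaration and no HTML-only entities such as &nbsp;, any of which needs extra parser setup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library mentioned in this thread.
class XhtmlDom {
    // Parse a well-formed XHTML string into a W3C DOM Document.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Traversal: collect element names depth-first, in document order.
    static List<String> elementNames(Node node) {
        List<String> names = new ArrayList<>();
        collect(node, names);
        return names;
    }

    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, names);
        }
    }

    // Search: use XPath to find the href of every <a> element.
    static List<String> hrefs(Document doc) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                out.add(hits.item(i).getNodeValue());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hi <a href=\"a.html\">one</a></p>"
                + "<a href=\"b.html\">two</a></body></html>");
        System.out.println(elementNames(doc));
        System.out.println(hrefs(doc));
    }
}
```

If the balanced file still fails to parse here, the parser's exception message usually names the line and column of the first well-formedness problem.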
 
Rob Spoor
Sheriff
Posts: 20669
Some HTML files can never be properly balanced, since HTML is much too lenient. For instance, it allows tags to overlap, as when a <b> is closed while an <i> that was opened inside it is still open.

So how would you do this? Once you encounter the </b>, close the <i> too and then start another <i>? That would do it, but there are other real-life HTML documents that will be much, MUCH harder.
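
The close-and-reopen idea above can be sketched in a few lines of Java. This is only an illustration: the class name OverlapFixer and the method rebalance are made up for this example, and the regular expression only recognizes attribute-free tags such as <b> and </i>, so it is nowhere near a real HTML tokenizer.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: fixes overlapping tags by closing and reopening inner tags.
class OverlapFixer {
    // Matches only attribute-free tags such as <b> or </i>.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String rebalance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost open tag is at the head
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // copy the text between tags unchanged
            last = m.end();
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                // Start tag: remember it and emit it.
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // End tag: first close everything opened after the matching start tag.
                Deque<String> reopen = new ArrayDeque<>();
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.append("</").append(inner).append('>');
                    reopen.push(inner);
                }
                open.pop();
                out.append("</").append(name).append('>');
                // Then reopen the interrupted tags in their original order.
                while (!reopen.isEmpty()) {
                    String inner = reopen.pop();
                    out.append('<').append(inner).append('>');
                    open.push(inner);
                }
            }
            // An end tag with no matching start tag is dropped.
        }
        out.append(html.substring(last));
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rebalance("<b>bold <i>both</b> italic</i>"));
    }
}
```

For the input <b>bold <i>both</b> italic</i> this prints <b>bold <i>both</i></b><i> italic</i>. Void elements such as <br> are not special-cased, which is one of the many things that make real documents harder.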

Once you've done this (I suggest taking a look at javax.swing.text.html.parser.ParserDelegator, with a custom callback instance) you can use libraries like JDOM for creating the DOM tree.
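
A rough sketch of the ParserDelegator route with a custom callback follows. To stay within the JDK it builds an org.w3c.dom tree rather than a JDOM one, which is a substitution on my part; the class name SwingHtmlToDom is made up, attributes are ignored for brevity, and a tag name that is not a legal XML name would make createElement throw.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch: feeds Swing's lenient HTML parser into a W3C DOM tree.
class SwingHtmlToDom {
    static Document parse(String html) {
        final Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private Node current = doc; // node that receives the next child

            private Element append(HTML.Tag tag) {
                Element e = doc.createElement(tag.toString());
                // A Document accepts only one root element, so any later
                // top-level tag is attached under the existing root.
                Node parent = (current == doc && doc.getDocumentElement() != null)
                        ? doc.getDocumentElement() : current;
                parent.appendChild(e);
                return e;
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                current = append(tag);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // End tags of unknown elements arrive here flagged with ENDTAG; skip them.
                if (attrs.isDefined(HTML.Attribute.ENDTAG)) {
                    return;
                }
                append(tag); // empty element such as <br>: no descent
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Walk up to the matching open element; ignore the end tag if there is none.
                for (Node n = current; n != null && n != doc; n = n.getParentNode()) {
                    if (n.getNodeName().equals(tag.toString())) {
                        current = n.getParentNode();
                        return;
                    }
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != doc) { // a Document cannot hold text directly
                    current.appendChild(doc.createTextNode(new String(data)));
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hello <b>world</p></body></html>");
        System.out.println(doc.getElementsByTagName("b").item(0).getTextContent());
    }
}
```

The Swing parser follows an old HTML 3.2 DTD, so the tree it implies may differ from what a current browser would build for the same page.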
 
Ulf Dittmer
Rancher
Posts: 42968
Welcome to JavaRanch.

There are a number of libraries that convert HTML, even HTML that is not well-formed, into DOM documents. Search for TagSoup, JTidy and CyberNeko in particular.
 
Gautam Velpula
Greenhorn
Posts: 13
I have used NekoHTML to balance the tags.
The resulting file is still an unparsable document.

Do you think trying to create an HTML DOM is a wise direction, or should I look for alternatives instead?

Thanks for the replies.
 
Ulf Dittmer
Rancher
Posts: 42968
I guess NekoHTML is a bit weaker than the others; both JTidy and TagSoup claim to produce DOM objects.
 
Gautam Velpula
Greenhorn
Posts: 13
I will try them. Thanks
 