Parsing HTML and extracting HTML tables in Java

 
swaroop rath
Greenhorn
Posts: 11
Hi

I am looking for a Java API that can parse HTML and extract the tables from it. Please recommend a solution.

Regards
Swaroop
 
Sean Clark
Rancher
Posts: 377
Android Java Spring
Hey,

I have used JTidy to do this in the past. It gives you a nice DOM, which you can then pull your data from using parsers or XPath.
It worked well for what I used it for, and it can also deal with malformed HTML (though to what extent I don't know).
Hope that helps.

Sean
 
swaroop rath
Greenhorn
Posts: 11
Thanks a lot, Sean. That is a lot of help indeed. Do I need to provide the full XPath of each HTML table, or can I query for tables using an XPath expression such as "//table"?

I am looking for something that supports the latter, so that I do not have to walk the whole DOM myself and it gives me all the tables present at once.

Regards
Swaroop
 
Sean Clark
Rancher
Posts: 377
Android Java Spring
Hey,

It's been a while since I have done XPath queries, so you'll have to look it up, or maybe someone else here can help.
If the table you are looking for has an id, you should be able to select the table by that id. If it has a distinctive class, or if any element above it has a distinctive class or id, you should be able to narrow the search that way.

Sean
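
For reference, here is a minimal sketch of the XPath approach discussed above, using only the XPath and DOM classes that ship with the JDK. It assumes the input is already well-formed XHTML without namespaces; for real-world pages a cleaner such as JTidy would have to produce the DOM `Document` first, and that step is not shown. The class name `TableXPathSketch`, its helper methods, and the sample page are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableXPathSketch {

    // Parses well-formed XHTML into a DOM. Assumption: malformed HTML has
    // already been cleaned up by some other tool (for example JTidy).
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    // Evaluates an XPath expression that selects elements and returns the matches.
    public static List<Element> select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<Element> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add((Element) nodes.item(i));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page with two tables, one of which has an id.
        String page = "<html><body>"
                + "<table id=\"prices\"><tr><td>1</td></tr></table>"
                + "<div><table class=\"layout\"><tr><td>2</td></tr></table></div>"
                + "</body></html>";
        Document doc = parse(page);

        // "//table" matches every table element anywhere in the document.
        System.out.println(select(doc, "//table").size());               // prints 2
        // A predicate narrows the search to the table with a given id.
        System.out.println(select(doc, "//table[@id='prices']").size()); // prints 1
    }
}
```

The expression `//table` returns all tables in one query, and a predicate such as `[@id='prices']` or `[@class='layout']` limits the result in the way Sean describes.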
 
Ulf Dittmer
Rancher
Posts: 42972
If the task also involves downloading the HTML from a web site, then I recommend looking into jWebUnit. It handles that part as well as providing a nice high-level API to get at page contents, including tables (in addition to using XPath).
 
swaroop rath
Greenhorn
Posts: 11
Thanks, Ulf. The task does involve downloading the HTML page from a website and looking for tables in it. I will look into jWebUnit.

Regards
Swaroop
 
swaroop rath
Greenhorn
Posts: 11
Hi Ulf,
Thanks again. But jWebUnit is a testing framework, more like Selenium, and it is heavier than what I need. It's good that, unlike Selenium RC, it does not need an external browser to work. But how about something lighter, like "http://htmlparser.sourceforge.net/"? I do not have any experience using htmlparser. Do you have a suggestion here?

Regards
Swaroop
 
Ulf Dittmer
Rancher
Posts: 42972
I'm not sure what you mean by "it's more heavy", but jWebUnit works just fine as a library for programmatic web access; just ignore everything where the docs talk about testing.

I'm fairly certain that libraries like jTidy, TagSoup and htmlparser don't address downloading the actual HTML, nor do they provide as versatile an API as jWebUnit for getting at parts of the page. If you prefer a lower-level API, where you have to write more code, then by all means go with htmlparser.
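
To give a rough idea of what "more code" means with a lower-level approach, here is a sketch that uses only JDK classes. It is not how jWebUnit, jTidy, TagSoup or htmlparser work internally. The class `TableRowsSketch` and its methods `fetch` and `rows` are hypothetical names; `fetch` is a plain download helper, and `rows` assumes the table is already available as a DOM element with lowercase tag names (turning arbitrary HTML into such a DOM is the job of one of the parsers mentioned above and is not shown).

```java
import java.io.InputStream;
import java.io.StringReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableRowsSketch {

    // Downloads a page as text. Assumption: the page is UTF-8 encoded;
    // a real client would inspect the Content-Type header instead.
    public static String fetch(String address) throws Exception {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Converts a DOM table element into rows of trimmed cell text.
    // Note: getElementsByTagName also returns rows of nested tables.
    public static List<List<String>> rows(Element table) {
        List<List<String>> result = new ArrayList<>();
        NodeList trs = table.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                String name = child.getNodeName();
                boolean isCell = name.equals("td") || name.equals("th");
                if (child.getNodeType() == Node.ELEMENT_NODE && isCell) {
                    cells.add(child.getTextContent().trim());
                }
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, well-formed sample table; no network access is needed here.
        String xhtml = "<table><tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td>Apple</td><td> 3 </td></tr></table>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(rows(doc.getDocumentElement())); // prints [[Name, Qty], [Apple, 3]]
    }
}
```

Downloading, cleaning up the markup, and walking the rows are three separate steps here, which is the trade-off Ulf points out against a higher-level library.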
 
swaroop rath
Greenhorn
Posts: 11
@Ulf - Many thanks. I have a clear idea now.

Regards
Swaroop
 