
parse a web page using Java

 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Hi,

I am working on parsing a web page using Java. I have searched the web and read about the various parsers: HTML Parser, JTidy, Jericho, etc.
I am confused as to which parser to use.

Basically, I have to parse a page, for example eBay, and then retrieve the results for a given query. For example, if "laptop" is the query, I want to be able to retrieve the various results returned by the server.

Do I have to use a third-party API, or does Java provide something that would be handy for my problem?

Thanks much!
Nilesh
 
David Newton
Author
Rancher
Posts: 12617
Your best bet is a parser; which to use is pretty much up to you.
 
Ulf Dittmer
Rancher
Posts: 42968
The easiest is probably to use one of the parsers that create valid XML (like TagSoup) and then to treat the problem as an XML processing issue. That way you can use XPath or XQuery. You may also want to check out HtmlUnit which handles the HTML retrieval as well as the HTML parsing.
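
To make the "treat it as an XML processing issue" step concrete, here is a minimal sketch using only the XPath support that ships with the JDK. It assumes the page has already been converted to well-formed XML by some other tool (that conversion step is not shown), and the class name, markup, and `class="item"` attribute are made up for illustration. Depending on the converter, the elements may end up in the XHTML namespace, in which case the expression would need a namespace prefix.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTextExtractor {

    /** Returns the trimmed text content of every node matched by the XPath expression. */
    public static List<String> extractText(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Parse the XML string and evaluate the expression as a node set.
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> results = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            results.add(nodes.item(i).getTextContent().trim());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, already well-formed markup standing in for a cleaned-up results page.
        String xml = "<html><body>"
                + "<div class=\"item\"><a href=\"/1\">Laptop A</a></div>"
                + "<div class=\"item\"><a href=\"/2\">Laptop B</a></div>"
                + "</body></html>";
        System.out.println(extractText(xml, "//div[@class='item']/a"));
    }
}
```

The same method works for any tag: pass `//div` or `//a` as the expression to collect the text inside those elements.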
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Thanks David and Ulf.

I am a newbie in this field, so I was looking for a parser with good documentation so that I can get the hang of it and use it in the future without any problems.
Could you suggest a decent parser with good documentation?

Thanks much!
Nilesh
 
Ulf Dittmer
Rancher
Posts: 42968
Good documentation is not always available for open source projects - you get what you pay for. I suggest investigating the libraries I mentioned and seeing whether you run into any problems.
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Hi, thanks. I've run into a problem.

I am stuck at a point where I have to retrieve the text inside particular tags, for example:
<div ... > I want to be retrieved </div>
<a> I want to be retrieved </a>

Any suggestions? I am currently using the Jericho parser.
 
David Newton
Author
Rancher
Posts: 12617
I'd suggest looking at its documentation and examples; there are *many* examples of doing precisely what you're trying to do.

http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Element.html
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Thanks a ton David!!! Was able to finish off the task!! Appreciate the help
 
David Newton
Author
Rancher
Posts: 12617
Great--glad you got it working :)
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
David, one more question.

The results I am getting through the program are not exactly in the same order as they appear on the website. This phenomenon is not consistent, but I am wondering why.

Is it because the query is being fired from a program rather than a browser? I tried setting the currentCompatibilityConfig of the Jericho parser to MOZILLA and IE, but the results didn't change. I have read that you have to set a user agent? I couldn't find a way to do that in the Jericho parser. Any suggestions?

 
Campbell Ritchie
Sheriff
Posts: 49813
No longer a "beginning" topic. Moving thread.
 
David Newton
Author
Rancher
Posts: 12617
I have no idea--I've never used it. Personally, I'd get the HTML source using something else that would allow me to set the user agent, handle cookies, etc.

As for which search results are returned, that would depend entirely on the website you're trying to scrape (please make sure you're not violating anybody's terms of service). It could depend on the user agent, cookies, previous searches, moon phase... anything.
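
One way to fetch the HTML separately from the parser, sketched with the `java.net.http` client that ships with Java 11 and later. This is only an illustration of the idea, not something tested against any particular site: the class name, URL, and user-agent string are placeholders, and whether a site serves different results for a different user agent is entirely up to that site.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageFetcher {

    /** Builds a GET request that sends the given User-Agent header. */
    public static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    /** Fetches the page body as a string, keeping cookies and following redirects. */
    public static String fetch(String url, String userAgent) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())          // remember cookies between requests
                .followRedirects(HttpClient.Redirect.NORMAL) // follow ordinary redirects
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url, userAgent), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL and user-agent string; substitute your own.
        String html = fetch("https://example.com/", "Mozilla/5.0 (compatible; ExampleFetcher/1.0)");
        System.out.println("Fetched " + html.length() + " characters");
    }
}
```

The returned string can then be handed to whichever parser you are using, so the parser only has to deal with parsing and the HTTP details stay in one place.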
 