
Extract selected links from html

 
purnima Nair
Greenhorn
Posts: 8
I need to extract links and their text from an HTML page.


The above code gets all the links and all the text. I need to get the text corresponding to particular links.
The HTML page has lots of links. I need only selected links, excluding the ones in the header, footer and side menus. Please help.
 
Lester Burnham
Rancher
Posts: 1337
Is this for a particular web site? If so, the headers, footers and menus are probably part of a named DIV, or have a particular class associated. HTML parsers like HtmlUnit (try this one first), nekohtml, htmlcleaner or TagSoup should be able to give you access to that information.
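To make the named-DIV idea concrete, here is a rough sketch. It uses the JDK's built-in Swing parser rather than HtmlUnit, purely so it runs without extra jars; the same approach should carry over to any of the parsers mentioned above. The class name `ContentLinkFilter` and the ids "header" and "footer" are made up for illustration, and a real site will use its own ids or classes. It also assumes the DIVs are properly closed.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContentLinkFilter {

    /** Collects hrefs of links that are NOT inside a DIV whose id is in excludedIds. */
    public static List<String> contentLinks(Reader html, final Set<String> excludedIds)
            throws IOException {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int divDepth = 0;     // how many DIVs are currently open
            private int excludedAt = -1;  // depth of the excluded DIV we are inside, or -1

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.DIV) {
                    divDepth++;
                    Object id = attrs.getAttribute(HTML.Attribute.ID);
                    if (id == null) {
                        id = attrs.getAttribute("id"); // fallback: attribute stored under its raw name
                    }
                    if (excludedAt < 0 && id != null && excludedIds.contains(id.toString())) {
                        excludedAt = divDepth;
                    }
                } else if (tag == HTML.Tag.A && excludedAt < 0) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV) {
                    if (divDepth == excludedAt) {
                        excludedAt = -1; // leaving the excluded DIV
                    }
                    divDepth--;
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return hrefs;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>"
                + "<div id=\"header\"><a href=\"/home\">Home</a></div>"
                + "<div id=\"content\"><p><a href=\"/article\">Article</a></p></div>"
                + "<div id=\"footer\"><a href=\"/about\">About</a></div>"
                + "</body></html>";
        Set<String> skip = new HashSet<>(Arrays.asList("header", "footer"));
        System.out.println(contentLinks(new StringReader(page), skip));
    }
}
```

This only works when you know which ids (or classes) mark the parts you want to skip, which is why it matters whether the task targets one particular site.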
 
purnima Nair
Greenhorn
Posts: 8
No, this is to get links from any website; it is not tied to one particular site.
 
Lester Burnham
Rancher
Posts: 1337
That is most likely impossible to achieve in the general sense, unless you can give an algorithm that determines which links should be selected.
 
purnima Nair
Greenhorn
Posts: 8
Since this is required for all websites and we cannot generalise across them, it will not be possible to get only the selected links. Right?
Now my other query:
The code mentioned above retrieves all the text and all the links from the website. How can we get only the text corresponding to particular links?
E.g.:
I need to get the text 'Hello' corresponding to the href link 'www.google.com'.

 
Lester Burnham
Rancher
Posts: 1337
They're not passed in via the handleText method?

The Swing HTML stuff is kinda weak, though. If this was my problem, I'd use HtmlUnit.
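For what it's worth, here is a minimal sketch of pairing each href with its link text using the Swing callback: remember the href when the A tag opens, collect whatever handleText delivers while it is open, and store the pair when the tag closes. The class name `LinkTextExtractor` is hypothetical, and I'm assuming the original code used `HTMLEditorKit.ParserCallback` with `ParserDelegator`, since that is where handleText lives.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class LinkTextExtractor {

    /** Returns a map of href -> link text, in document order. */
    public static Map<String, String> extract(Reader html) throws IOException {
        final Map<String, String> links = new LinkedHashMap<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private String currentHref;                          // non-null only inside <a href=...>
            private final StringBuilder text = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    currentHref = (href == null) ? null : href.toString();
                    text.setLength(0); // start collecting text for this link
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (currentHref != null) {
                    text.append(data); // text that belongs to the open link
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && currentHref != null) {
                    links.put(currentHref, text.toString().trim());
                    currentHref = null;
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><p>Some other text</p>"
                + "<p><a href=\"www.google.com\">Hello</a></p></body></html>";
        Map<String, String> links = extract(new StringReader(page));
        System.out.println(links.get("www.google.com"));
    }
}
```

Note that a map keeps only one text per href, so if the same URL is linked twice the later text wins; use a list of pairs if that matters.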
 