
Reading URL with Java

 
Farakh khan
Ranch Hand
Posts: 833
URL yahoo = new URL("http://www.yahoo.com/");

This gets the main URL. How can I read all the URLs related to this link, e.g. http://mail.yahoo.com or http://www.yahoo.com/cc/bb/tt.asp, etc.?

Thanks & best regards
 
Ben Souther
Sheriff
Posts: 13411
You would have to parse the results and generate a new URL for each link found.
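
For illustration, a minimal sketch of that idea: fetch the page over HTTP with java.net.URL and pull the href attributes out with a simple regular expression. The class and method names here are made up for the example, and a regex is only a crude stand-in for a real HTML parser:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Naive pattern for absolute href="..." attributes.
    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(URL page) throws Exception {
        // Read the whole document into one string.
        StringBuilder html = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(page.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();

        // Collect every absolute link found in the document.
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://www.yahoo.com/");
        for (String link : extractLinks(yahoo)) {
            System.out.println(link);
        }
    }
}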
 
Farakh khan
Ranch Hand
Posts: 833
Originally posted by Ben Souther:
You would have to parse the results and generate a new URL for each link found.


Can you please explain with a code example?

Thanks for your reply
 
Jelle Klap
Bartender
Posts: 1952
Based on a URL object you can perform an HTTP GET of the HTML document to which the URL points. Once you have the HTML document you would have to parse its body to retrieve all the links (URLs) it contains and convert those to new URL objects. For those URL objects you can do the same and that way you might end up indexing every page of the web site. You just have to be smart about which URLs to retrieve, or the process might take a wee bit of time as it indexes half the pages on the internet...
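
As a rough sketch of the "be smart about which URLs to retrieve" part, one option is to keep only the links that live on the same host as the starting URL before turning them into new URL objects. This builds on the hypothetical extractLinks helper from the sketch above; none of these names come from a real library:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class LinkFilter {

    // Turn extracted link strings into URL objects, keeping only links on
    // the same host so the crawl doesn't wander across the whole internet.
    public static List<URL> sameHostUrls(List<String> links, URL origin) {
        List<URL> result = new ArrayList<URL>();
        for (String link : links) {
            try {
                URL url = new URL(link);
                if (url.getHost().equalsIgnoreCase(origin.getHost())) {
                    result.add(url);
                }
            } catch (MalformedURLException e) {
                // Skip anything that isn't a valid absolute URL.
            }
        }
        return result;
    }
}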
 
Farakh khan
Ranch Hand
Posts: 833
Originally posted by Jelle Klap:
Based on a URL object you can perform an HTTP GET of the HTML document to which the URL points. Once you have the HTML document you would have to parse its body to retrieve all the links (URLs) it contains and convert those to new URL objects. For those URL objects you can do the same and that way you might end up indexing every page of the web site. You just have to be smart about which URLs to retrieve, or the process might take a wee bit of time as it indexes half the pages on the internet...



Thanks for the prompt response.
Can you please point me to an example or tutorial?

Thanks again
 
Jelle Klap
Bartender
Posts: 1952
This article should get you where you need to go:

http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
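
In the same spirit as that article (this is a rough sketch, not the article's code), a breadth-first crawl boils down to a queue of URLs still to visit and a set of pages already seen. It reuses the hypothetical LinkExtractor and LinkFilter classes sketched earlier in the thread:

import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class MiniCrawler {

    public static void crawl(URL start, int maxPages) throws Exception {
        Set<String> visited = new HashSet<String>();
        Queue<URL> queue = new LinkedList<URL>();
        queue.add(start);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            URL page = queue.poll();
            if (!visited.add(page.toString())) {
                continue; // already seen this page
            }
            System.out.println("Visiting: " + page);

            // Extract links and keep only the same-host ones.
            for (URL next : LinkFilter.sameHostUrls(
                    LinkExtractor.extractLinks(page), start)) {
                if (!visited.contains(next.toString())) {
                    queue.add(next);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        crawl(new URL("http://www.yahoo.com/"), 20);
    }
}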
 
Ben Souther
Sheriff
Posts: 13411
Firefox Browser Redhat VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Farakh khan:
Can you please explain with a code example?
Thanks for your reply


That would take more time than I have right now.

There is a popular Unix program called wget that is used to replicate websites for mirroring. Out of curiosity, I googled 'wget java implementation' to see if anyone had written a Java version, and found this project:
http://www.openwfe.org/apidocs/openwfe/org/misc/Wget.html

I'm sure, with a little searching, you could find others that do the same thing.
 
Jelle Klap
Bartender
Posts: 1952
For once
 
Bill Shirley
Ranch Hand
Posts: 457
Everyone answering has assumed you want to access all the links reachable from the original page.

Your question implies you are trying to find all subdomains and/or all subdirectories available at a site. This is not necessarily possible; there is no standard way to do it. Crawling the site as hinted at *might* be successful.
 
Farakh khan
Ranch Hand
Posts: 833
Originally posted by Jelle Klap:
This article should get you where you need to go:

http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/


The link is very useful

Thanks
 