URL Harvester

 
Rosie Nelson
Ranch Hand
Posts: 31
I'm trying to create a Java application which, given a URL, will copy to file all the .htm pages of a particular web site. What is the best strategy for collecting all the URLs relating to the various files on a web site? Is there a class which provides a collection of, say, all the URLs in a given web site, or does one have to start with the home page, search through it for all http references, and locate the URLs that way?
If anybody could shed any light on this matter I would be very appreciative.
Roz
 
gigel chiazna
Greenhorn
Posts: 17
First, you should know there are already some free Java crawlers out there that you could use and customize.
I also developed a "downloader" years ago, when I didn't have Internet access and Teleport was not an option.
You should start by thinking the design through thoroughly, as this is not as simple as it looks; a spider has many aspects.
Here are some thoughts:
  • use multiple download threads and have a manager for them
  • keep a reference table that tracks each file's status (downloaded, downloading, parsing, etc.)
  • rebuild the links once all the downloads have finished
  • as for the URLs problem, there is no class that gives you all the links in a page, but you can use regular expressions; also note that new URL(host, any_file) gives you a correct absolute URL whether any_file is relative to the host or an outside URL (a sketch pulling these pieces together follows this post)
Also, if you want a challenge - and a feature that I don't know of any spider offering - figure out the links that are built using JavaScript.
[ January 23, 2002: Message edited by: gigel chiazna ]
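
To make those points concrete, here is a rough sketch of how the pieces could fit together. Everything in it (the Harvester class, the Status values, the pool size, the href regex) is illustrative rather than any library's API, and writing the pages to disk and rebuilding the links are left out for brevity:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harvester {
    enum Status { QUEUED, DOWNLOADING, DONE, FAILED }

    // Reference table: the current status of every URL seen so far.
    private final Map<String, Status> table = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Phaser pending = new Phaser(1); // the main thread is party 1

    // Naive href extractor; a real spider would use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public void crawl(URL start) {
        submit(start);
        pending.arriveAndAwaitAdvance(); // block until every download finishes
        pool.shutdown();
    }

    private void submit(URL page) {
        // putIfAbsent guarantees each URL is downloaded only once.
        if (table.putIfAbsent(page.toString(), Status.QUEUED) != null) return;
        pending.register();
        pool.execute(() -> {
            try {
                table.put(page.toString(), Status.DOWNLOADING);
                StringBuilder html = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(page.openStream()))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        html.append(line).append('\n');
                    }
                }
                table.put(page.toString(), Status.DONE);
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    try {
                        // The two-argument constructor resolves relative links
                        // against the current page and leaves absolute ones alone.
                        URL link = new URL(page, m.group(1));
                        if (link.getHost().equals(page.getHost())) {
                            submit(link); // stay on the same site
                        }
                    } catch (MalformedURLException ignored) {
                        // skip links we cannot parse
                    }
                }
            } catch (Exception e) {
                table.put(page.toString(), Status.FAILED);
            } finally {
                pending.arriveAndDeregister();
            }
        });
    }

    public static void main(String[] args) throws Exception {
        new Harvester().crawl(new URL("http://java.sun.com"));
    }
}

The Phaser is only one way to wait for a pool whose tasks submit more tasks; a plain work queue drained by the main thread, breadth-first-search style, works just as well.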
     
Muralidhar Krishnamoorthy
Greenhorn
Posts: 13
Hi,
Can anyone post the code for downloading an HTML file from the web using Java, just by giving the URL?
Thanks
Murali
     
Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://java.sun.com");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String input;
while ((input = br.readLine()) != null)
    System.out.println(input);
br.close();
     
Muralidhar Krishnamoorthy
Greenhorn
Posts: 13
Thank you very much. But can I transfer the file as it is, like in FTP, instead of reading it through a BufferedReader?

Cheers
Murali
muralidharck@yahoo.com
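
For what Murali is asking, a minimal sketch: copy the raw response bytes straight to a file instead of going through a Reader, much like a binary FTP transfer. The URL and the output filename here are just placeholders:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class RawDownload {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://java.sun.com/index.html");
        // Copy the bytes unmodified; no line-by-line reading involved.
        try (InputStream in = url.openStream();
             OutputStream out = new FileOutputStream("index.html")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}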
     