URL Harvester

 
Ranch Hand
Posts: 31
I'm trying to create a Java application which, given a URL, will copy to file all the .htm pages of a particular web site. What is the best strategy for collecting all the URLs for the various files on a web site? Is there a class that holds a collection of, for example, all the URLs in a given web site, or does one have to start with the home page, search through that page for all http references, and locate the URLs that way?
If anybody could shed any light on this matter I would be very appreciative.
Roz
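For what it's worth, a rough sketch of that second approach (start at the home page and follow the references you find), assuming hypothetical download and extractLinks helpers; none of this is from the thread, it just illustrates the idea:

import java.net.URL;
import java.util.*;

public class SiteCrawler {
    // Hypothetical helpers: fetch a page's HTML, and pull URLs out of it.
    static String download(URL page) { /* fetch with url.openStream() */ return ""; }
    static List<URL> extractLinks(URL page, String html) { return Collections.emptyList(); }

    public static void crawl(URL home) {
        Set<URL> visited = new HashSet<>();
        Deque<URL> queue = new ArrayDeque<>();
        queue.add(home);
        while (!queue.isEmpty()) {
            URL page = queue.poll();
            if (!visited.add(page)) continue;      // skip pages we've already seen
            String html = download(page);          // also save the page to file here, if desired
            for (URL link : extractLinks(page, html)) {
                // stay on the same site and only follow .htm pages
                if (link.getHost().equals(home.getHost()) && link.getPath().endsWith(".htm")) {
                    queue.add(link);
                }
            }
        }
    }
}

The visited set is what keeps the crawl from looping forever when pages link back to each other.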
 
Greenhorn
Posts: 17
First, you should know there are already some free Java crawlers out there that you could use and customize.
I also developed a "downloader" years ago, when I didn't have Internet access and Teleport was not an option.
You should start by thinking the design through thoroughly, as this is not as simple as it looks; a spider has many aspects.
Here are some thoughts:
  • use several download threads and have a manager for them (a rough sketch of this idea follows right after this list)
  • keep a reference table to track the status of each file (downloaded, downloading, parsing, etc.)
  • build the links only once all the downloads have finished
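A minimal sketch of that download-manager idea, assuming a fixed thread pool and a shared status table; the class name, pool size, and the download() placeholder are illustrative only, not anything from the original posts:

import java.util.concurrent.*;

public class DownloadManager {
    enum Status { QUEUED, DOWNLOADING, DONE, FAILED }

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final ConcurrentMap<String, Status> statusTable = new ConcurrentHashMap<>();

    public void enqueue(String url) {
        statusTable.put(url, Status.QUEUED);
        pool.submit(() -> {
            statusTable.put(url, Status.DOWNLOADING);
            try {
                download(url);                       // fetch and save the page (not shown here)
                statusTable.put(url, Status.DONE);
            } catch (Exception e) {
                statusTable.put(url, Status.FAILED);
            }
        });
    }

    private void download(String url) throws Exception {
        // placeholder: real code would open the URL and copy it to a file
    }

    public void shutdown() { pool.shutdown(); }
}

A ConcurrentHashMap works well as the "reference table" because several downloader threads update it at the same time.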


  • As for the URLs problem, there is no class that gives you all the links on a page, but you can use regular expressions. Also note that new URL(host, any_file) gives you a correct absolute URL whether the file reference is relative to the host or points to an outside site (see the sketch after this post).
    Also, if you want a challenge - and a feature that I don't know of any spider offering - try to figure out links that are built using JavaScript.
    [ January 23, 2002: Message edited by: gigel chiazna ]
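Following up on the regular-expression suggestion above, a rough sketch of pulling href values out of a page and resolving them with the two-argument URL constructor; the pattern here is deliberately simple and only an illustration, real HTML needs something more forgiving:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

public class LinkExtractor {
    // Very simple pattern for href="..." attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<URL> extractLinks(URL page, String html) {
        List<URL> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // new URL(context, spec) resolves relative paths against the page
                // and leaves absolute URLs to other sites untouched.
                links.add(new URL(page, m.group(1)));
            } catch (MalformedURLException e) {
                // skip unparsable links (e.g. javascript: pseudo-URLs)
            }
        }
        return links;
    }
}

The two-argument constructor is what answers the relative-vs-absolute question: a relative href is resolved against the page's own URL, while an absolute href is kept as-is.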
     
    Greenhorn
    Posts: 13
    Hi
Can anyone post the code for downloading an HTML file from the web using Java, just by giving the URL?
    Thanks
    Murali
     
    mister krabs
    Posts: 13974
import java.io.*;
import java.net.*;

URL url = new URL("http://java.sun.com");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String input;                                   // holds one line of the page at a time
while ((input = br.readLine()) != null)
    System.out.println(input);
br.close();
     
    Muralidhar Krishnamoorthy
    Greenhorn
    Posts: 13
Thank you very much. But can I transfer the file as-is, like in FTP, instead of reading it line by line through the BufferedReader?

    Cheers
    Murali
    [email protected]
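A rough sketch of that kind of raw byte-for-byte transfer, copying the URL's stream straight into a local file with no Reader in between; the class and file names here are only illustrative:

import java.io.*;
import java.net.URL;

public class PageDownloader {
    public static void saveToFile(String address, String fileName) throws IOException {
        URL url = new URL(address);
        try (InputStream in = url.openStream();
             OutputStream out = new FileOutputStream(fileName)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);   // copy bytes unchanged, like a binary FTP transfer
            }
        }
    }

    public static void main(String[] args) throws IOException {
        saveToFile("http://java.sun.com", "index.htm");
    }
}

Because the bytes are copied unchanged, this also works for images and other binary files, not just .htm pages.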