
help to download link content and parse into plaintext

 
Greenhorn
Posts: 10
Hi!
I have gathered links that I need to download. I want to download the content of each link and, depending on the content type (HTML/PDF/...), use a parser to convert it into a text file.

I am looking for an open source API to do this for me.

Can anyone suggest some?

komal
 
Bartender
Posts: 9626
See the Java Tutorial for how to obtain content from URLs.
To parse the content, you'll have to deal with each type differently. The Java API includes an HTML parser, but for other types you'll need third-party libraries.
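The advice above can be sketched with nothing but the standard library: open the URL, read its stream, and collect the text. A minimal sketch (the class name and the idea of returning a `String` are illustrative, not from the tutorial):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class UrlFetcher {

    // Download the resource at the given address and return it as text.
    public static String fetch(String address) throws Exception {
        URL url = new URL(address);
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }
}
```

Note this reads everything as text; for binary types like PDF you would read raw bytes from the stream instead and hand them to the appropriate parser.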
 
komal sutaria
Greenhorn
Posts: 10
Hi!
Thanks for answering me, Joe, but my point is: before downloading the content with the URL API, I need to identify the content type. How do I find it? A URL may lead me to a PDF file.
 
Joe Ess
Bartender
Posts: 9626
Before downloading it? I don't know of any API to do that.
HTTP has the "Content-Type" header (see the HTTP specification, section 14.17), which allows a server to tell a client the MIME type of a URL. That's probably your best bet.
 
komal sutaria
Greenhorn
Posts: 10
Hi!
Thanks. I had tried the Nutch crawler, but it is giving me an error. Do you know of any other open source crawler?
komal
 
Joe Ess
Bartender
Posts: 9626
What is the error with Nutch? It may be something that changing crawlers won't fix (e.g. files protected by authentication, or connectivity problems).
I've had good results using the aforementioned HTML parser to locate hyperlinks and feed them back into a queue for visiting with java.net.URL, as a homegrown web crawler.
 