
Help downloading link content and parsing it into plain text

 
komal sutaria
Greenhorn
Posts: 10
Hi!
I have gathered links that I need to download. I want to download the content of each link and, depending on the content type (HTML/PDF/...), use a parser to convert it into a plain text file.

I am looking for an open source API to do this for me.

Can anyone please suggest one?

komal
 
Joe Ess
Bartender
Posts: 9362
See the Java Tutorial for how to obtain content from URLs.
To parse the content, you'll have to handle each type differently. The Java API includes an HTML parser, but for other types you'll need third-party libraries.
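A rough sketch of the fetch step with java.net.URL, along the lines of the tutorial mentioned above (the example URL is a placeholder, and UTF-8 is assumed for brevity):

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class UrlFetcher {

    // Read an entire stream into a String, one line at a time.
    public static String readAll(InputStream in) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL: substitute one of the gathered links.
        URLConnection conn = new URL("http://example.com/").openConnection();
        // The server reports the MIME type of the response here.
        System.out.println("Content-Type: " + conn.getContentType());
        System.out.println(readAll(conn.getInputStream()));
    }
}
```

Once the bytes are local, you can dispatch on the reported content type to pick a parser.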
 
komal sutaria
Greenhorn
Posts: 10
Hi!
Thanks for answering, Joe, but my point is that I need to identify the content type before downloading via the URL API. How do I find it when a URL leads to, say, a PDF file?
 
Joe Ess
Bartender
Posts: 9362
Before downloading it? I don't know of any API to do that.
HTTP has the "Content-Type" header (see the HTTP specification, section 14.17), which allows a server to tell a client the MIME type of a URL. That's probably your best bet.
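One way to check that header without pulling down the whole body is an HTTP HEAD request, which only transfers the response headers. A minimal sketch with HttpURLConnection (the class and method names below are hypothetical helpers, not an existing library):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeCheck {

    // The header can carry parameters, e.g. "text/html; charset=UTF-8".
    // Strip them so only the bare MIME type remains.
    public static String mimeType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semi = contentTypeHeader.indexOf(';');
        String type = (semi >= 0) ? contentTypeHeader.substring(0, semi) : contentTypeHeader;
        return type.trim().toLowerCase();
    }

    // Issue a HEAD request: headers are returned, the body is not transferred.
    public static String headContentType(URL url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return mimeType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL: a link whose type you want to identify first.
        System.out.println(headContentType(new URL("http://example.com/")));
    }
}
```

You could then route "text/html" to the HTML parser and "application/pdf" to a PDF library before fetching the full document. Note that not every server honors HEAD, so falling back to a GET and checking getContentType() on the same connection is a reasonable plan B.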
 
komal sutaria
Greenhorn
Posts: 10
Hi!
Thanks. I tried the Nutch crawler, but it is giving me an error. Do you know of any other open source crawler?
komal
 
Joe Ess
Bartender
Posts: 9362
What is the error with Nutch? It may be something that changing crawlers won't fix (e.g. files protected by authentication, or connectivity problems).
I've had good results using the aforementioned HTML parser to locate hyperlinks and feed them back into a queue for visiting with java.net.URL, as a homegrown web crawler.
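The link-extraction half of that homegrown approach can be sketched with the JDK's own HTML parser (javax.swing.text.html); a real crawler would resolve the extracted hrefs against the page URL and push them onto a visit queue:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {

    // Collect the href attribute of every <a> tag the JDK HTML parser sees.
    public static List<String> extractLinks(Reader html) throws Exception {
        final List<String> links = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"page2.html\">next</a></body></html>";
        // Extracted links would be resolved and fed back into the crawl queue.
        System.out.println(extractLinks(new StringReader(page)));
    }
}
```

Pairing this with a Content-Type check per URL keeps the crawler from feeding PDFs to the HTML parser.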
 