• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

Web Crawler in Java

 
Ranch Hand
Posts: 598
Android Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I have a task in which I have to scan the HTML source of web pages and extract some information depending upon the pattern. The extracted information will be save in the database for the business purpose. The amount of data being extracted in not much.

I am searching for an appropriate web crawler written in java.
Does anyone has some suggestion to give or any other inputs to be share.

Any kind of help is highly appreciated.
 
Rancher
Posts: 1337
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
You can find crawlers at www.java-source.net. For extracting information, use a library like HtmlUnit.
 
Himanshu Gupta
Ranch Hand
Posts: 598
Android Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks Lester for the reply. I have selected few crawlers projects and have decided to try them making a small prototype of each.
 
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I use httpclient for similar purposes...

Another option is to call wget command from java. It works fine when you have https
 
Greenhorn
Posts: 14
MySQL Database Spring Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Its better to use perl for this kind of tasks. You have mechanize or LWP library on CPAN. HTMLUnit is too slow.
 
Greenhorn
Posts: 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I think so you can try php examples firs, which you can find here:

http://broobee.com/folder/12/crawlers-applications
 
reply
    Bookmark Topic Watch Topic
  • New Topic