This week's book giveaway is in the Programmer Certification forum.
We're giving away four copies of OCP Oracle Certified Professional Java SE 11 Programmer I Study Guide: Exam 1Z0-815 and have Jeanne Boyarsky & Scott Selikoff on-line!
See this thread for details.
Win a copy of OCP Oracle Certified Professional Java SE 11 Programmer I Study Guide: Exam 1Z0-815 this week in the Programmer Certification forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Liutauras Vilda
  • Junilu Lacar
  • Jeanne Boyarsky
  • Bear Bibeault
Sheriffs:
  • Knute Snortum
  • Devaka Cooray
  • Tim Cooke
Saloon Keepers:
  • Tim Moores
  • Stephan van Hulst
  • Tim Holloway
  • Ron McLeod
  • Carey Brown
Bartenders:
  • Paweł Baczyński
  • Piet Souris
  • Vijitha Kumara

Web Crawler in Java

 
Ranch Hand
Posts: 598
Android Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have a task in which I have to scan the HTML source of web pages and extract some information depending upon the pattern. The extracted information will be save in the database for the business purpose. The amount of data being extracted in not much.

I am searching for an appropriate web crawler written in java.
Does anyone has some suggestion to give or any other inputs to be share.

Any kind of help is highly appreciated.
 
Rancher
Posts: 1337
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You can find crawlers at www.java-source.net. For extracting information, use a library like HtmlUnit.
 
Himanshu Gupta
Ranch Hand
Posts: 598
Android Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks Lester for the reply. I have selected few crawlers projects and have decided to try them making a small prototype of each.
 
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I use httpclient for similar purposes...

Another option is to call wget command from java. It works fine when you have https
 
Greenhorn
Posts: 14
MySQL Database Spring Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Its better to use perl for this kind of tasks. You have mechanize or LWP library on CPAN. HTMLUnit is too slow.
 
Greenhorn
Posts: 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I think so you can try php examples firs, which you can find here:

http://broobee.com/folder/12/crawlers-applications
 
It wasn't my idea to go to some crazy nightclub in the middle of nowhere. I just wanted to stay home and cuddle with this tiny ad:
Java file APIs (DOC, XLS, PDF, and many more)
https://products.aspose.com/total/java
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!