What resources would be required for a Java-based web crawler?

 
Mohit G Gupta
Ranch Hand
Posts: 634
Eclipse IDE Chrome Java
I am thinking of building a web crawler that fetches not the whole internet but only documents (Word files, PPTs and the like) related to academics.
So, what resources are required?
Can I run it on my own PC, or would I need a separate machine for it?
 
Ulf Dittmer
Rancher
Posts: 43026
The first resource I would put to use is a web browser in order to google for existing crawlers, or head straight to java-source.net.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
The standard Java library has all you need to get started.
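
As a rough illustration of that point, here is a minimal sketch that uses only the standard library (Java 11 or later) to download a page, pull out its links, and keep the ones that look like documents. The class name `LinkExtractor`, its method names, and the list of file extensions are all made up for this example, and the regular expression is a deliberately naive stand-in for a real HTML parser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names here are illustrative, not from any library.
final class LinkExtractor {

    // Naive match for href="..." or href='...'; a real HTML parser is more robust.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Extensions treated as "academic documents" (an assumption for this sketch).
    private static final List<String> DOC_EXTENSIONS =
            List.of(".pdf", ".doc", ".docx", ".ppt", ".pptx");

    /** Downloads one page as text using java.net.http. */
    static String fetch(URI page) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(page).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Returns every href found in the HTML, resolved against the page address. */
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs (for example, ones with spaces).
            }
        }
        return links;
    }

    /** True when the link's path ends with one of the document extensions. */
    static boolean isDocument(URI uri) {
        String path = uri.getPath(); // null for opaque URIs such as mailto:
        if (path == null) {
            return false;
        }
        String lower = path.toLowerCase(Locale.ROOT);
        return DOC_EXTENSIONS.stream().anyMatch(lower::endsWith);
    }
}
```

A crawler loop would call `fetch`, pass the result to `extractLinks`, save whatever `isDocument` accepts, and queue the remaining links for later visits.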

I feel that web crawling has gotten a lot more complicated as people create more complex pages using more JavaScript to dynamically build a page.

The "semantic web" represents an attempt to allow better tagging of resources in a more academic style.

If I was doing a web crawler now, I would use Google searches as a front-end to locate potential sites of interest.

Bill
 
Mohit G Gupta
Ranch Hand
Posts: 634
Eclipse IDE Chrome Java
Thanks, William Brogden.
But how can the semantic web be useful for a web crawler?
And you said to use Google as a front end.
How is that possible?
My main goal is to build a web crawler and then use it for a search engine that helps users find docs and PPTs related to academics.

Please help, I am getting confused as web crawlers are new to me.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
Google has all sorts of services for developers.

Semantic Web - the intent of the "semantic web" is to provide ways to tag resources (such as HTML pages) so that search engines do a better job. This is a huge topic, jump right in!

Due to the massive inter-connected-ness of the web, a web crawler running on a single computer gets bogged down quickly after you get about 4 or 5 deep in the connections. The computer power Google applies to continuous web crawling is the single most mind-boggling fact of the web today.

Crawling for specific topics may still be feasible but you will need a way to start in the most useful spots and to discard the connections which are less likely to be useful.
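
One way to picture "start in useful spots, discard unlikely connections, and don't go too deep" is a breadth-first crawl with a depth limit and a filter. The sketch below is only an illustration: `FocusedCrawl`, `linksOf`, and `worthFollowing` are hypothetical names, and the link lookup is passed in as a function so the idea can be tried on an in-memory map before any real downloading is wired up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of a depth-limited, filtered breadth-first crawl.
final class FocusedCrawl {

    /**
     * Visits pages level by level, starting from the seeds.
     *
     * @param seeds          starting addresses (depth 0)
     * @param linksOf        returns the outgoing links of a page
     * @param worthFollowing decides whether a link is likely to be on topic
     * @param maxDepth       pages at this depth are visited but not expanded
     * @return the pages visited, in visiting order
     */
    static List<String> crawl(List<String> seeds,
                              Function<String, List<String>> linksOf,
                              Predicate<String> worthFollowing,
                              int maxDepth) {
        List<String> frontier = new ArrayList<>(new LinkedHashSet<>(seeds));
        Set<String> seen = new HashSet<>(frontier); // never queue a page twice
        List<String> visited = new ArrayList<>();

        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                visited.add(url);
                if (depth == maxDepth) {
                    continue; // deep enough: do not follow this page's links
                }
                for (String link : linksOf.apply(url)) {
                    // Keep only links that pass the filter and are new.
                    if (worthFollowing.test(link) && seen.add(link)) {
                        next.add(link);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}
```

The `seen` set is what stops the crawl from looping on the web's many cycles; the filter and `maxDepth` are the two knobs that keep a single machine from getting bogged down.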

Bill
 
Mohit G Gupta
Ranch Hand
Posts: 634
Eclipse IDE Chrome Java
code.google.com
I checked the site, but I was unable to find out how to use Google as a web crawler.
How can I use it as a front end?
Please help.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
I was not trying to say that you could use a Google API "as a web crawler" - my suggestion is that you could find usable starting web addresses with a Google API based on the kind of academic topics you appear to be interested in.

You are certainly not going to be able to crawl the entire web, so it seems to me you would want to start on pages that are already in your area.
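
To make the "starting addresses" idea a little more concrete, here is a hedged sketch that only builds the request address for one such service, the Google Custom Search JSON API. It assumes you have already obtained an API key and a search engine ID from Google; `SeedSearch`, `buildRequest`, and the placeholder values are invented for this example, and you should check the current API documentation for quotas and parameters before relying on it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a Custom Search JSON API request address.
final class SeedSearch {

    private static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /**
     * @param apiKey   your API key (assumed to exist already)
     * @param engineId your search engine ID, sent as the "cx" parameter
     * @param query    the topic to search for
     * @param fileType optional extension such as "pdf" or "ppt"; null to omit
     */
    static URI buildRequest(String apiKey, String engineId, String query, String fileType) {
        String url = ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query)
                + (fileType == null ? "" : "&fileType=" + encode(fileType));
        return URI.create(url);
    }

    // Form-encodes a value for the query string (spaces become '+').
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, `SeedSearch.buildRequest("YOUR_KEY", "YOUR_CX", "computer science lecture notes", "pdf")` gives an address you can fetch with any HTTP client. The response is JSON; each entry under `items` carries a `link` field, and those links are the candidate seeds to hand to your crawler. Parsing the JSON is left out here because the standard library has no JSON parser.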

Bill

(Note the edit, "not trying to say" stupid fingers....)
 
Mohit G Gupta
Ranch Hand
Posts: 634
Eclipse IDE Chrome Java
I checked the Google Custom Search API.
So should I use that one for the search engine?

I am making this for my final year project.
So, is it sufficient?
 
Ulf Dittmer
Rancher
Posts: 43026
Shouldn't a school project involve some research of your own? It sounds a bit as if you're not researching much in between asking here.
 
Mohit G Gupta
Ranch Hand
Posts: 634
Eclipse IDE Chrome Java
I have done research on web crawlers and got a suggestion to use a Google API,
but now I can't work out how to use it.
As William Brogden said:

I was not trying to say that you could use a Google API "as a web crawler" - my suggestion is that you could find usable starting web addresses with a Google API based on the kind of academic topics you appear to be interested in.

You are certainly not going to be able to crawl the entire web, so it seems to me you would want to start on pages that are already in your area.

Say I want everything related to computer science: how can this Google API help me?
I have added a project for the Google Custom Search API.
 