
What resources would be required for a Java-based web crawler?

 
Mohit G Gupta
Ranch Hand
Posts: 634
I am thinking of making a web crawler that fetches not the whole of the internet, but only documents (Word files, PPTs) related to academics.
So, what resources are required?
Can I implement it on my PC, or would I need a separate PC for it?
 
Ulf Dittmer
Rancher
Posts: 42968
The first resource I would put to use is a web browser in order to google for existing crawlers, or head straight to java-source.net.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13064
The standard Java library has all you need to get started.
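
As a rough illustration of how far the standard library goes, here is a minimal sketch that pulls links out of an already-downloaded page and keeps only the document types mentioned above. The class name `LinkExtractor`, its method names, and the list of file extensions are assumptions made for this example; the regular expression is deliberately naive and will miss links that JavaScript builds at runtime. Downloading the page itself could be done with `java.net.HttpURLConnection`, which is left out here.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Naive pattern for quoted href attributes; a real HTML parser is more robust.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns every link found in the HTML, resolved against the page's own URL. */
    public static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not syntactically valid URIs.
            }
        }
        return links;
    }

    /** True if the URL ends in one of the (assumed) document extensions of interest. */
    public static boolean isDocument(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.endsWith(".pdf") || lower.endsWith(".doc") || lower.endsWith(".docx")
                || lower.endsWith(".ppt") || lower.endsWith(".pptx");
    }
}
```

Calling `extractLinks` with the page URL and its HTML gives absolute URLs, and `isDocument` then separates downloadable documents from ordinary pages that may be worth following.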

I feel that web crawling has gotten a lot more complicated as people create more complex pages using more JavaScript to dynamically build a page.

The "semantic web" represents an attempt to allow better tagging of resources in a more academic style.

If I was doing a web crawler now, I would use Google searches as a front-end to locate potential sites of interest.

Bill
 
Mohit G Gupta
Ranch Hand
Posts: 634
Thanks, William Brogden.
But how can the semantic web be useful for a web crawler?
And you said to use Google as a front end.
How is that possible?
My main motive is to make a web crawler and then use it for a search engine which helps users find docs and PPTs related to academics.

Please help, I am getting confused as web crawlers are new to me.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13064
Google has all sorts of services for developers.

Semantic Web - the intent of the "semantic web" is to provide ways to tag resources (such as HTML pages) so that search engines do a better job. This is a huge topic, jump right in!

Due to the massive inter-connected-ness of the web, a web crawler running on a single computer gets bogged down quickly after you get about 4 or 5 deep in the connections. The computer power Google applies to continuous web crawling is the single most mind-boggling fact of the web today.

Crawling for specific topics may still be feasible but you will need a way to start in the most useful spots and to discard the connections which are less likely to be useful.
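
To make that concrete, here is a minimal sketch of a depth-limited crawl that starts from chosen seed pages and drops links that look unhelpful. Everything named here (`FocusedCrawl`, `linksOf`, `worthFollowing`) is hypothetical: `linksOf` stands in for whatever downloads a page and extracts its links, and `worthFollowing` stands in for whatever relevance test you come up with.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

public class FocusedCrawl {
    /**
     * Breadth-first crawl that visits pages at most maxDepth links away from the seeds.
     * linksOf:        hypothetical hook returning the outgoing links of a page.
     * worthFollowing: hypothetical filter deciding whether a link is likely to be useful.
     */
    public static List<String> crawl(List<String> seeds, int maxDepth,
                                     Function<String, List<String>> linksOf,
                                     Predicate<String> worthFollowing) {
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        for (String seed : seeds) {
            if (seen.add(seed)) frontier.add(seed); // ignore duplicate seeds
        }
        List<String> visited = new ArrayList<>();
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                visited.add(url);
                if (depth == maxDepth) continue; // pages at the limit are not expanded
                for (String link : linksOf.apply(url)) {
                    // Queue a link only if it passes the filter and is new.
                    if (worthFollowing.test(link) && seen.add(link)) next.add(link);
                }
            }
            frontier = next;
        }
        return visited;
    }
}
```

The `seen` set keeps the crawler from revisiting pages, and the depth limit is what stops a single machine from getting bogged down a few levels into the link graph.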

Bill
 
Mohit G Gupta
Ranch Hand
Posts: 634
code.google.com
I checked the site but I was unable to find how to use Google as a web crawler.
How can I use it as a front end?
Please help.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13064
I was not trying to say that you could use a Google API "as a web crawler" - my suggestion is that you could find usable starting web addresses with a Google API based on the kind of academic topics you appear to be interested in.

You are certainly not going to be able to crawl the entire web, so it seems to me you would want to start on pages that are already in your area.
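
As a hedged sketch of what "finding starting addresses" might look like, the snippet below only builds a request URL for the Google Custom Search JSON API; it does not send it. The class name `SeedQuery` is made up for this example, and the API key and search engine ID are placeholders you would obtain from Google yourself. As far as I know the response is JSON whose `items` entries each carry a `link` field, which could serve as the seed addresses, but check the current API documentation before relying on that.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SeedQuery {
    // Endpoint of the Google Custom Search JSON API.
    private static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** Builds a search request URL; apiKey and engineId are placeholders supplied by the caller. */
    public static String buildUrl(String apiKey, String engineId, String topic) {
        return ENDPOINT + "?key=" + enc(apiKey) + "&cx=" + enc(engineId) + "&q=" + enc(topic);
    }

    // URL-encodes one query parameter value as UTF-8.
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

For example, `SeedQuery.buildUrl(key, engineId, "computer science lecture notes")` gives a URL whose results could be fed to your own crawler as starting pages.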

Bill

(Note the edit, "not trying to say" stupid fingers....)
 
Mohit G Gupta
Ranch Hand
Posts: 634
I checked the Google Custom Search API.
So should I use that one for the search engine?

I am making this for my final year project.
So, is it sufficient?
 
Ulf Dittmer
Rancher
Posts: 42968
Shouldn't a school project involve some research of your own? It sounds a bit as if you're not researching much in between asking here.
 
Mohit G Gupta
Ranch Hand
Posts: 634
I have done research on web crawlers and got a suggestion to use a Google API,
but now I am unable to work out how to use it.
As William Brogden said:

I was not trying to say that you could use a Google API "as a web crawler" - my suggestion is that you could find usable starting web addresses with a Google API based on the kind of academic topics you appear to be interested in.

You are certainly not going to be able to crawl the entire web, so it seems to me you would want to start on pages that are already in your area.


Say I want everything related to computer science; how can this Google API help me?
I added a project on the Google Custom Search API.
 