
crawler

 
bskkodee apv
Greenhorn
Posts: 1
Hey!
I am trying to build a search engine, for which I need a crawler. The idea is to crawl from a starting point and recursively open subsequent pages. This is not good enough to crawl the entire internet, so I need to know if there is some way I can query the DNS so that I can switch from one domain to another.
 
Ulf Dittmer
Rancher
Posts: 42970
Welcome to JavaRanch.

Why do you need to query the DNS for switching between domains? The java.net.URL class (which I'm assuming you use to access remote hosts) handles either domain names or IP addresses, so whichever is used on the pages you are crawling should work fine without additional work.
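
To illustrate the point, here is a minimal sketch of resolving links found on a page, whichever host they name. The class name `LinkHosts`, the method `hostOf`, and the sample addresses are made up for this example; `java.net.URI` is used here only because it resolves relative links without touching the network. Opening the resolved address (for example via `toURL().openStream()`) is where the name lookup would happen, with no DNS code on your side.

```java
import java.net.URI;

final class LinkHosts {
    // Resolve a link found on a page against that page's own address
    // and report which host the resulting absolute address points at.
    static String hostOf(String pageUrl, String href) {
        URI resolved = URI.create(pageUrl).resolve(href);
        return resolved.getHost();
    }

    public static void main(String[] args) {
        String page = "http://www.abc.com/index.html"; // hypothetical starting page

        // A relative link stays on the same host.
        System.out.println(hostOf(page, "about.html"));                // www.abc.com

        // An absolute link may name any other domain...
        System.out.println(hostOf(page, "http://www.metallica.com/")); // www.metallica.com

        // ...or a literal IP address; both are handled the same way.
        System.out.println(hostOf(page, "http://192.0.2.10/tour"));    // 192.0.2.10
    }
}
```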

When building the crawler, be sure to observe the rules laid down by robots.txt and any applicable meta tags about retrieving and indexing pages.
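
As a rough illustration of the robots.txt part, the sketch below checks a path against the `Disallow` rules that apply to all crawlers. It is deliberately simplified: the class `RobotsRules` is invented for this example, it only honours `User-agent: *` groups and plain path prefixes, and it ignores `Allow` lines, wildcards, and meta tags. The text of robots.txt is assumed to have been downloaded already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    // Parse the text of a robots.txt file, keeping only the Disallow
    // rules from groups addressed to every crawler ("User-agent: *").
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean appliesToAll = false; // inside a group that includes "User-agent: *"?
        boolean inAgentLines = false; // was the previous field also User-agent?
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                // Consecutive User-agent lines belong to the same group.
                appliesToAll = inAgentLines ? (appliesToAll || star) : star;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && appliesToAll && !value.isEmpty()) {
                    rules.disallowedPrefixes.add(value);
                }
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

A crawler would call `RobotsRules.parse(...)` once per host and ask `isAllowed(path)` before each request to that host.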
 
anand vijayan
Greenhorn
Posts: 2
Hey,
I am using the java.net.URL class itself; it helps me establish a remote connection. But say, for instance, I am looking for the word "metallica", and the most probable result is www.metallica.com. If I don't switch between domains from the starting point, there is no way I can reach that site in the crawling process. So I need to switch between domains. If there is some way I can do this, please let me know.
Thanks.
 
Ulf Dittmer
Rancher
Posts: 42970
Not following at all - are you searching or crawling? You seem to combine them in a way I don't understand. Either way, there is no need to access DNS information, so what exactly do you mean by "switching between domains"? The URL class doesn't care which domain you access, or whether you change domains with every other call.
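
A small sketch of what that means for a crawl loop: the frontier below is just a queue of addresses, and a link to another domain is queued exactly like a link to the same site. Everything here is illustrative; `FrontierCrawl` and the `linksOf` parameter are invented names, and `linksOf` stands in for "fetch the page and extract its links", which is left out so the loop can be read (and run) without network access. The `maxPages` cap is an assumption added to keep the crawl bounded.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class FrontierCrawl {
    // Breadth-first crawl from a start address, visiting at most maxPages pages.
    // Returns the addresses in the order they were visited.
    static List<String> crawl(String start, int maxPages,
                              Function<String, List<String>> linksOf) {
        Deque<URI> frontier = new ArrayDeque<>(); // pages still to visit
        Set<URI> seen = new HashSet<>();          // pages already queued
        URI first = URI.create(start);
        frontier.add(first);
        seen.add(first);

        List<String> visited = new ArrayList<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            URI page = frontier.poll();
            visited.add(page.toString());
            for (String href : linksOf.apply(page.toString())) {
                // The resolved target may be on the same host or a different one;
                // the loop treats both cases identically.
                URI target = page.resolve(href);
                if (seen.add(target)) {
                    frontier.add(target);
                }
            }
        }
        return visited;
    }
}
```

With a link table in which www.abc.com links to www.metallica.com, the crawl reaches the second site without any domain-specific step; a real crawler would replace `linksOf` with an HTTP fetch plus link extraction.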

By the way, you should really use the same login for posting, even if you post from different IP addresses. It's quite confusing otherwise, and leaves questions as to who is actually posting here.
 
anand vijayan
Greenhorn
Posts: 2
k..
my prob is.. i need to crawl the entire internet.. how do i do it..?
i thot i cud start at some site say www.abc.com and recursively crawl the tree. as well as switch domains.
 
Ulf Dittmer
Rancher
Posts: 42970
Several points in no particular order:
  • In a forum like this you should UseRealWords. Not everybody here speaks English as their native language, and it's hard enough to follow discussions as it is. Abbreviations like "thot", "cud", "2" and "4" are appropriate in text messaging, but not in a forum like this.
  • "Crawling the whole internet" is a very dubious proposition. My suggestion would be: don't do it. You're hogging bandwidth, putting unnecessary load on people's servers, and of course, it's not going to work (the internet is kinda big these days).
  • If you need search, use Google. If you need a crawler, use one of the existing ones. Just be aware that crawlers aren't welcome everywhere (read my earlier remark about robots.txt and related HTTP headers).
  • The fact that you still think that accessing different domains is a problem indicates that you should do some research about TCP/IP and Java networking.

  •  
    Lasse Koskela
    author
    Sheriff
    Posts: 11962
    Originally posted by Ulf Dittmer:
    Abbreviations like "thot", "cud", "2" and "4" are appropriate in text messaging, but not in a forum like this.

    Actually, I wouldn't consider that "appropriate" in text messaging either.
    Honestly, if you've got something to say that doesn't fit into a single text message with the words spelled out in full, you'd better make a call instead or simply ask the recipient to call you back.
     