
Nutch -> Report all domain links but follow just a sublist

 
Gabriel Solano
I just started exploring Nutch to crawl a specific list of domains. What I want is to follow all the links within one particular domain: "domainx.com".
That is easy to configure in the:



But when I run the command to dump the link database:

bin/nutch readlinkdb crawl/linkdb -dump links

I realized that I only get the links that pass the domain filter. I want the crawler to report every link found on the pages of the domain I configured, including the ones that point outside the domain, but without following those external links. So if www.coderanch.com is linked from domainx.com/index.html, I want that link to be reported but not crawled. I hope that makes sense.
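To make the behaviour I'm after concrete, here is a minimal sketch in plain Python. It is not Nutch code and uses no Nutch API; the function names `in_domain` and `classify_links` and the sample URLs are hypothetical, made up only to illustrate the "report everything, follow only in-domain links" rule.

```python
from urllib.parse import urlparse


def in_domain(url, domain):
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def classify_links(outlinks, domain):
    """Split a page's outlinks into what should be reported and what should be followed.

    Every outlink is reported; only in-domain outlinks are queued for crawling.
    """
    reported = list(outlinks)
    to_follow = [url for url in outlinks if in_domain(url, domain)]
    return reported, to_follow


# Hypothetical outlinks found on domainx.com/index.html
links = [
    "http://domainx.com/about.html",
    "http://www.coderanch.com/",
]
reported, to_follow = classify_links(links, "domainx.com")
```

With these sample links, `reported` holds both URLs while `to_follow` holds only the domainx.com one, which is the split I would like to see in the link dump.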

Thanks!
 