
Nutch -> Report all domain links but follow just a sublist

Gabriel Solano
I just started exploring Nutch to crawl a certain list of domains. What I want to do is follow all the links within a specific domain: "domainx.com".
That is easy to configure in the:

But when I run the command to create the links database:

bin/nutch readlinkdb crawl/linkdb -dump links

I realized that I only get the links that pass the domain filter. I want the crawler to report every link found on the pages of the domain I configured, including links that point outside the domain, but without following those external links. So if a link to www.coderanch.com appears in domainx.com/index.html, I want that link to be reported but not crawled. I hope that explains what I'm after.
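
To make the intended behaviour concrete, here is a minimal sketch in plain Python. It is not Nutch code and uses no Nutch API; the names `LinkCollector`, `in_domain` and `split_links` are made up for illustration. It only shows the distinction I mean: every link found on a page is reported, but only links inside the configured domain are queued for crawling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def split_links(page_url, html, domain):
    """Return (reported, to_follow) for one fetched page.

    reported  -> every link on the page, resolved to an absolute URL
    to_follow -> only the links that stay inside `domain`
    """
    parser = LinkCollector()
    parser.feed(html)
    reported = [urljoin(page_url, href) for href in parser.hrefs]
    to_follow = [url for url in reported if in_domain(url, domain)]
    return reported, to_follow
```

With a page at domainx.com/index.html that links to both /about.html and www.coderanch.com, `reported` would contain both URLs while `to_follow` would contain only the domainx.com one. That is the output I would like to get from the link dump.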
