I just started exploring Nutch for crawling a certain list of domains. What I want to do is follow all the links within a specific domain: "domainx.com".
That part is easy to configure with Nutch's URL filters.
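For reference, this is roughly what a domain restriction looks like in the stock regex URL filter — a sketch only, where `domainx.com` stands in for the real domain:

```
# conf/regex-urlfilter.txt (sketch; domainx.com is a placeholder)

# accept URLs on the target domain and its subdomains
+^https?://([a-z0-9-]+\.)*domainx\.com/

# reject everything else
-.
```

With this in place, the fetcher never leaves the domain, which is exactly why external links also stop showing up downstream.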
But when I run the command to dump the link database:
bin/nutch readlinkdb crawl/linkdb -dump links
I realized that I only get the links allowed by the domain filter. What I want is for the crawler to report all links found on the pages of the configured domain, including links pointing outside the domain, without actually following them. So if www.coderanch.com is linked from domainx.com/index.html, I want that link reported but not crawled. I hope that makes sense.
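If it helps anyone looking at this: as far as I understand, the linkdb only contains links that survived the URL filters, but the parser records the raw outlinks of each fetched page in the segment's parse data before filtering. So one approach (a sketch, untested on your setup — the segment name below is a placeholder) is to keep the restrictive filter for crawling and read the outlinks from the segment dump instead of the linkdb:

```shell
# Sketch: dump only the parse data of one segment, which includes the
# outlinks the parser extracted (even ones the URL filters later rejected).
# "20240101000000" is a placeholder segment name.
bin/nutch readseg -dump crawl/segments/20240101000000 seg_dump \
    -nocontent -nofetch -nogenerate -noparse -noparsetext
```

The resulting dump should contain ParseData records listing each page's outlinks, so external URLs like www.coderanch.com appear there even though they were never crawled.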