I just started exploring Nutch for crawling a certain list of domains. What I want to do is follow all the links from a specific domain: "domainx.com".
That part is easy to configure in the URL filter configuration.
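For reference, here is a minimal sketch of what I have in my URL filter file (assuming the default conf/regex-urlfilter.txt; "domainx.com" is a placeholder):

```
# conf/regex-urlfilter.txt (sketch; domainx.com is a placeholder)

# accept anything under domainx.com, including subdomains
+^https?://([a-z0-9-]+\.)*domainx\.com/

# reject everything else
-.
```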
But when I run the command to dump the link database:
bin/nutch readlinkdb crawl/linkdb -dump links
I realized that I only get the links that pass the domain filter. I want the crawler to report all links found on the pages of the domain I configured, including the ones pointing outside the domain, but without following them. So if www.coderanch.com appears inside domainx.com/index.html, I want that link to be reported but not crawled. I hope I'm explaining myself clearly.
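For what it's worth, I also looked at the db.ignore.external.links property in conf/nutch-site.xml, but as far as I can tell it doesn't do what I want either: setting it to true keeps the crawl inside the seed domain, yet it seems to discard the external outlinks entirely, so they never show up in the linkdb at all (this is a sketch of what I tried):

```
<!-- conf/nutch-site.xml (sketch of what I tried) -->
<property>
  <name>db.ignore.external.links</name>
  <!-- true confines the crawl to the seed domains, but it also seems
       to drop external outlinks, so they never reach the linkdb -->
  <value>true</value>
</property>
```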