This week's book giveaway is in the Artificial Intelligence and Machine Learning forum.
We're giving away four copies of Machine Learning with R: Expert techniques for predictive modeling and have Brett Lantz on-line!
See this thread for details.
Win a copy of Machine Learning with R: Expert techniques for predictive modeling this week in the Artificial Intelligence and Machine Learning forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Liutauras Vilda
  • Junilu Lacar
  • Jeanne Boyarsky
  • Bear Bibeault
Sheriffs:
  • Knute Snortum
  • Tim Cooke
  • Devaka Cooray
Saloon Keepers:
  • Ron McLeod
  • Stephan van Hulst
  • Tim Moores
  • Tim Holloway
  • Carey Brown
Bartenders:
  • Piet Souris
  • Frits Walraven
  • Ganesh Patekar

Nutch -> Report all domain links but follow just a sublist

 
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I just started exploring nutch for crawling a certain list of domains. What I want to do is to follow all the links from a specific domain: "domainx.com".
That is easy to configure in the:



But when I run the command to create the links database:

bin/nutch readlinkdb crawl/linkdb -dump links

I realized that I only get the links from the domain filter. I want the crawler to report all available links contained in the domain I configured, including the one outside the domain but not following them. So if www.coderanch.com is contained inside domainx.com/index.html, I want that link to be reported but not crawled. Hope I'm explaining myself.

Thanks!
 
It is an experimental device that will make my mind that most powerful force on earth! More powerful than this tiny ad!
Java file APIs (DOC, XLS, PDF, and many more)
https://products.aspose.com/total/java
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!