nutch 0.7-devel and url redirect

 
Hi,

I am a newcomer to the Nutch search engine (nutch-0.7.2).

After installing Nutch and Tomcat, I tried to crawl three URLs, one of which is my web application on JBoss.

I used this command:

nutch crawl urls -dir crawl -depth 3 >& crawl.log

where urls is a file under the Nutch directory that contains three URLs:
"http://localhost:8080/vinweb"
"http://www.orkut.co.in"
"http://apache.com"


But after crawling, I checked crawl.log, and it seems that
nothing was fetched:

080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.

Here is my full crawl.log file:
*****************************************
run java in C:\Program Files\Java\jdk1.5.0_12
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-default.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/crawl-tool.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-site.xml
080901 193120 No FS indicated, using default:local
080901 193120 crawl started in: crawl
080901 193120 rootUrlFile = urls
080901 193120 threads = 10
080901 193120 depth = 3
080901 193120 Created webdb at LocalFS,E:\SearchTools\nutch-0.7.2\crawl\db
080901 193120 Starting URL processing
080901 193120 Plugins: looking in: E:\SearchTools\nutch-0.7.2\plugins
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\clustering-carrot2
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\creativecommons
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\index-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\index-more
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\language-identifier
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\ontology
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-ext
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-html\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-js
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-msword
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-pdf
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-rss
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-text\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-file
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-ftp
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\protocol-http\plugin.xml
080901 193120 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-httpclient
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\query-more
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-site\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-url\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-prefix
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
080901 193120 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
080901 193120 found resource crawl-urlfilter.txt at file:/E:/SearchTools/nutch-0.7.2/conf/crawl-urlfilter.txt
.080901 193120 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
080901 193120 bad url: "http://localhost:8080/vinweb"
.080901 193120 bad url: "http://www.orkut.co.in"
....080901 193120 Added 0 pages
080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193121 Overall processing: Sorted NaN entries/second
080901 193121 FetchListTool completed
080901 193121 logging at INFO
080901 193122 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193122 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193122 Finishing update
080901 193122 Update finished
080901 193122 FetchListTool started
080901 193122 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193122 Overall processing: Sorted NaN entries/second
080901 193122 FetchListTool completed
080901 193122 logging at INFO
080901 193123 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193123 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193123 Finishing update
080901 193123 Update finished
080901 193123 FetchListTool started
080901 193123 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193123 Overall processing: Sorted NaN entries/second
080901 193124 FetchListTool completed
080901 193124 logging at INFO
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Finishing update
080901 193125 Update finished
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Sorting pages by url...
080901 193125 Getting updated scores and anchors from db...
080901 193125 Sorting updates by segment...
080901 193125 Updating segments...
080901 193125 Done updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 * Opening segment 20080901193120
080901 193125 * Indexing segment 20080901193120
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193120: total 0 records in 0.047 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 * Opening segment 20080901193122
080901 193125 * Indexing segment 20080901193122
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193122: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 * Opening segment 20080901193123
080901 193125 * Indexing segment 20080901193123
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193123: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 Reading url hashes...
080901 193125 Sorting url hashes...
080901 193125 Deleting url duplicates...
080901 193125 Deleted 0 url duplicates.
080901 193125 Reading content hashes...
080901 193125 Sorting content hashes...
080901 193125 Deleting content duplicates...
080901 193125 Deleted 0 content duplicates.
080901 193125 Duplicate deletion complete locally. Now returning to NFS...
080901 193125 DeleteDuplicates complete
080901 193125 Merging segment indexes...
080901 193125 crawl finished: crawl

*******************************************
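
The "bad url:" lines in the log above print each seed with its double quotes still attached, so one thing that may be worth ruling out is whether the quotes in the urls file are being read as part of each URL. The sketch below is not Nutch code: SeedCheck, isParsable, and stripQuotes are hypothetical names, and it only assumes that a seed line has to parse as an absolute URL using the standard java.net classes.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: checks whether a seed line
// parses as an absolute URL with the standard java.net classes.
class SeedCheck {

    // Returns true if the trimmed line parses as an absolute URL.
    static boolean isParsable(String line) {
        try {
            new URI(line.trim()).toURL();
            return true;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Illegal characters (such as a double quote), an unknown scheme,
            // or a relative URI all end up here.
            return false;
        }
    }

    // Removes one pair of surrounding double quotes, if present.
    static String stripQuotes(String line) {
        String s = line.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        // The three seed lines exactly as they appear in the urls file.
        String[] seeds = {
            "\"http://localhost:8080/vinweb\"",
            "\"http://www.orkut.co.in\"",
            "\"http://apache.com\""
        };
        for (String seed : seeds) {
            System.out.println(seed
                + " parsable as written: " + isParsable(seed)
                + ", parsable without quotes: " + isParsable(stripQuotes(seed)));
        }
    }
}
```

If the quoted form is rejected and the unquoted form is accepted, listing the URLs in the urls file without quotes, one per line, would be a reasonable first thing to try.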

These are the entries in my crawl-urlfilter.txt:

*******************************************

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*synapse.com (where synapse is my domain name)

+^http://([a-z0-9]*\.)*apache.org

+^http://([a-z0-9]*\.)*localhost:8080/vinweb

+^http://([a-z0-9]*\.)*orkut.co.in


# skip everything else
-.

*************************************************
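
The comments in the filter file describe a first-match-wins rule: the first pattern that matches decides, and a URL that matches nothing is ignored. A minimal sketch of that rule follows, to make it easier to test a seed against the patterns by hand. It is not Nutch's RegexURLFilter; FirstMatchFilter is a hypothetical class, and it assumes java.util.regex behaves closely enough to Nutch's regex engine for these patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical first-match-wins filter, modelled on the comments in
// crawl-urlfilter.txt. Not the actual Nutch implementation.
class FirstMatchFilter {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each non-comment, non-blank line is '+' or '-' followed by a regex.
    FirstMatchFilter(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides;
    // a URL that matches no rule is ignored.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FirstMatchFilter filter = new FirstMatchFilter(Arrays.asList(
            "# skip file:, ftp:, & mailto: urls",
            "-^(file|ftp|mailto):",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*apache.org",
            "+^http://([a-z0-9]*\\.)*localhost:8080/vinweb",
            "+^http://([a-z0-9]*\\.)*orkut.co.in",
            "-."));
        String[] candidates = {
            "http://localhost:8080/vinweb",
            "http://www.orkut.co.in",
            "http://apache.com",
            "\"http://apache.com\""
        };
        for (String url : candidates) {
            System.out.println(url + " accepted: " + filter.accepts(url));
        }
    }
}
```

Under this sketch, the seed http://apache.com is not covered by the apache.org pattern and falls through to the final "-." rule, so the seed list and the filter may need to agree on the domain. A URL that still carries a leading double quote also fails every "^http://" pattern.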

And the search returns NULL in the web UI.

Any suggestions would be very helpful.

Thanks,
 