Hi,
I am a newcomer to the Nutch search engine (nutch-0.7.2).
After installing Nutch and
Tomcat, I tried to crawl three URLs, one of which is my web application on
JBoss,
using the command:
nutch crawl urls -dir crawl -depth 3 >& crawl.log
where urls is a file under the nutch directory that contains three URLs:
"http://localhost:8080/vinweb"
"http://www.orkut.co.in"
"http://apache.com"
But after crawling, I checked crawl.log, and it seems it
didn't fetch anything:
080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.
The following is my full crawl.log file:
*****************************************
run
java in C:\Program Files\Java\jdk1.5.0_12
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-default.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/crawl-tool.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-site.xml
080901 193120 No FS indicated, using default:local
080901 193120 crawl started in: crawl
080901 193120 rootUrlFile = urls
080901 193120 threads = 10
080901 193120 depth = 3
080901 193120 Created webdb at LocalFS,E:\SearchTools\nutch-0.7.2\crawl\db
080901 193120 Starting URL processing
080901 193120 Plugins: looking in: E:\SearchTools\nutch-0.7.2\plugins
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\clustering-carrot2
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\creativecommons
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\index-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\index-more
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\language-identifier
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\ontology
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-ext
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-html\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-js
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-msword
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-pdf
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-rss
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-text\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-file
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-ftp
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\protocol-http\plugin.xml
080901 193120 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-httpclient
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\query-more
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-site\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-url\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-prefix
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
080901 193120 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
080901 193120 found resource crawl-urlfilter.txt at file:/E:/SearchTools/nutch-0.7.2/conf/crawl-urlfilter.txt
.080901 193120 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
080901 193120 bad url: "http://localhost:8080/vinweb"
.080901 193120 bad url: "http://www.orkut.co.in"
....080901 193120 Added 0 pages
080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193121 Overall processing: Sorted NaN entries/second
080901 193121 FetchListTool completed
080901 193121 logging at INFO
080901 193122 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193122 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193122 Finishing update
080901 193122 Update finished
080901 193122 FetchListTool started
080901 193122 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193122 Overall processing: Sorted NaN entries/second
080901 193122 FetchListTool completed
080901 193122 logging at INFO
080901 193123 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193123 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193123 Finishing update
080901 193123 Update finished
080901 193123 FetchListTool started
080901 193123 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193123 Overall processing: Sorted NaN entries/second
080901 193124 FetchListTool completed
080901 193124 logging at INFO
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Finishing update
080901 193125 Update finished
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Sorting pages by url...
080901 193125 Getting updated scores and anchors from db...
080901 193125 Sorting updates by segment...
080901 193125 Updating segments...
080901 193125 Done updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 * Opening segment 20080901193120
080901 193125 * Indexing segment 20080901193120
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193120: total 0 records in 0.047 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 * Opening segment 20080901193122
080901 193125 * Indexing segment 20080901193122
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193122: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 * Opening segment 20080901193123
080901 193125 * Indexing segment 20080901193123
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193123: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 Reading url hashes...
080901 193125 Sorting url hashes...
080901 193125 Deleting url duplicates...
080901 193125 Deleted 0 url duplicates.
080901 193125 Reading content hashes...
080901 193125 Sorting content hashes...
080901 193125 Deleting content duplicates...
080901 193125 Deleted 0 content duplicates.
080901 193125 Duplicate deletion complete locally. Now returning to NFS...
080901 193125 DeleteDuplicates complete
080901 193125 Merging segment indexes...
080901 193125 crawl finished: crawl
*******************************************
The following entries are in my crawl-urlfilter.txt:
*******************************************
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*synapse.com (where synapse is my domain name)
+^http://([a-z0-9]*\.)*apache.org
+^http://([a-z0-9]*\.)*localhost:8080/vinweb
+^http://([a-z0-9]*\.)*orkut.co.in
# skip everything else
-.
*************************************************
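For reference, here is a minimal sketch of how I understand the first-match rule described in the comments of that file. It is not Nutch's actual code: the names RULES and accepts are made up for illustration, Python's re module stands in for whatever regex engine Nutch uses, and unanchored (search) matching is an assumption based on patterns such as "-." and "-[?*!@=]". The patterns themselves are copied from the file above.

```python
import re

# (sign, regex) pairs in the same order as the crawl-urlfilter.txt above.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|png|PNG)$"),
    ("-", r"[?*!@=]"),
    ("+", r"^http://([a-z0-9]*\.)*synapse.com"),
    ("+", r"^http://([a-z0-9]*\.)*apache.org"),
    ("+", r"^http://([a-z0-9]*\.)*localhost:8080/vinweb"),
    ("+", r"^http://([a-z0-9]*\.)*orkut.co.in"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern matches is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a rule applies if its pattern is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched at all: the URL is ignored.
    return False


if __name__ == "__main__":
    for url in ("http://localhost:8080/vinweb",
                "http://www.orkut.co.in",
                "http://apache.com",
                "http://lucene.apache.org/nutch/"):
        print(url, "accepted" if accepts(url) else "ignored")
```

Under these assumptions, http://apache.com from my urls file would be ignored, because the file only has a '+' rule for apache.org, so the final "-." rule is the first one that matches it.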
And the search returns no results (NULL) in the web UI.
Any suggestions would be very helpful.
Thanks,