Java Mailing List Archive

http://www.java2.5341.com/

nutch-user.lucene

No results on sites other than www.apache.org

Daniel Garcia

2008-06-10



I've followed the tutorial on the Wiki site and have successfully indexed a few pages on www.apache.org with the command

bin/nutch crawl /etc/opt/nutch/urls -dir /var/lib/nutch-crawls/test1 -depth 3 -topN 50

A query for "apache" on my local Nutch/Tomcat installation gives me 52 matching pages. Next I changed

/usr/local/nutch/conf/crawl-urlfilter.txt

to allow www.circuitcity.com by adding the rule +^http://www.circuitcity.com/ and also added the site's root page to /etc/opt/nutch/urls/circuitcity. I cleared out my test run with
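As a rough illustration of what that filter rule is expected to do, here is a minimal bash sketch. It is not Nutch code. It assumes that rules are tried from top to bottom, that the first matching pattern decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The rules array, the trailing catch-all rule, and the url_allowed function are hypothetical names made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only; this is not how Nutch is implemented.
# Assumption: rules are tried in order, the first matching pattern decides
# (+ accepts, - rejects), and a URL that matches no rule is rejected.

rules=(
  '+^http://www.circuitcity.com/'   # the rule from the post
  '-.'                              # assumed catch-all: reject everything else
)

# Exit status 0 if the URL is accepted, 1 if it is rejected.
url_allowed() {
  local url=$1 rule sign regex
  for rule in "${rules[@]}"; do
    sign=${rule:0:1}     # leading + or -
    regex=${rule:1}      # remainder is the regular expression
    if [[ $url =~ $regex ]]; then
      [[ $sign == '+' ]] # status of this test becomes the return value
      return
    fi
  done
  return 1               # no rule matched
}

url_allowed 'http://www.circuitcity.com/index.html' && echo 'accepted'
url_allowed 'http://www.apache.org/' || echo 'rejected'
```

Under these assumptions the rule only admits URLs that begin with exactly http://www.circuitcity.com/, so an https:// address or a different host name (for example one without the www prefix) would be rejected.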

rm /var/lib/nutch-crawls/test1/* -Rf

and reran my crawl

bin/nutch crawl /etc/opt/nutch/urls -dir /var/lib/nutch-crawls/test1 -depth 3 -topN 50

It looks like it downloads plenty of pages (all from circuitcity). However, when I search for anything in the Tomcat/Nutch app, I get 0 results every time. If I switch back to apache, the index turns up results again. Is there a config file I missed somewhere?

Regards,
Daniel Garcia


   