Java Mailing List Archive


Something very, very strange....about how my nutch runs... please
help!

nutch_newbie

2008-06-14



My Nutch install is on my localhost and seems to be running fine...
Here is what is very strange: out of the 18 websites I put in
crawl-urlfilter.txt and in my urls folder, only 3 websites come up in the
search, and one of them is not even on my list... Very weird! Please take a
look at my configuration (below) and see if you have any suggestions. (I
suspect that I need to recrawl or something, but the recrawl script on the
Nutch wiki didn't work. Also, shouldn't Google results come up too?)
The 3 websites that show up in searches are http://www.horse.com,
http://en.wikipedia.org, and this one, which is not on my list:
http://www.ansi.okstate.edu/.
Also, if I edit
/opt/apache-tomcat-5.5.16/webapps/nutch-0.8.1/WEB-INF/classes/nutch-site.xml,
then Nutch doesn't search at all! It just says "Hits 0-0 (out of about 0
total matching pages):".

Here are some files I edited:
crawl-urlfilter.txt
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*en.wikipedia.org/
+^http://([a-z0-9]*\.)*www.google.com/
+^http://([a-z0-9]*\.)*www.search.yahoo.com/
+^http://([a-z0-9]*\.)*www.apache.org/
+^http://([a-z0-9]*\.)*www.yahoo.com/
+^http://([a-z0-9]*\.)*www.amazon.com/
+^http://([a-z0-9]*\.)*www.about.com/
+^http://([a-z0-9]*\.)*www.bartleby.com/
+^http://([a-z0-9]*\.)*www.cnn.com/
+^http://([a-z0-9]*\.)*www.download.com/
+^http://([a-z0-9]*\.)*www.reference.com/
+^http://([a-z0-9]*\.)*www.weather.com/
+^http://([a-z0-9]*\.)*www.nih.gov/
+^http://([a-z0-9]*\.)*www.usa.gov/
+^http://([a-z0-9]*\.)*www.monster.com/
+^http://([a-z0-9]*\.)*www.time.com/time/
+^http://([a-z0-9]*\.)*www.boerwar.us

shoppinglist.txt (in the urls folder)
http://en.wikipedia.org
http://www.google.com
http://search.yahoo.com/
http://www.yahoo.com/
http://www.amazon.com/
http://www.about.com/
http://www.bartleby.com/
http://www.cnn.com/
http://www.download.com/
http://www.reference.com/
http://www.wikipedia.org/
http://www.weather.com/
http://www.nih.gov/
http://www.usa.gov/
http://www.monster.com/
http://www.time.com/time/
http://boerwar.us
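
One way to sanity-check the two lists above against each other is to run every seed URL through the accept patterns and see which ones no pattern matches. The following is only a rough sketch, not Nutch's own code: it uses Python's `re` module as a stand-in for the Java regular expressions Nutch uses, it assumes the patterns are applied as searches from the start of the URL, and it assumes a bare host such as `http://boerwar.us` is treated as `http://boerwar.us/` before filtering. The helper names (`with_root_path`, `is_accepted`, `rejected_seeds`) are made up for this example.

```python
import re
from urllib.parse import urlsplit

# Accept patterns copied from crawl-urlfilter.txt above (leading "+" dropped).
ACCEPT_PATTERNS = [
    r"^http://([a-z0-9]*\.)*en.wikipedia.org/",
    r"^http://([a-z0-9]*\.)*www.google.com/",
    r"^http://([a-z0-9]*\.)*www.search.yahoo.com/",
    r"^http://([a-z0-9]*\.)*www.apache.org/",
    r"^http://([a-z0-9]*\.)*www.yahoo.com/",
    r"^http://([a-z0-9]*\.)*www.amazon.com/",
    r"^http://([a-z0-9]*\.)*www.about.com/",
    r"^http://([a-z0-9]*\.)*www.bartleby.com/",
    r"^http://([a-z0-9]*\.)*www.cnn.com/",
    r"^http://([a-z0-9]*\.)*www.download.com/",
    r"^http://([a-z0-9]*\.)*www.reference.com/",
    r"^http://([a-z0-9]*\.)*www.weather.com/",
    r"^http://([a-z0-9]*\.)*www.nih.gov/",
    r"^http://([a-z0-9]*\.)*www.usa.gov/",
    r"^http://([a-z0-9]*\.)*www.monster.com/",
    r"^http://([a-z0-9]*\.)*www.time.com/time/",
    r"^http://([a-z0-9]*\.)*www.boerwar.us",
]

# Seed URLs copied from shoppinglist.txt above.
SEED_URLS = [
    "http://en.wikipedia.org",
    "http://www.google.com",
    "http://search.yahoo.com/",
    "http://www.yahoo.com/",
    "http://www.amazon.com/",
    "http://www.about.com/",
    "http://www.bartleby.com/",
    "http://www.cnn.com/",
    "http://www.download.com/",
    "http://www.reference.com/",
    "http://www.wikipedia.org/",
    "http://www.weather.com/",
    "http://www.nih.gov/",
    "http://www.usa.gov/",
    "http://www.monster.com/",
    "http://www.time.com/time/",
    "http://boerwar.us",
]


def with_root_path(url):
    """Assumption: a URL with no path is treated as if it ended in "/"."""
    return url + "/" if urlsplit(url).path == "" else url


def is_accepted(url, patterns=ACCEPT_PATTERNS):
    """True if at least one accept pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


def rejected_seeds(seeds=SEED_URLS, patterns=ACCEPT_PATTERNS):
    """Return the seeds (with a root path added) that no pattern accepts."""
    normalized = [with_root_path(url) for url in seeds]
    return [url for url in normalized if not is_accepted(url, patterns)]


if __name__ == "__main__":
    for url in rejected_seeds():
        print("no accept pattern matches:", url)
```

Under those assumptions, three seeds are left without a matching pattern: `http://search.yahoo.com/`, `http://www.wikipedia.org/` and `http://boerwar.us/`. Each pattern that could cover them requires a literal `www` or `en` prefix that the seed does not have. This only checks the seed list against the filter; it says nothing about why pages from a host outside both lists end up in the index.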

the nutch-site.xml (/usr/nutch-0.8.1/conf)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>searcher.dir</name>
 <value>"/usr/nutch-0.8.1/crawl/"</value>
</property>
<property>
 <name>plugin.includes</name>
 <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

<property>
 <name>http.agent.name</name>
 <value>Kate</value>
 <description>Kate H.
 </description>
</property>

<property>
 <name>http.agent.description</name>
 <value>Nutch spiderman</value>
 <description> Nutch spiderman
 </description>
</property>

<property>
 <name>http.agent.email</name>
 <value>MyEmail</value>
 <description>kateiafrika@(protected)
 </description>
</property>

</configuration>
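
One detail in the file above that may be worth checking is the `searcher.dir` value, which is wrapped in double quotes. As far as I understand it, property values in this kind of configuration file are read as literal text, so the quote characters would become part of the path; treat that as an assumption to verify rather than a confirmed diagnosis. The sketch below uses a hypothetical helper, `quoted_values`, and a shortened copy of the configuration, and simply lists the properties whose values start and end with a quote character.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the nutch-site.xml shown above (two properties only).
SAMPLE_CONFIG = """<configuration>
<property>
 <name>searcher.dir</name>
 <value>"/usr/nutch-0.8.1/crawl/"</value>
</property>
<property>
 <name>http.agent.name</name>
 <value>Kate</value>
</property>
</configuration>"""


def quoted_values(xml_text):
    """Map property name -> value for values wrapped in matching quotes."""
    root = ET.fromstring(xml_text)
    flagged = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        # Flag values such as "/some/path" or '/some/path'.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            flagged[name] = value
    return flagged


if __name__ == "__main__":
    for name, value in quoted_values(SAMPLE_CONFIG).items():
        print(name, "has a quoted value:", value)
```

If the quotes do turn out to be the problem, the same check could be pointed at the copy of nutch-site.xml under the Tomcat webapp directory, since that is the file the search page reads in the setup described above.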


Thanks in advance!




--
Sent from the Nutch - User mailing list archive at Nabble.com.
