My Nutch is on my localhost and seems to be running fine...
Here is what is very strange: out of the 18 websites I put in
crawl-urlfilter.txt and in my urls folder, only 3 websites come up in the
search, and one of them is not even on my list... Very weird! Please take a
look at my configuration (below) and see if you have any suggestions. (I
suspect that I need to recrawl or something, but the recrawl script on the
Nutch wiki didn't work. Also, shouldn't Google results come up too?)
The 3 websites that are searched are http://www.horse.com,
http://en.wikipedia.org, and this one, which is not on my list:
http://www.ansi.okstate.edu/.
Also, if I edit
/opt/apache-tomcat-5.5.16/webapps/nutch-0.8.1/WEB-INF/classes/nutch-site.xml,
then my Nutch doesn't search at all! It just says "Hits 0-0 (out of about 0
total matching pages):".
Here are some files I edited:
crawl-urlfilter.txt
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*en.wikipedia.org/
+^http://([a-z0-9]*\.)*www.google.com/
+^http://([a-z0-9]*\.)*www.search.yahoo.com/
+^http://([a-z0-9]*\.)*www.apache.org/
+^http://([a-z0-9]*\.)*www.yahoo.com/
+^http://([a-z0-9]*\.)*www.amazon.com/
+^http://([a-z0-9]*\.)*www.about.com/
+^http://([a-z0-9]*\.)*www.bartleby.com/
+^http://([a-z0-9]*\.)*www.cnn.com/
+^http://([a-z0-9]*\.)*www.download.com/
+^http://([a-z0-9]*\.)*www.reference.com/
+^http://([a-z0-9]*\.)*www.weather.com/
+^http://([a-z0-9]*\.)*www.nih.gov/
+^http://([a-z0-9]*\.)*www.usa.gov/
+^http://([a-z0-9]*\.)*www.monster.com/
+^http://([a-z0-9]*\.)*www.time.com/time/
+^http://([a-z0-9]*\.)*www.boerwar.us
shoppinglist.txt (in the urls folder)
http://en.wikipedia.org
http://www.google.com
http://search.yahoo.com/
http://www.yahoo.com/
http://www.amazon.com/
http://www.about.com/
http://www.bartleby.com/
http://www.cnn.com/
http://www.download.com/
http://www.reference.com/
http://www.wikipedia.org/
http://www.weather.com/
http://www.nih.gov/
http://www.usa.gov/
http://www.monster.com/
http://www.time.com/time/
http://boerwar.us
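One way to cross-check the two lists above is to run every seed URL against every accept rule and see which seeds no rule matches. The following is only a minimal sketch, not Nutch's own filter code. It assumes that a URL is accepted when any `+` pattern is found in it, that a URL matching no rule is rejected, and that bare host URLs get a trailing slash before filtering. The helper names (`load_patterns`, `normalize`, `rejected_seeds`) are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Accept rules copied verbatim from crawl-urlfilter.txt above.
FILTER_RULES = r"""
+^http://([a-z0-9]*\.)*en.wikipedia.org/
+^http://([a-z0-9]*\.)*www.google.com/
+^http://([a-z0-9]*\.)*www.search.yahoo.com/
+^http://([a-z0-9]*\.)*www.apache.org/
+^http://([a-z0-9]*\.)*www.yahoo.com/
+^http://([a-z0-9]*\.)*www.amazon.com/
+^http://([a-z0-9]*\.)*www.about.com/
+^http://([a-z0-9]*\.)*www.bartleby.com/
+^http://([a-z0-9]*\.)*www.cnn.com/
+^http://([a-z0-9]*\.)*www.download.com/
+^http://([a-z0-9]*\.)*www.reference.com/
+^http://([a-z0-9]*\.)*www.weather.com/
+^http://([a-z0-9]*\.)*www.nih.gov/
+^http://([a-z0-9]*\.)*www.usa.gov/
+^http://([a-z0-9]*\.)*www.monster.com/
+^http://([a-z0-9]*\.)*www.time.com/time/
+^http://([a-z0-9]*\.)*www.boerwar.us
"""

# Seed URLs copied verbatim from shoppinglist.txt above.
SEEDS = """
http://en.wikipedia.org
http://www.google.com
http://search.yahoo.com/
http://www.yahoo.com/
http://www.amazon.com/
http://www.about.com/
http://www.bartleby.com/
http://www.cnn.com/
http://www.download.com/
http://www.reference.com/
http://www.wikipedia.org/
http://www.weather.com/
http://www.nih.gov/
http://www.usa.gov/
http://www.monster.com/
http://www.time.com/time/
http://boerwar.us
"""


def load_patterns(rules_text):
    """Compile the regex of every '+' (accept) line; other lines are skipped."""
    patterns = []
    for line in rules_text.splitlines():
        line = line.strip()
        if line.startswith("+"):
            patterns.append(re.compile(line[1:]))
    return patterns


def normalize(url):
    """Assumption: a URL with an empty path is filtered as if it ended in '/'."""
    if urlsplit(url).path == "":
        return url + "/"
    return url


def rejected_seeds(rules_text, seeds_text):
    """Return the seeds that no accept pattern matches, in their original order."""
    patterns = load_patterns(rules_text)
    rejected = []
    for seed in seeds_text.split():
        url = normalize(seed)
        if not any(p.search(url) for p in patterns):
            rejected.append(seed)
    return rejected


if __name__ == "__main__":
    for seed in rejected_seeds(FILTER_RULES, SEEDS):
        print("no accept rule matches:", seed)
```

Under those assumptions, the seeds reported are http://search.yahoo.com/, http://www.wikipedia.org/ and http://boerwar.us. Each of these hosts differs from its rule by a `www.` or `en.` label that the pattern requires but the seed lacks.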
the nutch-site.xml (/usr/nutch-0.8.1/conf)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>"/usr/nutch-0.8.1/crawl/"</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
<property>
<name>http.agent.name</name>
<value>Kate</value>
<description>Kate H.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>Nutch spiderman</value>
<description> Nutch spiderman
</description>
</property>
<property>
<name>http.agent.email</name>
<value>MyEmail</value>
<description>kateiafrika@(protected)
</description>
</property>
</configuration>
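The `searcher.dir` value above is wrapped in literal double quotes. This post does not establish whether that matters, so the sketch below only lists property values that begin and end with a quote character, leaving them to be checked by hand. It uses Python's standard `xml.etree.ElementTree`. The function name `quoted_values` and the shortened `SAMPLE` string are hypothetical stand-ins for the real file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for nutch-site.xml; the real file has more properties.
SAMPLE = """<configuration>
<property>
<name>searcher.dir</name>
<value>"/usr/nutch-0.8.1/crawl/"</value>
</property>
<property>
<name>http.agent.name</name>
<value>Kate</value>
</property>
</configuration>"""


def quoted_values(xml_text):
    """Return (name, value) pairs whose value starts and ends with the same quote."""
    root = ET.fromstring(xml_text.strip())
    flagged = []
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            flagged.append((name, value))
    return flagged


if __name__ == "__main__":
    for name, value in quoted_values(SAMPLE):
        print("value is wrapped in quotes:", name, value)
```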
Thanks in advance-
--
Sent from the Nutch - User mailing list archive at Nabble.com.