Good afternoon all,
My installation of Nutch appears to be ignoring the robots.txt for a site I'm crawling (http://www.gardenanimals.co.nz/). The site has a robots.txt that contains:
User-agent: *
Disallow: /bot-trap/
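As a sanity check on the rules themselves, independent of Nutch, here is a minimal sketch that feeds the two robots.txt lines above to Python's standard-library `urllib.robotparser` and asks whether the `searchnz` agent may fetch the bot-trap URL. The helper name `is_allowed` is made up for illustration, and this parser is not the one Nutch uses, so it only confirms how the rules read, not how Nutch applies them.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""


def is_allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; it uses Python's parser, not Nutch's.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.gardenanimals.co.nz/",
        "http://www.gardenanimals.co.nz/bot-trap/index.php",
    ):
        verdict = "allowed" if is_allowed("searchnz", url) else "disallowed"
        print(url, "->", verdict)
```

Under these rules the site root comes back as allowed and anything under /bot-trap/ as disallowed, for any agent name, since the only record is `User-agent: *`.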
The hadoop.log contains:
INFO http.Http - protocol.plugin.check.blocking = true
INFO http.Http - protocol.plugin.check.robots = true
so I assume I've configured Nutch to honour the robots.txt file. But as this entry from the crawldb shows
http://www.gardenanimals.co.nz/bot-trap/index.php Version: 5
Status: 3 (db_gone)
Fetch time: Mon Sep 01 13:01:45 GMT+12:00 2008
Modified time: Thu Jan 01 12:00:00 GMT+12:00 1970
Retries since fetch: 0
Retry interval: 7.0 days
Score: 0.003542109
Signature: null
Metadata: _pst_:robots_denied(18), lastModified=0: http://www.gardenanimals.co.nz/bot-trap/index.php
Nutch has still gone and fetched a banned URL, thus triggering a bot-trap. I have no idea what I've misconfigured or failed to configure; any pointers would be greatly appreciated. Below is my actual nutch-site.xml file, in case it helps.
Thanks
David
<configuration>
<property>
<name>http.agent.name</name><value>searchnz</value>
</property>
<property>
<name>http.robots.agents</name><value>searchnz,*</value>
</property>
<property>
<name>http.agent.description</name><value>searchnz</value>
</property>
<property>
<name>http.agent.url</name><value>http://www.searchnz.co.nz/</value>
</property>
<property>
<name>http.agent.email</name><value>robot@(protected)</value>
</property>
<property>
<name>http.verbose</name><value>true</value>
</property>
<property>
<name>http.robots.403.allow</name><value>false</value>
</property>
<property>
<name>fetcher.threads.fetch</name><value>50</value>
</property>
<property>
<name>db.default.fetch.interval</name><value>7</value>
</property>
<property>
<name>plugin.includes</name><value>protocol-http|parse-(text|html)|urlfilter-prefix|urlfilter-suffix|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>urlfilter.prefix.file</name><value>urlfilter-prefix.txt</value>
</property>
<property>
<name>urlfilter.suffix.file</name><value>urlfilter-suffix.txt</value>
</property>
</configuration>
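For anyone comparing several crawldb entries like the one quoted above, here is a rough sketch that splits such a record into its fields and pulls out the `_pst_` value from the Metadata line. The function names `parse_record` and `protocol_status` are hypothetical, and the layout is assumed from the single record in this post rather than from any Nutch specification. Read literally, the entry carries a `robots_denied` status alongside `db_gone`, which may mean the fetcher refused the page rather than downloading it; that reading is an assumption and would need checking against the web server's access log.

```python
import re

# The crawldb entry quoted earlier in this post, reproduced verbatim.
RECORD = """\
http://www.gardenanimals.co.nz/bot-trap/index.php Version: 5
Status: 3 (db_gone)
Fetch time: Mon Sep 01 13:01:45 GMT+12:00 2008
Modified time: Thu Jan 01 12:00:00 GMT+12:00 1970
Retries since fetch: 0
Retry interval: 7.0 days
Score: 0.003542109
Signature: null
Metadata: _pst_:robots_denied(18), lastModified=0: http://www.gardenanimals.co.nz/bot-trap/index.php
"""


def parse_record(text: str) -> dict:
    """Split one dumped crawldb entry into a {field name: raw value} dict.

    Assumes the first line starts with the URL and every later line has
    the form 'Key: value', as in the record above.
    """
    lines = text.strip().splitlines()
    fields = {"url": lines[0].split()[0]}
    for line in lines[1:]:
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields


def protocol_status(fields: dict):
    """Return (status name, numeric code) from the _pst_ metadata, or None."""
    match = re.search(r"_pst_:\s*(\w+)\((\d+)\)", fields.get("Metadata", ""))
    return (match.group(1), int(match.group(2))) if match else None


if __name__ == "__main__":
    entry = parse_record(RECORD)
    print(entry["url"])
    print("crawldb status: ", entry["Status"])
    print("protocol status:", protocol_status(entry))
```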