Hi there,
I have a problem with my crawl failing at:
Dedup adding indexes in: crawls/test/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
I have searched for threads describing a similar problem and found
several; however, the only solution I could find was to apply the
patches from:
https://issues.apache.org/jira/browse/NUTCH-525
Applying deleteDups.patch and RededupUnitTest.patch made no
difference whatsoever.
Now, interestingly, my crawl runs fine on www.lovepigs.org.nz and
www.tegelchicken.co.nz, but fails when I try intranet.canterbury.ac.nz.
intranet.canterbury.ac.nz requires authentication, so I applied the
NUTCH-559v0.5.patch file. However, the error occurs with or without
this patch, and regardless of what I put in the
conf/httpclient-auth.xml file.
Does anyone have any ideas what I can do to fix this issue?
For reference, my conf/nutch-site.xml, conf/crawl-urlfilter.txt and
urls/urls.txt files are pasted below.
Please let me know if you need any further info.
--------------------------------------------
conf/nutch-site.xml
--------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>University of Canterbury Intranet</value>
    <description>University of Canterbury Intranet</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Intranet for University of Canterbury</value>
    <description>Intranet for University of Canterbury</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>Web Support Email</value>
    <description>websupport@(protected)</description>
  </property>
</configuration>
--------------------------------------------
--------------------------------------------
conf/crawl-urlfilter.txt
--------------------------------------------
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/
# skip everything else
-.
--------------------------------------------
--------------------------------------------
urls/urls.txt
--------------------------------------------
http://intranet.canterbury.ac.nz
--------------------------------------------
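In case it helps anyone reproduce this, here is a minimal sketch of how I understand the first-match rule described in the filter comments, applied to the rules and seed URL pasted above. It is plain Python, not Nutch's own code: the `accepts` helper and the `RULES` list are hypothetical names of mine, and I am assuming each pattern is applied with "find anywhere in the URL" semantics.

```python
import re

# (sign, pattern) pairs copied from the conf/crawl-urlfilter.txt pasted above.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Hypothetical helper: the first matching rule decides the outcome."""
    for sign, pattern in rules:
        # Assumption: patterns may match anywhere in the URL (search, not full match).
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched: the URL is ignored.
    return False


if __name__ == "__main__":
    for url in ("http://intranet.canterbury.ac.nz/",
                "http://intranet.canterbury.ac.nz"):
        print(url, "->", "accepted" if accepts(url) else "ignored")
```

Under these assumptions the seed with a trailing slash is accepted, while the seed exactly as written in urls/urls.txt (no trailing slash) falls through to the final `-.` rule. I have not confirmed whether Nutch normalizes the URL before filtering, so this may or may not be relevant to the failure.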
Regards
Rochelle Rees
Web Team, Student Recruitment and Development (SRD)
University of Canterbury, Te Whare Wananga o Waitaha
Rm: 419, Law Building
+64-3-364 2987 Ext: 6125
rochelle.rees@(protected)
http://www.canterbury.ac.nz/
For all web enquiries please contact:
websupport@(protected)
http://www.canterbury.ac.nz/web