Java Mailing List Archive

http://www.java2.5341.com/

Home » nutch-user.lucene »

Help Please! Nutch crawl fails on Dedup

Rochelle Rees

2008-05-19

Hi there,

I have a problem with my crawl failing at:

Dedup adding indexes in: crawls/test/indexes
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I searched for threads describing a similar problem and found several,
but the only solution suggested in them was to apply the patches from:
https://issues.apache.org/jira/browse/NUTCH-525
However, applying deleteDups.patch and RededupUnitTest.patch made no
difference whatsoever.

Now, interestingly, my crawl runs fine on www.lovepigs.org.nz and
www.tegelchicken.co.nz, but fails when I try intranet.canterbury.ac.nz.

intranet.canterbury.ac.nz requires authentication, so I applied the
NUTCH-559v0.5.patch file. However, the error occurs with or without
this patch, and regardless of what I put in the
conf/httpclient-auth.xml file.

Does anyone have any ideas what I can do to fix this issue?

For reference, my conf/nutch-site.xml, conf/crawl-urlfilter.txt and
urls/urls.txt files are pasted below.

Please let me know if you need any further info.

--------------------------------------------
conf/nutch-site.xml
--------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>University of Canterbury Intranet</value>
  <description>University of Canterbury Intranet</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Intranet for University of Canterbury</value>
  <description>Intranet for University of Canterbury</description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description></description>
</property>

<property>
  <name>http.agent.email</name>
  <value>Web Support Email</value>
  <description>websupport@(protected)</description>
</property>

</configuration>
--------------------------------------------
--------------------------------------------

conf/crawl-urlfilter.txt
--------------------------------------------
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/

# skip everything else
-.
--------------------------------------------
--------------------------------------------
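
The first-match rule described in the filter comments above can be checked outside Nutch with a small script. The following is only a rough sketch in Python, not Nutch's own code: the `accepts` helper and the test URLs are made up for illustration, it assumes a rule matches when its pattern is found anywhere in the URL, and Python's regex engine may differ from Java's in edge cases. The rules are copied from the listing above, with the wrapped suffix line rejoined.

```python
import re

# Rules copied from the conf/crawl-urlfilter.txt listing, in file order.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-[?*!@=]",
    r"-.*(/.+?)/.*?\1/.*?\1/",
    r"+^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/",
    r"-.",
]


def accepts(url, rules=RULES):
    """Hypothetical helper: apply the first rule whose pattern is found in url.

    Returns True for a '+' rule, False for a '-' rule, and False when
    no rule matches at all (the URL is ignored).
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Illustrative URLs only; none of these results come from a real crawl.
    for url in (
        "http://intranet.canterbury.ac.nz/",
        "http://intranet.canterbury.ac.nz",
        "http://intranet.canterbury.ac.nz/logo.gif",
        "http://www.canterbury.ac.nz/",
    ):
        print(url, "accepted" if accepts(url) else "rejected")
```

Under these assumptions, the accept pattern only matches when the host is followed by a slash, so the seed exactly as written in urls/urls.txt (no trailing slash) falls through to the final `-.` rule. Whether Nutch normalises the seed to add that slash before filtering is not something this sketch checks.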


urls/urls.txt
--------------------------------------------
http://intranet.canterbury.ac.nz

--------------------------------------------
--------------------------------------------

Regards
Rochelle Rees
Web Team, Student Recruitment and Development (SRD)
University of Canterbury, Te Whare Wananga o Waitaha
Rm: 419, Law Building
+64-3-364 2987 Ext: 6125
rochelle.rees@(protected)
http://www.canterbury.ac.nz/

For all web enquiries please contact:
websupport@(protected)
http://www.canterbury.ac.nz/web
