Mailing list: nutch-user.lucene

unable to correctly fetch https pages

POIRIER David

2008-05-15

Hello,



I'm trying to fetch a web site over the https protocol. I'm using Nutch
version 0.9 and I have activated the protocol-httpclient plugin. The Hadoop
logs are set to debug level.



When checking the logs, I can see that my seed URL seems to be fetched:

"2008-05-15 12:19:04,341 INFO fetcher.Fetcher - fetching https://www.aWebSite.xyz/aPage.htm"



But none of the links on this page are actually found, and the process
finally crashes with the following error:



java.lang.ArrayIndexOutOfBoundsException: -1
       at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
       at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
       at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
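
The trace shows an array access with index -1 inside MultiReader.isDeleted. As a purely illustrative sketch (this is not the Lucene source; the class MultiSegmentSketch and its methods are hypothetical), here is one way such an index can arise: a reader that spans zero segments has no segment to return for a document id, so a "last segment starting at or before this id" lookup yields -1. Whether this is what actually happens in this crawl is an assumption, although it would be consistent with no documents having been indexed.

```java
// Hypothetical, simplified model of a reader spanning several index segments.
// NOT the Lucene implementation; all names here are assumptions for illustration.
class MultiSegmentSketch {
    private final boolean[][] deleted; // deleted[i][j]: is document j of segment i deleted?
    private final int[] starts;        // starts[i]: global id of the first document in segment i

    MultiSegmentSketch(boolean[][] deleted) {
        this.deleted = deleted;
        this.starts = new int[deleted.length];
        int next = 0;
        for (int i = 0; i < deleted.length; i++) {
            starts[i] = next;
            next += deleted[i].length;
        }
    }

    // Binary search for the last segment whose start is <= docId.
    // Returns -1 when there are no segments at all.
    int segmentIndex(int docId) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docId < starts[mid]) {
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return hi;
    }

    // With zero segments, segmentIndex returns -1 and the array access below
    // throws ArrayIndexOutOfBoundsException.
    boolean isDeleted(int docId) {
        int i = segmentIndex(docId);
        return deleted[i][docId - starts[i]];
    }

    public static void main(String[] args) {
        MultiSegmentSketch empty = new MultiSegmentSketch(new boolean[0][]);
        try {
            empty.isDeleted(0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Empty reader failed: " + e);
        }
    }
}
```

If this model applies, the crash in the duplicate-deletion step would be a symptom of an empty index rather than the root cause, and the question of why no links were extracted from the https page would remain open.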



I am 100% certain that the links on the seed page are not excluded
by the regex rules used by the urlfilter-regex plugin. I also tried
the urlfilter-prefix and urlfilter-suffix plugins, with no more
luck.
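
One way to double-check a claim like this outside of Nutch is a small standalone test of the rules against the URLs in question. The sketch below assumes the usual conventions of a regex URL filter file: each rule is a line starting with + (accept) or - (reject) followed by a regular expression, lines starting with # are comments, the first rule that matches anywhere in the URL decides, and a URL matching no rule is rejected. The class UrlFilterSketch and the rules shown are hypothetical, and this is not the plugin's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a +/- regex rule file; not the Nutch plugin itself.
class UrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = t.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(t.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "https://www.aWebSite.xyz/aPage.htm";

        // A rule written for http only does not match an https URL.
        UrlFilterSketch httpOnly = new UrlFilterSketch(Arrays.asList(
                "+^http://www\\.aWebSite\\.xyz/"));
        System.out.println("http-only rule accepts: " + httpOnly.accepts(url));

        // Allowing an optional "s" covers both schemes.
        UrlFilterSketch both = new UrlFilterSketch(Arrays.asList(
                "# skip images",
                "-\\.(gif|jpg|png)$",
                "+^https?://www\\.aWebSite\\.xyz/"));
        System.out.println("http/https rule accepts: " + both.accepts(url));
    }
}
```

The http-only case is included because an accept rule anchored on "^http://" silently rejects every https link, which is easy to overlook when the same rules previously worked for a plain http site.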



I found that a bug (NUTCH-593) producing the same error was fixed by
Andrzej Bialecki in February. Could this fix help me? What is the
easiest way for me to get it without switching to the complete,
and I guess still unstable, version 1.0 of Nutch?



Thanks,



David
