I'm trying to fetch a web site over HTTPS. I'm using Nutch version 0.9 with the protocol-httpclient plugin activated, and Hadoop logging is set to debug level.
When checking the logs I can see that my seed URL seems to be fetched:
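For reference, activating protocol-httpclient usually means overriding the plugin.includes property in conf/nutch-site.xml, roughly as sketched below. Only the protocol-httpclient entry comes from the description above; the rest of the plugin list is an assumption based on the stock defaults and may differ from the actual setup.

```xml
<!-- conf/nutch-site.xml: a sketch, not the poster's actual file. -->
<!-- Every plugin other than protocol-httpclient is assumed from the stock defaults. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-httpclient replaces protocol-http so that https:// URLs can be fetched -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin ids, so listing both protocol-http and protocol-httpclient would load both plugins.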
"2008-05-15 12:19:04,341 INFO fetcher.Fetcher - fetching https://www.aWebSite.xyz/aPage.htm"
But none of the links on this page are actually found, and the process eventually crashes with the following error:
I am 100% certain that the links on the seed page are not excluded by the regex rules used by the urlfilter-regex plugin. I also tried the urlfilter-prefix and urlfilter-suffix plugins, with no better luck.
I found that a bug (NUTCH-593) producing the same error was fixed by Andrzej Bialecki in February. Could this fix help me? What is the easiest way to get it without moving to the complete, and I guess still unstable, version 1.0 of Nutch?