Hi all,
I find that writing URLs in certain ways often causes Nutch to fail,
even though the same URLs work fine in the browser and when I open an
HttpURLConnection to them from Java. For example:
1. The URL http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates/
works fine when I index it with Nutch, but writing it as
http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates
(without the trailing slash) causes it to fail.
2. The URL http://www.go2linux.org/fedora-centos-root-password-recovery
gets crawled and indexed properly, whereas the URL
http://www.go2linux.org/fedora-centos-root-password-recovery/
(with a trailing slash) fails.
As I mentioned, all of these work fine when I open an
HttpURLConnection to them from Java. Is there a simple patch I can use
for cases like this?
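A check along these lines could be sketched as follows. This is only an illustration, not necessarily the exact code behind the observation above; the class and method names (TrailingSlashCheck, toggleTrailingSlash, statusOf) are made up. One thing worth noting is that HttpURLConnection follows redirects by default, so the sketch turns that off to show the raw status code for each variant. Whether a redirect is actually what makes Nutch fail here is an assumption, not something confirmed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class, for illustration only.
final class TrailingSlashCheck {

    /** Adds a trailing slash if absent, removes it if present (ignores query strings). */
    static String toggleTrailingSlash(String url) {
        return url.endsWith("/") ? url.substring(0, url.length() - 1) : url + "/";
    }

    /** Issues a GET without following redirects and returns the raw HTTP status code. */
    static int statusOf(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // By default redirects are followed silently; disable that so a 301/302 is visible.
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String[] urls = {
            "http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates/",
            "http://www.go2linux.org/fedora-centos-root-password-recovery"
        };
        for (String url : urls) {
            String variant = toggleTrailingSlash(url);
            System.out.println(statusOf(url) + "  " + url);
            System.out.println(statusOf(variant) + "  " + variant);
        }
    }
}
```

Running main needs network access and prints one status line per URL variant; if the two variants report different status codes, that difference would be the first thing to compare against the Nutch fetch logs.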
In addition, it appears that Nutch does some simple URL
normalization, such as adding a slash to the end of a bare domain name.
Is it easy to call Nutch's URLNormalizer independently of the crawling
and indexing process? A pointer to the class/method would be very
useful.
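To make concrete what "simple normalization" means here, the sketch below uses only java.net.URI. It is not Nutch's implementation and does not call any Nutch class; the name SimpleUrlNormalizer is hypothetical. It lowercases the scheme and host, resolves "." and ".." path segments, and adds "/" when the path is empty. It deliberately leaves a trailing slash on a non-empty path alone, since the two forms may name different resources.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical stand-in for illustration; NOT the Nutch normalizer.
final class SimpleUrlNormalizer {

    /** Returns a lightly normalized form of the URL, or the input if it has no scheme or host. */
    static String normalize(String url) {
        URI uri = URI.create(url).normalize(); // resolves "." and ".." segments
        if (uri.getScheme() == null || uri.getHost() == null) {
            return url; // relative or opaque URI: leave untouched
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // bare domain name gets a trailing slash
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme().toLowerCase(Locale.ROOT))
          .append("://")
          .append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.techcrunch.com"));
        System.out.println(normalize("http://www.go2linux.org/fedora-centos-root-password-recovery"));
    }
}
```

With this sketch a bare domain gains a slash, while the second URL above comes back unchanged, so normalization of this kind alone would not reconcile the slash and no-slash variants of a deeper path.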
Thanks,
Vijay
http://www.cs.stanford.edu/~vijayk