Mailing list: nutch-user.lucene

Handling certain URLs in Nutch possibly with appropriate normalization?

Vijay Krishnan

2008-05-14

Hi all,

  I find that writing URLs in certain ways often causes Nutch to fail,
even though the same URLs work fine in the browser and also when I open
an HttpURLConnection to them from Java. For example:

1. The URL http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates/
works fine when I try to index it using Nutch, but writing it as
http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates
(without the trailing slash) causes it to fail.

2. The URL http://www.go2linux.org/fedora-centos-root-password-recovery
gets crawled and indexed properly, whereas the URL
http://www.go2linux.org/fedora-centos-root-password-recovery/
(with a trailing slash) fails.

  As I mentioned, all of these work fine when I open an
HttpURLConnection to them from Java. Is there a simple patch I can use
for cases like this?
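
  A minimal sketch of the kind of Java check described above, for
comparing the two spellings of a URL side by side. The class and method
names are made up for illustration. It turns off automatic redirect
following (HttpURLConnection follows redirects by default), so a 301/302
between the slash and no-slash forms would show up in the output.
Whether a redirect is actually what trips up Nutch here is an
assumption, not something confirmed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Nutch.
class SlashVariantCheck {

    // Returns the same URL with its trailing slash added or removed.
    // Naive on purpose: it does not account for query strings or fragments.
    static String toggleTrailingSlash(String url) {
        return url.endsWith("/") ? url.substring(0, url.length() - 1) : url + "/";
    }

    // Requests only the headers and reports the status code and any
    // Location header, without following redirects.
    static String describe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false); // redirects are followed by default
        conn.setRequestMethod("HEAD");          // some servers may only answer GET properly
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            return url + " -> " + status
                    + (location != null ? " (Location: " + location + ")" : "");
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0
                ? args[0]
                : "http://www.go2linux.org/fedora-centos-root-password-recovery";
        System.out.println(describe(url));
        System.out.println(describe(toggleTrailingSlash(url)));
    }
}
```

  Running it with one of the URLs above as the argument prints the
status for both spellings, which should at least show whether the server
treats them differently.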

  In addition, it appears that Nutch does some simple URL
normalization, such as adding a slash to the end of a bare domain name.
Is it easy to call Nutch's URLNormalizer independently of the crawling
and indexing process? A pointer to the class/method would be very
useful.
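
  To make "simple URL normalization" concrete, here is a rough
plain-Java sketch of the behaviour described: lower-casing the scheme
and host and adding a slash when the path is empty. It is not Nutch's
implementation and uses only java.net.URI; the class name is made up
for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical stand-in for illustration; not Nutch's normalizer.
class SimpleUrlNormalizer {

    // Lower-cases scheme and host and turns an empty path into "/".
    // Everything else (port, path, query, fragment) is passed through as-is.
    static String normalize(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on bad syntax
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return url; // not a host-based absolute URL, leave untouched
        }

        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder sb = new StringBuilder();
        sb.append(scheme.toLowerCase(Locale.ROOT)).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(host.toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.techcrunch.com"));
        System.out.println(normalize("http://www.go2linux.org/fedora-centos-root-password-recovery"));
    }
}
```

  With this sketch, http://www.techcrunch.com becomes
http://www.techcrunch.com/, while a URL with a non-empty path is left
alone, so the slash and no-slash spellings in the two examples above
would still be treated as different URLs.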


Thanks,
Vijay
http://www.cs.stanford.edu/~vijayk