nutch-user.lucene

Doublets

Detlef Müller-Solger

2008-10-08

Hi,

in Germany it is reported that one big showstopper for Nutch is the
fact that identical web pages can often be reached through
different URLs. For example, by requesting

www.xyz.de/information
or by
www.xyz.de/information/
or by
www.xyz.de/information/index

From my point of view, because of the different URLs, Nutch unfortunately
indexes such a web page three times. Is there a method to avoid
indexing these duplicates? For example, by comparing all information of
the web page excluding the URL.

Note: A filter such as "generally strip "/index" from the URL" is no solution,
because in other cases within the same run "/index" may be needed, or the
same web page may also be addressable through other URL syntax.
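
To make the idea of "comparing all information of the web page excluding the URL" concrete, here is a minimal sketch in plain Java. It does not use Nutch's API; the class ContentDeduplicator and its methods are hypothetical names chosen for illustration, and it assumes that exact, byte-for-byte identical page content is what counts as a duplicate.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: decides whether a page should be
// indexed by looking only at its content, never at its URL.
class ContentDeduplicator {

    // Maps the hex digest of a page's content to the first URL seen with it.
    private final Map<String, String> firstUrlByDigest = new HashMap<>();

    // Computes an MD5 digest of the page content as a lowercase hex string.
    static String digest(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if this content has not been seen before (index it),
    // false if an identical page was already recorded under another URL.
    boolean shouldIndex(String url, String content) {
        return firstUrlByDigest.putIfAbsent(digest(content), url) == null;
    }

    // Returns the first URL under which this content was seen, or null.
    String firstUrlFor(String content) {
        return firstUrlByDigest.get(digest(content));
    }
}
```

With this sketch, calling shouldIndex for www.xyz.de/information, www.xyz.de/information/ and www.xyz.de/information/index with the same content would accept only the first call, while a different page that happens to end in "/index" is unaffected. Pages that differ only slightly (for example by a timestamp) would still be treated as distinct; catching those would need a fuzzier signature than an exact hash.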

Thanks

Detlef Müller-Solger