nutch-user.lucene

Duplicate pages in result of queries

vishal vachhani

2008-09-21


Hi,

Is this a bug, or am I missing something?

I have crawled many URLs using Nutch 0.9. When I query the index created
by the crawl, some of the results are duplicates.

How does Nutch decide that two URLs are duplicates? Is it by URL string
matching, or is it based on the content of the pages?

For example, the content of these pages is the same, but the URLs differ
because of "/", "//" and "///" after the host name.

http://www.indianholiday.com/india-wildlife-holidays/index.html
                            ^
http://www.indianholiday.com//india-wildlife-holidays/index.html
                            ^^
http://www.indianholiday.com///india-wildlife-holidays/index.html
                            ^^^

Any idea how to remove this kind of duplicate page from the crawl?
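
To make the question concrete, here is a minimal sketch in plain Java of the kind of normalization that would make the three URLs above compare equal: it collapses runs of "/" in the path only. This is not Nutch code and not a claim about how Nutch 0.9 behaves. The class and method names (SlashCollapser, collapseSlashes) are hypothetical, and it assumes the input is a syntactically valid URL.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
final class SlashCollapser {

    // Collapses runs of '/' in the path; scheme, host, query and fragment are left untouched.
    static String collapseSlashes(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on invalid input
        String path = uri.getRawPath();
        if (path == null) {
            return url; // opaque URI such as mailto:, nothing to collapse
        }
        String cleaned = path.replaceAll("/{2,}", "/");

        // Rebuild from the raw components so existing percent-escapes are not re-encoded.
        StringBuilder sb = new StringBuilder();
        if (uri.getScheme() != null) sb.append(uri.getScheme()).append(':');
        if (uri.getRawAuthority() != null) sb.append("//").append(uri.getRawAuthority());
        sb.append(cleaned);
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.indianholiday.com/india-wildlife-holidays/index.html",
            "http://www.indianholiday.com//india-wildlife-holidays/index.html",
            "http://www.indianholiday.com///india-wildlife-holidays/index.html"
        };
        Set<String> unique = new LinkedHashSet<>();
        for (String u : urls) {
            unique.add(collapseSlashes(u));
        }
        System.out.println(unique.size()); // prints 1
    }
}
```

If something like this were applied to every URL before it is stored, the three variants would end up as a single entry. Whether Nutch can be configured to do that, and where, is what I would like to know.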

Thanks in advance!!

--
Thanks and Regards,
Vishal Vachhani