Re: pages with duplicate content in search results

Andrzej Bialecki

2008-09-25

Dennis Kubes wrote:
> If you are using more than one index then dedup will not work across
> indexes.

This is incorrect. DeleteDuplicates works just fine with multiple
indexes, assuming you process all indexes in the same run of
DeleteDuplicates, so that it has a global view of all input indexes.

> A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates. The dedup process works on url
> and byte hash. If the content is even 1 byte different, it doesn't work.

This depends on the implementation of Signature. Indeed, the default
MD5Signature works this way.
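
To illustrate the point about exact hashing, here is a minimal sketch in
plain Java. It is not Nutch's Signature code; the class name
ExactHashSketch and the helper md5Hex are made up for this example. It
only shows that an MD5 digest over the raw bytes changes as soon as a
single byte changes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: hypothetical class, not part of Nutch.
class ExactHashSketch {

    // Returns the MD5 digest of the raw content bytes as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] a = "Hello, Nutch!".getBytes(StandardCharsets.UTF_8);
        byte[] b = "Hello, Nutch?".getBytes(StandardCharsets.UTF_8);
        // Identical bytes give the same signature.
        System.out.println(md5Hex(a).equals(md5Hex(a)));
        // One differing byte gives a different signature.
        System.out.println(md5Hex(a).equals(md5Hex(b)));
    }
}
```

With a signature like this, two pages count as duplicates only if their
content is byte-for-byte identical.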

>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet.

Well, the existing TextProfileSignature can be used as a form of (crude)
near-duplicate detection, precisely because it is tolerant to small
changes in the input text.
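
For comparison, here is a rough sketch of the general idea behind a
profile-based signature. It is only loosely modeled on that idea and is
not the actual TextProfileSignature code, which differs in its details.
The class name TextProfileSketch and the parameters minTokenLen and
quant are assumptions made up for this example. The text is reduced to
a profile of its frequent words, and only that profile is hashed, so
small edits to rare words, case, or punctuation do not change the
result.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical class, not Nutch's TextProfileSignature.
class TextProfileSketch {

    // Hashes a coarse word-frequency profile of the text instead of its raw bytes.
    // minTokenLen and quant are hypothetical tuning knobs for this sketch.
    static String profileSignature(String text, int minTokenLen, int quant) {
        // Count lowercase alphanumeric tokens, skipping very short ones.
        // A TreeMap keeps the tokens in a stable, sorted order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.length() >= minTokenLen) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        // Round each count down to a multiple of quant; tokens that round
        // to zero (the rare ones) are left out of the profile.
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q > 0) {
                profile.append(e.getKey()).append(' ').append(q).append('\n');
            }
        }
        return md5Hex(profile.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Returns the MD5 digest of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "Nutch crawls the web. Nutch indexes the web. Updated today.";
        String b = "Nutch crawls the web; Nutch indexes the WEB! Updated yesterday.";
        // The texts differ in punctuation, case, and one rare word,
        // yet their frequent-word profiles are the same.
        System.out.println(profileSignature(a, 3, 2).equals(profileSignature(b, 3, 2)));
    }
}
```

The tolerance is crude: texts whose frequent words differ still get
different signatures, and unrelated texts that happen to share a
profile would collide.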


--
Best regards,
Andrzej Bialecki   <><
___. ___ ___ ___ _ _  __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com