nutch-user.lucene

Re: pages with duplicate content in search results

Andrzej Bialecki

2008-09-25

Edward Quick wrote:
>
>> Dennis Kubes wrote:
>>> If you are using more than one index then dedup will not work across
>>> indexes.
>> This is incorrect. DeleteDuplicates works just fine with multiple
>> indexes, assuming you process all indexes in the same run of
>> DeleteDuplicates, so that it has a global view of all input indexes.
>>
>>> A single index should dedup correctly unless the pages are not
>>> exact duplicates but near duplicates. The dedup process works on url
>>> and byte hash. If the content is even 1 byte different, it doesn't work.
>> This depends on the implementation of Signature. Indeed, the default
>> MD5HashSignature works this way.
>>
>>> Near duplicate detection is another set of algorithms that haven't been
>>> implemented in Nutch yet.
>> Well, the existing TextProfileSignature can be used as a form of (crude)
>> near-duplicate detection, precisely because it is tolerant to small
>> changes in the input text.
>
> Thanks Andrzej.
> How do you tell Nutch to use the TextProfileSignature instead of MD5HashSignature for deduplicating?

See the following property in your nutch-site.xml: db.signature.class.
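
As a rough illustration, the override in nutch-site.xml might look like the fragment below. The fully qualified class name is an assumption based on the org.apache.nutch.crawl package used by Nutch releases of this period; check the db.signature.class entry in your nutch-default.xml for the exact default value and package before copying it.

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml. -->
<!-- Assumed class name; verify against your Nutch version. -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```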

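To make the difference discussed in the quoted thread concrete, here is a minimal Python sketch contrasting an exact byte hash with a crude text-profile signature. It is not the Nutch implementation: the function names, the tokenizer, and the quantization rule are all assumptions chosen for illustration, loosely modelled on the idea of hashing a profile of quantized term frequencies.

```python
import hashlib
import re
from collections import Counter


def exact_signature(content: bytes) -> str:
    """MD5 over the raw bytes: a single changed byte gives a new signature."""
    return hashlib.md5(content).hexdigest()


def profile_signature(text: str, min_token_len: int = 2,
                      quant_rate: float = 0.01) -> str:
    """Hash of a quantized term-frequency profile (illustrative only)."""
    # Hypothetical tokenizer: lowercase alphanumeric runs, short tokens dropped.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    max_freq = max(counts.values())
    # Quantization step: at least 2 when any token repeats, so that
    # tokens seen only once fall out of the profile.
    quant = max(round(max_freq * quant_rate), 2 if max_freq > 1 else 1)

    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized > 0:
            profile.append((token, quantized))

    # Sort so the signature does not depend on token order in the page.
    profile.sort(key=lambda item: (-item[1], item[0]))
    encoded = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(encoded.encode("utf-8")).hexdigest()


# Two pages that differ only in a timestamp, and one unrelated page.
page_a = "Nutch crawls the web. Nutch indexes the web. Updated 10:01"
page_b = "Nutch crawls the web. Nutch indexes the web. Updated 10:02"
page_c = "Lucene scores documents. Lucene ranks documents."
```

With these inputs the exact hashes of page_a and page_b differ, while their profile signatures match, because the one-off tokens that changed are dropped by quantization. The unrelated page_c gets a different profile signature. The trade-off is that a tolerant signature can also merge pages a reader would consider distinct, which is why the thread calls it a crude form of near-duplicate detection.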

--
Best regards,
Andrzej Bialecki   <><
___. ___ ___ ___ _ _  __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
