Edward Quick wrote:
>
>> Dennis Kubes wrote:
>>> If you are using more than one index then dedup will not work across
>>> indexes.
>> This is incorrect. DeleteDuplicates works just fine with multiple
>> indexes, assuming you process all indexes in the same run of
>> DeleteDuplicates, so that it has a global view of all input indexes.
>>
>>> A single index should dedup correctly unless the pages are not
>>> exact duplicates but near duplicates. The dedup process works on url
>>> and byte hash. If the content is even 1 byte different, it doesn't work.
>> This depends on the implementation of Signature. Indeed, the default
>> MD5HashSignature works this way.
>>
>>> Near duplicate detection is another set of algorithms that haven't been
>>> implemented in Nutch yet.
>> Well, the existing TextProfileSignature can be used as a form of (crude)
>> near-duplicate detection, precisely because it is tolerant to small
>> changes in the input text.
>
> Thanks Andrzej.
> How do you tell Nutch to use the TextProfileSignature instead of MD5HashSignature for deduplicating?
See the following property in your nutch-site.xml: db.signature.class.
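
A minimal sketch of that override, assuming the implementation lives at org.apache.nutch.crawl.TextProfileSignature (the package name is an assumption on my part; check nutch-default.xml in your release for the exact class names and the current default):

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml. -->
<!-- Assumed fully qualified class name; verify it against your Nutch version. -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```

Signatures already stored for previously processed pages were presumably computed with the old class, so expect the change to apply only to pages processed after the switch.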
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com