Dennis Kubes wrote:
> If you are using more than one index then dedup will not work across
> indexes.
This is incorrect. DeleteDuplicates works just fine with multiple
indexes, assuming you process all indexes in the same run of
DeleteDuplicates, so that it has a global view of all input indexes.
> A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates. The dedup process works on url
> and byte hash. If the content is even 1 byte different, it doesn't work.
This depends on the implementation of Signature. Indeed, the default
MD5Signature works this way: it hashes the raw content bytes, so any
difference in the bytes produces a different signature.
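A small Python sketch of the byte-hash behaviour, using only the standard hashlib module. The function name `byte_hash_signature` and the sample pages are made up for the example; this is not the Nutch implementation.

```python
import hashlib

def byte_hash_signature(content: bytes) -> str:
    # MD5 over the raw bytes: any single-byte change alters the digest.
    return hashlib.md5(content).hexdigest()

page_a = b"<html><body>Hello, Nutch!</body></html>"
page_b = b"<html><body>Hello, Nutch?</body></html>"  # one byte differs

# Identical bytes give the same signature.
print(byte_hash_signature(page_a) == byte_hash_signature(page_a))
# A one-byte difference gives a different signature, so the pages
# are not treated as duplicates.
print(byte_hash_signature(page_a) == byte_hash_signature(page_b))
```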
>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet.
Well, the existing TextProfileSignature can be used as a form of (crude)
near-duplicate detection, precisely because it is tolerant to small
changes in the input text.
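To show the general idea of a signature that tolerates small changes, here is a simplified Python sketch of a word-frequency profile hash. It is only loosely modelled on the approach and is not the actual TextProfileSignature code: the function name, the tokenization, and the parameter defaults (`quant_rate`, `min_token_len`) are assumptions for illustration.

```python
import hashlib
import re
from collections import Counter

def profile_signature(text: str, quant_rate: float = 0.01,
                      min_token_len: int = 2) -> str:
    # Lowercase, split into alphanumeric tokens, drop very short ones.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token (assumed rule).
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and drop tokens
    # that fall below it, so rare words do not affect the profile.
    profile = {}
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:
            profile[token] = quantized

    # Stable ordering: by descending frequency, then alphabetically.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    text_profile = "\n".join(f"{tok} {freq}" for tok, freq in ordered)
    return hashlib.md5(text_profile.encode("utf-8")).hexdigest()

original = "nutch crawl nutch index nutch crawl index search search"
edited = original + " updated"          # one extra, rare word
other = "hadoop map hadoop reduce"      # genuinely different text

print(profile_signature(original) == profile_signature(edited))
print(profile_signature(original) == profile_signature(other))
```

The extra word in `edited` occurs only once, so it is quantized away and both texts get the same signature, whereas a byte hash of the two would differ. This is also why the method is crude: it ignores word order and anything rare.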
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com