Java Mailing List Archive

nutch-user.lucene

Re: pages with duplicate content in search results

Dennis Kubes

2008-09-25

If you are using more than one index, dedup will not work across
indexes. A single index should dedup correctly unless the pages are
near duplicates rather than exact duplicates. The dedup process matches
on URL and on a byte-level hash of the content. If the content differs
by even one byte, the hashes differ and the pages are not deduplicated.
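
To illustrate why a one-byte difference defeats hash-based dedup, here is a minimal sketch. It is not Nutch's own code: the class name ExactHashDedup, its methods, and the choice of MD5 are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of exact-duplicate detection by content hash.
// MD5 is used only as an example digest; it is an assumption, not a
// statement about what Nutch uses internally.
public class ExactHashDedup {

    // Hash the raw bytes of the page content and return the digest as hex.
    static String contentHash(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Two pages count as exact duplicates only if their hashes are equal.
    static boolean isExactDuplicate(String a, String b) {
        return contentHash(a).equals(contentHash(b));
    }

    public static void main(String[] args) {
        String page = "Item Document :- Client - TeraTerm Pro";
        // Identical bytes: same hash.
        System.out.println(isExactDuplicate(page, page));
        // One extra trailing space: different hash, so no dedup.
        System.out.println(isExactDuplicate(page, page + " "));
    }
}
```

Any per-page variation in the fetched bytes, even an invisible one, produces a different hash, which is why near duplicates survive this kind of dedup.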

Near-duplicate detection is a separate set of algorithms that hasn't been
implemented in Nutch yet. On the query side you can set hitsPerSite
to 1, which should limit your search results to one hit per site.
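
As a rough picture of what a per-site limit does to a ranked result list, here is a small sketch. It is not the Nutch search API: the class HitsPerSiteSketch and the method limitPerSite are hypothetical, and it assumes that "site" means the URL's host name.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep at most hitsPerSite results per host,
// preserving the original ranking order. Not Nutch's implementation.
public class HitsPerSiteSketch {

    static List<String> limitPerSite(List<String> rankedUrls, int hitsPerSite) {
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : rankedUrls) {
            String host = URI.create(url).getHost();
            // Fall back to the full URL if no host can be parsed.
            String key = (host == null) ? url : host;
            int count = seenPerHost.merge(key, 1, Integer::sum);
            if (count <= hitsPerSite) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The first two URLs are the ones from the quoted message below;
        // the third is a made-up URL on a different host.
        List<String> hits = Arrays.asList(
            "http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument",
            "http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument",
            "http://www.example.org/page");
        for (String url : limitPerSite(hits, 1)) {
            System.out.println(url);
        }
    }
}
```

With a limit of 1, the second somedomain.com hit is dropped from the displayed results. Note that this only hides the duplicate at query time; both documents remain in the index.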

Dennis

Edward Quick wrote:
> Hi,
>
> Even though I ran nutch dedup on my index, I still have pages with different URLs but exactly the same content (see the search result example below). From what I have read about dedup, this shouldn't happen, as it deletes the URL with the lowest score. Is there anything else I can try to get rid of these?
>
> Thanks,
> Ed.
>
> Item Document :- Client - TeraTerm Pro
> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards Online  Employee Self Service     ESS Home ... Description Document   Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
> http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
>
>
>
> Item Document :- Client - TeraTerm Pro
> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards Online  Employee Self Service     ESS Home ... Description Document   Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
> http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
>