Home » nutch-user.lucene »

RE: pages with duplicate content in search results

Edward Quick

2008-09-25



> >
> > Dennis,
> >         I am facing the same problem: in my crawl, the content of some urls is
> > the same but the urls are different. Could you please tell me how I can set
> > hitsPerSite to 1?
>
> I changed hitsPerSite to 0 in the search.jsp (to get rid of the 'show all hits' button). It might be possible to set this in the web.xml or nutch-site.xml though?
>
> >
> > --Vishal
> >
> > On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <kubes@(protected)> wrote:
> >
> > > If you are using more than one index then dedup will not work across
> > > indexes. A single index should dedup correctly unless the pages are not
> > > exact duplicates but near duplicates. The dedup process works on url and
> > > byte hash. If the content is even 1 byte different, it doesn't work.
>
>
> I only have one index, and have only crawled one site, which is the intranet at my work.
> The pages definitely seem to be identical. I saved the source from both pages and the sizes were exactly the same too.

Also, just to add to this, I checked the index with Luke, which shows the two urls below with the same titles but different timestamps, digests and boosts. :-(
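
The differing digests matter because dedup compares a hash of the raw bytes, and two pages of identical size can still differ in a few bytes. Below is a minimal Java sketch of that behaviour, assuming an MD5-style hash over the page content. The class name, method name and sample strings are illustrative only and are not Nutch code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestDemo {
    // Hex-encoded MD5 of the raw bytes; a change to any single byte changes the digest.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page fragments: same length, but a different ID inside a link
        byte[] a = "<a href=\"/view-8918e/doc\">TeraTerm Pro</a>".getBytes(StandardCharsets.UTF_8);
        byte[] b = "<a href=\"/view-dacff/doc\">TeraTerm Pro</a>".getBytes(StandardCharsets.UTF_8);

        System.out.println(a.length == b.length);                 // sizes match
        System.out.println(md5Hex(a).equals(md5Hex(a.clone())));  // identical bytes, same digest
        System.out.println(md5Hex(a).equals(md5Hex(b)));          // same size, different digest
    }
}
```

If the real pages embed part of their own URL anywhere in the markup, that would explain equal sizes with different digests. Diffing the two saved sources would confirm or rule this out.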

>
>
> > >
> > > Near duplicate detection is another set of algorithms that haven't been
> > > implemented in Nutch yet. On the query side you can set hitsPerSite to
> > > 1 and it should limit your search results.
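
What a per-site hit limit does to a ranked result list can be sketched as follows. This is only an illustration of the idea in plain Java, not the Nutch search code: the class name, method name and example URLs are hypothetical, and treating 0 as "no limit" is an assumption based on how 0 is used elsewhere in this thread.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerSiteLimit {
    // Keeps at most maxPerSite URLs per host, preserving the ranked order.
    // A value of 0 or less is treated as "no limit" (assumption).
    static List<String> limitPerSite(List<String> rankedUrls, int maxPerSite) {
        if (maxPerSite <= 0) {
            return new ArrayList<>(rankedUrls);
        }
        Map<String, Integer> hitsPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : rankedUrls) {
            String host = URI.create(url).getHost();
            // Count this hit for its host and keep it only while under the limit
            int count = hitsPerHost.merge(host, 1, Integer::sum);
            if (count <= maxPerSite) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical ranked results: two hits on one host, one on another
        List<String> ranked = Arrays.asList(
            "http://www.somedomain.com/a?OpenDocument",
            "http://www.somedomain.com/b?OpenDocument",
            "http://other.example.org/c");
        System.out.println(limitPerSite(ranked, 1)); // one hit per host
        System.out.println(limitPerSite(ranked, 0)); // all hits
    }
}
```

Note that a limit of 1 hides every additional hit from the same host, not only duplicates, so on a single-site crawl it would collapse the results to one entry.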
> > >
> > > Dennis
> > >
> > >
> > > Edward Quick wrote:
> > >
> > >> Hi,
> > >>
> > >> Even though I ran nutch dedup on my index, I still have pages with
> > >> different urls but exactly the same content (see search result example
> > >> below). From what I read up on dedup this shouldn't happen though as it
> > >> deletes the url with the lowest score. Is there anything else I can try to
> > >> get rid of these?
> > >>
> > >> Thanks,
> > >> Ed.
> > >>
> > >> Item Document :- Client - TeraTerm Pro
> > >> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
> > >> Online  Employee Self Service     ESS Home ... Description Document
> > >> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
> > >> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
> > >> printing or keymapping is an issue, TeraTerm ...
> > >>
> > >> http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
> > >>
> > >>
> > >>
> > >> Item Document :- Client - TeraTerm Pro
> > >> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
> > >> Online  Employee Self Service     ESS Home ... Description Document
> > >> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
> > >> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
> > >> printing or keymapping is an issue, TeraTerm ...
> > >>
> > >> http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
> > >>
> > >
>
