The default is set in search.jsp, lines 116-119, and can be overridden with the hitsPerSite request parameter:
int hitsPerSite = 2; // max hits per site
String hitsPerSiteString = request.getParameter("hitsPerSite");
if (hitsPerSiteString != null)
  hitsPerSite = Integer.parseInt(hitsPerSiteString);
Hope that helps.
Dennis
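
As a side note to the snippet above: Integer.parseInt throws a NumberFormatException when the parameter is not a number. The following is only a minimal sketch of a more defensive parse, not what search.jsp actually does. The class name HitsPerSiteParam and the method parse are hypothetical, and treating negative values as invalid is an assumption.

```java
public class HitsPerSiteParam {

    /**
     * Hypothetical helper: returns the parsed value of the raw
     * "hitsPerSite" parameter, or defaultValue when the parameter is
     * missing, not a number, or negative (the last rule is an assumption).
     */
    static int parse(String raw, int defaultValue) {
        if (raw == null) {
            return defaultValue; // parameter not supplied in the request
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value >= 0 ? value : defaultValue;
        } catch (NumberFormatException e) {
            return defaultValue; // e.g. "abc" or an empty string
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(null, 2));  // falls back to the default
        System.out.println(parse("1", 2));   // uses the supplied value
        System.out.println(parse("abc", 2)); // falls back to the default
    }
}
```

In the JSP this would be called with request.getParameter("hitsPerSite") as the first argument. To try the setting without editing the page, it should be enough to append hitsPerSite=1 to the query string of the search URL.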
vishal vachhani wrote:
> Dennis,
> I am facing the same problem: in my crawl, the content of some URLs is the
> same but the URLs are different. Could you please tell me how I can set
> hitsPerSite to 1?
>
> --Vishal
>
> On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <kubes@(protected)> wrote:
>
>> If you are using more than one index then dedup will not work across
>> indexes. A single index should dedup correctly unless the pages are not
>> exact duplicates but near duplicates. The dedup process works on URL and
>> byte hash, so if the content differs by even 1 byte, the pages are not
>> treated as duplicates.
>>
>> Near duplicate detection is another set of algorithms that hasn't been
>> implemented in Nutch yet. On the query side you can set the hitsPerSite to
>> 1 and it should limit your search results.
>>
>> Dennis
>>
>>
>> Edward Quick wrote:
>>
>>> Hi,
>>>
>>> Even though I ran nutch dedup on my index, I still have pages with
>>> different URLs but exactly the same content (see search result example
>>> below). From what I read up on dedup, this shouldn't happen, as it
>>> deletes the URL with the lowest score. Is there anything else I can try to
>>> get rid of these?
>>>
>>> Thanks,
>>> Ed.
>>>
>>> Item Document :- Client - TeraTerm Pro
>>> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
>>> Online Employee Self Service ESS Home ... Description Document
>>> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
>>> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
>>> printing or keymapping is an issue, TeraTerm ...
>>>
>>> http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
>>>
>>>
>>>
>>> Item Document :- Client - TeraTerm Pro
>>> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
>>> Online Employee Self Service ESS Home ... Description Document
>>> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
>>> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
>>> printing or keymapping is an issue, TeraTerm ...
>>>
>>> http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
>>>
>
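
The point in the quoted reply that dedup compares a byte hash, and so misses pages that differ by even one byte, can be illustrated with a small sketch. This is not Nutch's dedup code. The class ByteHashDemo and its methods are hypothetical, and MD5 is assumed here only because the thread says "byte hash" without naming the algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ByteHashDemo {

    /** Returns the MD5 digest of the content bytes as a lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Two pages count as exact duplicates only if their hashes are equal. */
    static boolean exactDuplicate(String pageA, String pageB) {
        return md5Hex(pageA).equals(md5Hex(pageB));
    }

    public static void main(String[] args) {
        String page = "Item Document :- Client - TeraTerm Pro";
        // Identical bytes: same hash, so hash-based dedup can match them.
        System.out.println(exactDuplicate(page, page));
        // One extra byte (a trailing space): different hash, no match.
        System.out.println(exactDuplicate(page, page + " "));
    }
}
```

If two pages look identical in the search results but still survive dedup, one possible explanation is a small difference in the fetched bytes, such as whitespace or an embedded ID, which a hash comparison like this would treat as different content.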