Java Mailing List Archive

http://www.java2.5341.com/

nutch-user.lucene

topN question

ML mail

2008-11-17


Hello,

I am using Nutch to crawl websites under a specific top-level domain. To keep the crawl from running forever, I pass the topN option to the generate command, using the script from http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial

I currently have topN set to 3000, and I expected that with topN at 3000 my index would also grow by 3000 documents per run. This is not the case, and I would like to know if someone can explain why.

For example, I ran the crawl 4 times with topN at 3000, which means a total of 12000 URLs, yet in the end the index grew by only 1911 documents.
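To put numbers on that gap, here is a minimal sketch of the arithmetic using only the figures from this post. The variable names are made up for illustration. It treats 4 x 3000 as an upper bound, since topN limits how many URLs a generate step may select rather than promising that many new documents.

```python
# Figures taken from the crawl described in this post.
rounds = 4            # number of generate/fetch cycles
top_n = 3000          # value passed as topN on each cycle
index_growth = 1911   # new documents observed in the index

# topN caps how many URLs one generate step may select,
# so rounds * top_n is an upper bound, not a guaranteed count.
max_selected = rounds * top_n
indexed_share = index_growth / max_selected

print(f"upper bound on selected URLs: {max_selected}")
print(f"share that became new index documents: {indexed_share:.1%}")
```

With these numbers, only about 16% of the upper bound showed up as new index documents.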

I also noticed in the log files that some URLs get crawled more than once, even though it is exactly the same URL. Why is that? I have not yet reached the db.default.fetch.interval of 30 days; I have only been testing for 2 days.
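To measure how often this happens, a rough sketch like the following could count repeated URLs in the log. It assumes each fetch is logged as a line containing "fetching <url>"; the sample lines and the function name `repeated_fetches` are invented for illustration, so the pattern may need adjusting to the actual log format.

```python
import re
from collections import Counter

# Assumption: each fetch appears in the log as "... fetching <url>".
FETCH_LINE = re.compile(r"fetching (\S+)")

def repeated_fetches(lines):
    """Return {url: count} for URLs found in more than one fetch line."""
    counts = Counter()
    for line in lines:
        match = FETCH_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {url: n for url, n in counts.items() if n > 1}

# Made-up sample lines, for illustration only.
sample = [
    "2008-11-16 10:00:01 INFO fetcher.Fetcher - fetching http://example.org/a",
    "2008-11-16 10:00:02 INFO fetcher.Fetcher - fetching http://example.org/b",
    "2008-11-17 09:30:00 INFO fetcher.Fetcher - fetching http://example.org/a",
]
print(repeated_fetches(sample))
```

The function accepts any iterable of lines, so an open log file can be passed to it directly instead of the sample list.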

Many thanks for any hints.

Regards



   