Hello,
I am using Nutch to crawl websites under a specific top-level domain. To keep the crawl from running forever, I use the topN option of the generate command, with the following script: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial
I now have topN set to 3000, and I expected that with this setting my index would grow by 3000 documents per run. That is not the case, and I would like to know if someone can explain why.
For example, I ran 4 crawl rounds with topN at 3000, which means a total of 12000 URLs, yet the index grew by only 1911 documents.
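To make the numbers concrete, here is a small sketch of the arithmetic behind my expectation. The function and variable names are my own, purely for illustration, and are not Nutch settings; it assumes topN is an upper bound on the URLs selected per round.

```python
# Illustrative arithmetic only; the names below are hypothetical, not Nutch settings.
def crawl_summary(top_n, rounds, index_growth):
    """Return the upper bound on selected URLs and the share that reached the index."""
    # Assumption: each generate round selects at most top_n URLs.
    max_selected = top_n * rounds
    share_indexed = index_growth / max_selected
    return max_selected, share_indexed


max_selected, share_indexed = crawl_summary(top_n=3000, rounds=4, index_growth=1911)
print(f"at most {max_selected} URLs selected, {share_indexed:.1%} ended up in the index")
```

So only a small fraction of what I expected actually ended up in the index.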
I also noticed in the log files that some URLs get crawled more than once, even though it is exactly the same URL. Why is that? I have not even reached the db.default.fetch.interval of 30 days yet; I have only been testing for 2 days.
Many thanks for any hints.
Regards