Hi,
I have some questions because a few things are not that clear to me (I'm a
newbie :P).
I'm currently using Nutch 0.8.1.
first:
bin/nutch inject crawl/crawldb testurl/
bin/nutch generate crawl/crawldb crawl/segments -topN 50 -numFetchers 10
The first command injects the seed URLs into the crawldb.
The second command creates one single segment directly from the seed URLs.
As far as I understand, -numFetchers is deprecated and no longer in use, so it
seems to have no effect.
Now my questions: does generate always create just one single segment, or can
there be more? If there can be more, what does the number of created segments
depend on?
Does one single segment contain multiple fetchlists?
Furthermore: as I understand it, -topN 50 means that if I have, for example,
one single seed URL, and that page is fetched and contains, let's say, 100
outlinks, then only 50 of them will be considered for fetching in the next
iteration. Is this correct?
second:
bin/nutch fetch crawl/segments/segment_number -threads 10
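In case it matters, segment_number above is just whichever segment the last generate run created. A rough sketch of how I pick it, assuming segment directories are named by their creation timestamp so that a plain sort puts the newest one last (latest_segment is only a helper name I made up, it is not part of Nutch):

```shell
# Sketch: print the newest segment directory under a segments dir.
# Assumption: directory names are timestamps, so lexical order == age order.
# latest_segment is a hypothetical helper, not a Nutch command.
latest_segment() {
  ls -d "$1"/* 2>/dev/null | sort | tail -n 1
}

# Intended use (not executed here):
#   segment=$(latest_segment crawl/segments)
#   bin/nutch fetch "$segment" -threads 10
```

If the segments directory is empty, the function prints nothing, so the result should be checked before it is passed to fetch.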
At the moment I assume, and my testing seems to confirm it, that running
bin/nutch generate creates one single segment, and that this single segment
contains several fetchlists. When bin/nutch fetch runs, those fetchlists are
then split across the different threads (example: if the segment contains 2
fetchlists and we have 2 threads, then one thread gets one list and the second
thread gets the other list). Is this right?
Thanks for your help!
Best regards,
Martin