Java Mailing List Archive

List: nutch-user.lucene

question: bin/generate and segments, /bin/fetch

Martin Kammerlander

2008-05-22

Hi

I have some questions because a few things are not that clear to me yet
(<-- newbie :P)

I'm currently using Nutch 0.8.1.

first:

bin/nutch inject crawl/crawldb testurl/
bin/nutch generate crawl/crawldb crawl/segments -topN 50 -numFetchers 10

The first command injects the seed URLs into the crawldb.
The second command creates one single segment from the seed URLs.
As far as I understand, -numFetchers is deprecated and no longer in use, so it
seems to have no effect.

Now my questions: does generate always create just one single segment, or can
there be more? If there can be more, what does the number of created segments
depend on?
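To check this on my side I count the directories under crawl/segments before
and after a generate run. A minimal sketch of how I do that follows; the helper
names count_segments and latest_segment are my own (not part of Nutch), and I
am assuming that segment directories are named by timestamp so that they sort
chronologically.

```shell
# Hypothetical helpers (not part of Nutch) for inspecting a segments directory.

# Print how many segment directories exist directly under $1.
count_segments() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Print the newest segment under $1.
# Assumption: segment names are timestamps, so lexical order is time order.
latest_segment() {
    ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}

# Usage against a real crawl (commented out because it needs a Nutch install):
#   before=$(count_segments crawl/segments)
#   bin/nutch generate crawl/crawldb crawl/segments -topN 50
#   after=$(count_segments crawl/segments)
#   echo "generate created $((after - before)) segment(s)"
```

This only tells me how many segment directories appear, not how many
fetchlists are inside each one.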

Does one single segment contain multiple fetchlists?

Furthermore: -topN 50 means that if I have, for example, one single seed URL,
and that page is fetched and contains, let's say, 100 outlinks, then only 50 of
them will be considered to be parsed in the next iteration. Is this correct?
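To make clear what I mean by "top N", here is a small sketch of the selection
I have in mind. It is only an illustration and not Nutch code: the function
top_n and the "<score> <url>" input format are my own assumptions, and I do not
know whether the generator really ranks candidates this way.

```shell
# Hypothetical illustration (not Nutch code): keep the N best-scored URLs.
# Input on stdin: one "<score> <url>" pair per line.
# $1 is the maximum number of URLs to keep.
top_n() {
    # Sort numerically on the score column, highest first, then cut to N lines
    # and print only the URL column.
    LC_ALL=C sort -k1,1nr | head -n "$1" | awk '{print $2}'
}

# Example: three candidate outlinks, keep the best two.
printf '0.2 http://a.example/\n0.9 http://b.example/\n0.5 http://c.example/\n' | top_n 2
```

With 100 outlinks and -topN 50 I would expect the lower half of such a ranking
to be left out of the next fetchlist, if my understanding is right.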

second:

bin/nutch fetch crawl/segments/segment_number -threads 10

At the moment I assume, based on my testing, that running bin/nutch generate
creates one single segment, and that this single segment contains several
fetchlists. Afterwards, during bin/nutch fetch, those fetchlists are split
across the different threads (example: if the segment contains 2 fetchlists and
we have 2 threads, then one thread gets one list and the second thread gets the
other). Is this right?
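For reference, this is roughly how I run the fetch step so that I do not have
to paste the segment name by hand. It is just a sketch: fetch_latest and the
NUTCH variable are my own names, and it assumes the newest segment is the one
whose timestamp name sorts last.

```shell
# NUTCH can be overridden, e.g. NUTCH=echo for a dry run that only prints
# the command instead of executing it.
NUTCH=${NUTCH:-bin/nutch}

# Hypothetical wrapper: fetch the newest segment under $1 using $2 threads.
fetch_latest() {
    segments_dir=$1
    threads=$2
    # Assumption: segment directories are timestamp-named, so the last one
    # in sorted order is the most recently generated.
    segment=$(ls -d "$segments_dir"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::')
    if [ -z "$segment" ]; then
        echo "no segment found under $segments_dir" >&2
        return 1
    fi
    $NUTCH fetch "$segment" -threads "$threads"
}

# Usage (needs a Nutch install and an existing segment):
#   fetch_latest crawl/segments 10
```

This does not show how the work is divided among the 10 threads, which is the
part I am unsure about.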


thx for your help!

best regards
martin

