Optimizing nutch

Kevin MacDonald

2008-09-13

Hello,

I need to configure Nutch to be as fast as possible while operating on a
single machine. My primary purpose is to dump the link database and analyze
the links leading from each of the URLs that I want crawled. I do not need the
indexing or searching capabilities of Nutch right now.

What I see in the logs is rather strange. I am crawling approximately 3500
URLs to a depth of 1 only. All of the fetching operations complete in just
over 3 minutes, which is about 1000 fetches per minute. That seems very
reasonable. However, following that there are long periods of inactivity.
From the last fetch to when I see "fetcher.Fetcher - Fetcher: done", about 10
minutes elapse with no log activity and the CPU sitting at zero
utilization! It then takes about an additional 5 minutes to update the
CrawlDb. I have tried this using 10 threads and 100 threads. The results are
similar.
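
One way to pin down exactly where the time goes is to measure the silent
stretches in the log directly. The following is only a minimal sketch. It
assumes each log line starts with a log4j-style timestamp such as
"2008-09-13 10:15:32,123"; that format is an assumption, so adjust the pattern
to whatever the log actually contains. The function name find_gaps, the
60-second threshold, and the sample lines are hypothetical and only for
illustration.

```python
import re
from datetime import datetime

# Assumed line prefix, e.g. "2008-09-13 10:15:32,123 INFO  fetcher.Fetcher - ..."
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")


def find_gaps(lines, min_gap_seconds=60):
    """Return (gap_seconds, line_before, line_after) tuples, longest gap first."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip stack traces and other lines without a timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if prev_time is not None:
            gap = (current - prev_time).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((gap, prev_line, line.rstrip()))
        prev_time, prev_line = current, line.rstrip()
    return sorted(gaps, key=lambda g: g[0], reverse=True)


# Hypothetical sample lines; in practice, pass an open log file instead.
sample = [
    "2008-09-13 10:03:10,000 INFO  fetcher.Fetcher - fetching http://example.com/",
    "2008-09-13 10:13:12,000 INFO  fetcher.Fetcher - Fetcher: done",
]

for gap, before, after in find_gaps(sample):
    print(f"{gap:.0f}s of silence before: {after}")
```

Run against the real log, the line just before each long gap shows the last
thing that was logged before the stall, which narrows down which phase is
waiting.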

Can anyone explain what is happening here? What would cause Nutch to sit for
so long doing nothing?

Kevin