Java Mailing List Archive

http://www.java2.5341.com/

nutch-user.lucene

periodically re-crawl several domains with different frequencies

Marcel T

2008-05-07


Hi,
I want to crawl and build an index for several domains using Nutch's intranet crawling method. Since those domains are updated from time to time, I want to re-crawl them periodically, but with different frequencies: for example, domain A every week and domain B every other day. I have two questions:
1) When I crawl into the same directory again, the old index is completely removed. Is there any way to update just the re-crawled URLs in the existing index?
2) How do I set a different crawl frequency for each domain? Should I crawl the domains individually and then merge the results, or can this be configured in Nutch?

Many thanks!
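
To make the requirement in question 2 concrete, here is a minimal sketch of the schedule described above: each domain has its own re-crawl interval, and a domain is due once that interval has elapsed since its last crawl. This is plain Python, not Nutch configuration or the Nutch API; the domain names, the RECRAWL_INTERVALS table, and the domains_due helper are all hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical per-domain re-crawl intervals (illustrative names and values).
RECRAWL_INTERVALS = {
    "domain-a.example": timedelta(days=7),  # domain A: every week
    "domain-b.example": timedelta(days=2),  # domain B: every other day
}


def domains_due(last_crawled, now):
    """Return the domains whose re-crawl interval has elapsed, sorted by name."""
    due = []
    for domain, interval in RECRAWL_INTERVALS.items():
        last = last_crawled.get(domain)
        # A domain that has never been crawled is always due.
        if last is None or now - last >= interval:
            due.append(domain)
    return sorted(due)


if __name__ == "__main__":
    now = datetime(2008, 5, 7)
    last_crawled = {
        "domain-a.example": datetime(2008, 5, 4),  # crawled 3 days ago
        "domain-b.example": datetime(2008, 5, 4),  # crawled 3 days ago
    }
    # Only domain B has exceeded its 2-day interval.
    print(domains_due(last_crawled, now))
```

A scheduler such as cron could run a check like this once a day and start a crawl only for the domains it returns. Whether the per-domain results are then merged or kept as separate indexes is exactly the open point of question 2.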