Java Mailing List Archive

http://www.java2.5341.com/

List: nutch-user.lucene

Deep Searching and whole web searches

John Martyniak

2008-06-11

Hi, I am new to Nutch but have played around with it, and so far I really
like the tool.
I would like to deep crawl a couple of sites and also spider-crawl other
sites, so that the end result is an index with a large portion of several
sites and more of an organic spider of the rest.

I have tried to do this in several ways. I have used the crawl command and
set the depth level etc., which works: I get a valid index and results.

I have also injected the individual URLs of the starting sites into the
crawldb and iterated through the generate/fetch/update sequence. In this
case it covers the whole-web index, but it doesn't seem to add any
additional depth on the starting URLs, which is an issue.
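
To make the generate/fetch/update sequence concrete, here is a toy model of
it in Python. This is not Nutch code: the link graph, the function name
crawl_rounds, and the top_n cap are all hypothetical, and it picks URLs in
discovery order where a real crawler would rank them by score. It only
shows that each round can go one link deeper, and that a small per-round
cap can leave deeper pages of a seed site unfetched.

```python
def crawl_rounds(link_graph, seeds, rounds, top_n):
    """Toy generate/fetch/update loop over an in-memory link graph.

    Returns a dict mapping each fetched URL to the round it was fetched in.
    """
    # Hypothetical stand-in for a crawl database: URL -> round fetched,
    # with None meaning "known but not fetched yet".
    crawldb = {url: None for url in seeds}
    for round_no in range(1, rounds + 1):
        # "generate": select at most top_n unfetched URLs (discovery order).
        segment = [u for u, r in crawldb.items() if r is None][:top_n]
        if not segment:
            break
        for url in segment:
            # "fetch": record the round in which the URL was fetched.
            crawldb[url] = round_no
            # "update": add outlinks that are not yet known.
            for out in link_graph.get(url, []):
                crawldb.setdefault(out, None)
    return {u: r for u, r in crawldb.items() if r is not None}


# Made-up graph: "a" is a seed site with a chain of deeper pages.
graph = {"a": ["a1", "b"], "a1": ["a2"], "a2": ["a3"], "b": ["c"], "c": []}
wide = crawl_rounds(graph, ["a"], rounds=3, top_n=10)
narrow = crawl_rounds(graph, ["a"], rounds=3, top_n=1)
```

With the larger cap, three rounds reach "a2"; with top_n=1, the third round
is spent on "b" and "a2" is never fetched.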

When I tried to merge the crawl results into the generate/fetch/update
results, I got errors.

Is there any way to do this? Also, is there any way to set a priority on
certain sites, something like: these need to be updated daily and the rest
weekly?
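
As an illustration of the kind of per-site refresh policy being asked
about, here is a minimal Python sketch. It is not a Nutch feature or API:
the host names, the REFETCH_INTERVALS table, and the functions is_due and
due_urls are all hypothetical. It only shows the rule "some hosts daily,
everything else weekly" applied to a table of last-fetch times.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Hypothetical per-host refetch intervals; hosts not listed here
# fall back to DEFAULT_INTERVAL.
REFETCH_INTERVALS = {"news.example.com": timedelta(days=1)}
DEFAULT_INTERVAL = timedelta(days=7)


def is_due(url, last_fetched, now):
    """Return True if the URL's host-specific interval has elapsed."""
    host = urlparse(url).hostname
    interval = REFETCH_INTERVALS.get(host, DEFAULT_INTERVAL)
    return now - last_fetched >= interval


def due_urls(fetch_times, now):
    """Return the URLs in fetch_times (URL -> last fetch) that are due."""
    return [u for u, t in fetch_times.items() if is_due(u, t, now)]


# Made-up fetch history for three pages on two hosts.
now = datetime(2008, 6, 11)
fetch_times = {
    "http://news.example.com/a": datetime(2008, 6, 10),
    "http://other.example.org/b": datetime(2008, 6, 10),
    "http://other.example.org/c": datetime(2008, 6, 1),
}
due = due_urls(fetch_times, now)
```

Here the daily host's page is due after one day, while the other host's
pages are due only once a week has passed.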

Thank you in advance for any help.

-John