Hi, I am new to Nutch but have played around with it, and so far I really
like the tool.
I would like to be able to deep crawl a couple of sites and also spider
crawl other sites, so that the end result is an index with a large portion
of several sites and more of an organic spider of the rest.
I have tried to do this in several ways. I have used the crawl command and
set the depth level etc., which works: I get a valid index and results.
I have also injected the individual URLs of the starting sites into the
crawldb and iterated through the generate/fetch/update sequence. In this
case it covers the whole web index, but it doesn't seem to add any
additional depth on the starting URLs, which is an issue.
When I tried to merge the crawl results into the generate/fetch/update
results, I got errors.
Is there any way to do this? Also, is there any way to set a priority on
certain sites, something like: these need to be updated daily and the rest
weekly?
Thank you in advance for any help.
-John