Hello,
I am trying to crawl a number of news sites. I would like to
index only specific pages, selected by URL, e.g.
http://www.volkskrant.nl/[a-z]+/article[0-9]+.ece/.+ . It seems that
when I configure this pattern in the crawl-url filter, Nutch is unable to
crawl the complete site (this happens when there are no links between the
pages that match the pattern). Is there another configuration option that
lets Nutch crawl the complete site but index only specific pages?
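
To make the distinction concrete, here is a minimal sketch in plain Python (not Nutch code or Nutch configuration) of the behaviour I am after: one rule decides which URLs may be followed, and a separate, stricter rule decides which pages are indexed. The function names and the sample URLs are made up for illustration, and I have escaped the dots in the pattern so they match literally.

```python
import re

# Assumption: every page on this host may be fetched and followed.
SITE_PREFIX = "http://www.volkskrant.nl/"

# The pattern from this post, with the dots escaped (illustrative only).
INDEX_PATTERN = re.compile(
    r"http://www\.volkskrant\.nl/[a-z]+/article[0-9]+\.ece/.+"
)


def should_crawl(url: str) -> bool:
    """Loose rule: follow any URL on the site so links can be discovered."""
    return url.startswith(SITE_PREFIX)


def should_index(url: str) -> bool:
    """Strict rule: index only crawled URLs that match the article pattern."""
    return should_crawl(url) and INDEX_PATTERN.fullmatch(url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, not taken from the real site.
    samples = [
        "http://www.volkskrant.nl/binnenland/",
        "http://www.volkskrant.nl/binnenland/article12345.ece/some_title",
        "http://example.com/other/article1.ece/x",
    ]
    for url in samples:
        print(url, "crawl:", should_crawl(url), "index:", should_index(url))
```

With a single filter, the section page in the first sample would be rejected, so the article linked from it would never be found. With two separate rules it is crawled but not indexed.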
Sebastiaan