Amazon.com, to take a common example, has pages whose links are
relative, that is, they omit the http://www.amazon.com prefix. The
browser prepends the prefix when the link is followed, and the
resulting absolute link resolves successfully.
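
To make the behaviour concrete, here is a minimal Python sketch of how a relative link is resolved against the URL of the page it appears on. The page URL and link paths are made up for illustration and are not real Amazon links.

```python
from urllib.parse import urljoin

# Hypothetical page URL and relative links, for illustration only.
page_url = "http://www.amazon.com/books/index.html"
relative_links = ["/gp/help", "detail.html", "../music/"]

def resolve_links(base, links):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(base, link) for link in links]

resolved = resolve_links(page_url, relative_links)
for url in resolved:
    print(url)
```

Each resolved URL carries the www.amazon.com host even though none of the original links mentioned it.
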
As far as I can tell, Nutch crawls these pages only if I weaken the
patterns in crawl-urlfilter.txt so that they no longer require
amazon.com in the URL, but then the crawl wanders outside the
amazon.com site.
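
The trade-off I believe I am seeing can be sketched in Python. The strict pattern is only modelled on the kind of domain rule found in crawl-urlfilter.txt, the weakened pattern and the sample links are hypothetical, and the sketch assumes the filter sees each link exactly as written in the page, which is my guess and not confirmed Nutch behaviour.

```python
import re

# Strict rule: host must be amazon.com or a subdomain (assumed form).
STRICT = re.compile(r"^http://([a-z0-9]*\.)*amazon\.com/")
# Weakened rule: any http URL or site-relative path, no host requirement.
WEAK = re.compile(r"^(http://|/)")

# Hypothetical links as they might appear in a page's HTML.
links = [
    "/gp/help",                       # relative, no host
    "http://www.amazon.com/books/",   # absolute, on site
    "http://www.example.org/ad",      # absolute, off site
]

def accepted(pattern, urls):
    """Return the URLs that the given filter pattern lets through."""
    return [u for u in urls if pattern.search(u)]

strict_ok = accepted(STRICT, links)
weak_ok = accepted(WEAK, links)
print("strict:", strict_ok)
print("weak:  ", weak_ok)
```

Under that assumption the strict rule drops the relative link, while the weakened rule keeps it but also lets the off-site link through.
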
Can anyone suggest a crawl-urlfilter.txt pattern, or some other
mechanism, that follows these relative links while keeping the crawl
inside amazon.com? Or perhaps I am misunderstanding how this works,
in which case an explanation would be appreciated.
Thank you in advance.
Jim Van Sciver