Java Mailing List Archive

http://www.java2.5341.com/

nutch-user.lucene

Newbie question: crawling sites like amazon.com without leaving site

Jim Van Sciver

2008-10-03

Amazon.com, as a common example, has pages whose links are relative,
that is, they do not include the www.amazon.com prefix. The prefix is
automatically prepended when the link is followed, and the resulting
composite link resolves successfully.
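
To make the prefix behavior concrete, here is a minimal Java sketch of
how a relative link is resolved against the URL of the page it appears
on. The class name, the page URL, and the link paths are hypothetical
and chosen only for illustration; this uses the standard java.net.URI
class and is not taken from Nutch itself.

```java
import java.net.URI;

public class RelativeLinkDemo {
    // Resolve an href found on a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL, for illustration only.
        String page = "http://www.amazon.com/books/index.html";

        // A relative link picks up the scheme and host of the page.
        System.out.println(resolve(page, "/gp/help/contact.html"));

        // An absolute link is left unchanged.
        System.out.println(resolve(page, "http://www.example.com/"));
    }
}
```

If a crawler resolves links this way before filtering them, the
resolved URL already carries the host name even though the link text
in the page does not.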

I think I am observing that Nutch can crawl these pages only if the
crawl-urlfilter.txt patterns are weakened so that they no longer
require amazon.com in the URL. Once the patterns are weakened,
however, the crawl begins to leave the amazon.com site.
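
As a sketch of what a site-restricting pattern might look like, the
Java snippet below tests a regular expression written in the style
commonly used for crawl-urlfilter.txt include rules. The pattern, the
class name, and the sample URLs are assumptions made for illustration.
The snippet also assumes the filter is applied to fully qualified
URLs; whether Nutch applies it before or after resolving relative
links is exactly the point in question here.

```java
import java.util.regex.Pattern;

public class SiteFilterDemo {
    // Hypothetical include pattern: http URLs whose host is amazon.com
    // or any subdomain of it.
    static final Pattern IN_SITE =
        Pattern.compile("^http://([a-z0-9]*\\.)*amazon\\.com/");

    // Return true if the URL would be kept by the include pattern.
    static boolean accept(String url) {
        return IN_SITE.matcher(url).find();
    }

    public static void main(String[] args) {
        // Sample URLs are hypothetical.
        System.out.println(accept("http://www.amazon.com/gp/help/contact.html"));
        System.out.println(accept("http://www.example.com/amazon.com/"));
    }
}
```

In the filter file itself, the corresponding rule would presumably be
written as an include line such as
+^http://([a-z0-9]*\.)*amazon.com/ followed by a final -. line that
rejects everything else.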

Does anyone have a suggestion for a crawl-urlfilter pattern that
achieves my desired goal or another mechanism for doing so? Or
perhaps I am misunderstanding, in which case an explanation would be
appreciated.

Thank you in advance.
Jim Van Sciver