Java Mailing List Archive

http://www.java2.5341.com/

List: nutch-user.lucene

Only crawling out from pages that meet certain criteria

John Thompson

2008-07-04


Is there a way to prevent Nutch from fetching the outlinks of pages that I
judge to be irrelevant? I make that decision while the page is being parsed,
in my parse filter. I realize that I can stop Nutch from indexing such pages,
but I believe the index is separate from the structure that determines which
new pages get fetched.

Best,
John