two questions about nutch url filter when inject

beansproud

2008-06-18

Hi all,

  I have two questions about the URL filter used during inject.

  First, I found that during inject, Nutch uses "regex-urlfilter.txt" as its
default filter. In that file, I found this rule:
  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/.+?)/.*?\1/.*?\1/
  I can't understand why this type of URL would cause loops. If anybody knows,
please tell me.
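
  For illustration, here is a minimal sketch of what this rule matches. It
assumes that the leading "-" only marks the line as an exclude rule, that the
rest is an ordinary java.util.regex pattern, and that the filter checks it with
find(); the class name, helper method, and example.com URLs are hypothetical.
URLs of this shape typically come from relative links that resolve to
ever-longer paths, so a crawler that follows them may never finish.

```java
import java.util.regex.Pattern;

public class UrlLoopRuleDemo {
    // The rule from regex-urlfilter.txt without its leading '-'
    // (assumption: '-' only marks the line as an exclude rule).
    static final Pattern LOOP_RULE = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Hypothetical helper: true if the rule matches somewhere in the URL,
    // i.e. the URL would be skipped by an exclude rule with this pattern.
    static boolean matchesLoopRule(String url) {
        return LOOP_RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        // "/a" appears three times, each time followed by "/": rule matches.
        System.out.println(matchesLoopRule("http://example.com/a/b/a/b/a/b/")); // true
        // No segment repeats: rule does not match.
        System.out.println(matchesLoopRule("http://example.com/a/b/c/"));       // false
    }
}
```

  The group (/.+?) captures one slash-prefixed piece of the path, and each \1
requires that same piece to appear again later, so the rule only fires when the
piece occurs at least three times.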

  Second, when I changed this file, Nutch's output did not change at all.
Only after I recompiled did the change take effect. Figuring this out took me
3 hours. Can anybody tell me why a rebuild is needed?


--
Sent from the Nutch - User mailing list archive at Nabble.com.
