Hi all,
I have two questions about URL filtering during inject.
First, I found that inject uses "regex-urlfilter.txt" as its default
filter, and in that file I found this rule:
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
I don't understand why this type of URL would cause loops. If anybody knows,
please tell me.
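To make the question concrete, here is a minimal sketch of what the rule appears to match. It is plain Python, not Nutch code: the function name and the example URLs are made up for illustration, and it assumes the filter searches for the pattern anywhere in the URL rather than requiring a full match.

```python
import re

# Rule copied from regex-urlfilter.txt without the leading "-"
# (in that file "-" marks an exclude rule, "+" an include rule).
REPEATED_SEGMENT = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

def is_excluded(url: str) -> bool:
    """Hypothetical helper: True if the pattern is found in the URL."""
    return REPEATED_SEGMENT.search(url) is not None

# The segment "/a" occurs three times, each followed by "/".
print(is_excluded("http://example.com/a/b/a/b/a/b/"))     # True
# No slash-delimited segment repeats here.
print(is_excluded("http://example.com/docs/index.html"))  # False
```

So the rule seems to reject URLs in which the same path segment shows up three times, each followed by a slash.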
Second, when I changed this file, the output of Nutch did not change at all.
Only after I recompiled did the change take effect. This cost me three hours.
Can anybody tell me why?
--
Sent from the Nutch - User mailing list archive at Nabble.com.