Ignoring a URL in the crawl

sangeet

2008-09-29

I'm having a hard time keeping a particular URL out of my crawl.
In regex-urlfilter.txt I added the following rule to exclude it:
-^http://([a-z0-9]*\.)*bhejacry.com/forums/

This URL is not in the seed list in my urls directory. I also have
'db.ignore.external.links' set to 'true'.

However, I still see the following during the crawl:

fetching http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=2774
fetching http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=96
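
For anyone debugging the same thing, here is a rough sketch for checking the rule against the fetched URLs outside of Nutch. It is not Nutch code. It assumes the filter tries rules in file order, takes the first one whose pattern is found anywhere in the URL, and rejects a URL that matches no rule. The trailing "+." catch-all and the names RULES and accepts are made up for the illustration.

```python
import re

# Rules in file order: "-" excludes, "+" includes.
# Assumption: the first rule whose pattern is found in the URL decides.
# The "+." catch-all is assumed here; it is not quoted in the post.
RULES = [
    r"-^http://([a-z0-9]*\.)*bhejacry.com/forums/",
    r"+.",
]


def accepts(url, rules=RULES):
    """Return True if the URL would be kept under the assumed semantics."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


urls = [
    "http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=2774",
    "http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=96",
]

for url in urls:
    print(accepts(url), url)
```

Under these assumptions both URLs are rejected, so the pattern itself does match them. If that also holds in the real setup, the cause may lie elsewhere: an earlier "+" rule that matches first, or the crawl reading a different filter file than the one edited (some Nutch versions point the one-step crawl command at crawl-urlfilter.txt instead).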

How do I ignore these URLs?
--
Sent from the Nutch - User mailing list archive at Nabble.com.