nutch-user.lucene

problem with RegExURLFilter class

ajaxtrend

2008-10-20

Hi,
I am facing a strange problem with the URL regexes in crawl-urlfilter.txt. Before using any regex for URLs, I test it in a standalone class, and it works correctly there, i.e. pattern.matcher(url).find() returns true.
But when the same URL and regex are used during crawling, it returns false. I am not sure why it behaves differently.
Let me give an example:

Regex in crawl-urlfilter.txt:

^http://bangalore.locanto.in/(used-cars|ID_\\d+)/((\\d*/(\\d+/)*)|(.*.html))

URL: http://bangalore.locanto.in/used-cars/902/

During standalone testing (not in the Nutch environment), pattern.matcher(url).find() returns true. However, in the Nutch environment it returns false.

Appreciate your help on this.

- RB
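
A minimal sketch of the standalone check described in the post, using only java.util.regex. The class name RegexEscapeCheck and the helper found() are hypothetical. It also illustrates one possible explanation, which is an assumption and not confirmed in this thread: inside a Java string literal, \\d compiles to the regex \d (a digit), but if the same doubled backslashes are typed into a text file that is read verbatim, the regex engine receives \\d, which matches a literal backslash followed by the letter d.

```java
import java.util.regex.Pattern;

public class RegexEscapeCheck {
    static final String URL = "http://bangalore.locanto.in/used-cars/902/";

    // The regex as it would appear in Java source: the compiler turns each
    // "\\d" into the two characters \d, which the regex engine reads as "a digit".
    static final String FROM_JAVA_LITERAL =
        "^http://bangalore.locanto.in/(used-cars|ID_\\d+)/((\\d*/(\\d+/)*)|(.*.html))";

    // Assumption: the same text, doubled backslashes included, is read verbatim
    // from a file. To reproduce those exact characters in Java source, every
    // backslash must be doubled again, so the engine sees \\d (backslash, then d).
    static final String FROM_TEXT_FILE =
        "^http://bangalore.locanto.in/(used-cars|ID_\\\\d+)/((\\\\d*/(\\\\d+/)*)|(.*.html))";

    // Same call as in the post: compile the regex and search the URL for a match.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println("Java literal:  " + found(FROM_JAVA_LITERAL, URL)); // true
        System.out.println("Raw file text: " + found(FROM_TEXT_FILE, URL));    // false
    }
}
```

If this is the cause, writing \d with a single backslash in the file should make the two results agree. Whether the filter file is read verbatim in this particular setup is an assumption here.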
