Java Mailing List Archive

http://www.java2.5341.com/


Quick Questions about NutchAnalysis.jj

student_t

2008-11-10



Hi Nutch Experts,

I understand that the default analyzer used during indexing is
NutchDocumentAnalyzer. I would like more control over the term() parser
specification, e.g., to remove non-ASCII characters. Could you please
shed some light on how to achieve this?

I looked at the nonTerm() function, but I think it is used only by the
QueryParser, so I believe the term() function is what I need to change. My
idea is to let the analyzer consume and discard the non-ASCII characters,
but I don't know how to do that.
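
To make the idea concrete, here is a minimal sketch of what I mean by
"eating" non-ASCII characters. It is plain Java applied to the text before
it reaches any analyzer; it is not Nutch or JavaCC code, and the class and
method names (AsciiOnlyFilter, stripNonAscii) are hypothetical names made up
for illustration. It assumes that "non-ASCII" means any char above U+007F.

```java
public class AsciiOnlyFilter {

    // Hypothetical helper, not part of Nutch: copies only chars in the
    // ASCII range (U+0000..U+007F) and silently drops everything else.
    public static String stripNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c <= 0x7F) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Mixed input: an accented Latin letter, two digits from the
        // "\u0a66"-"\u0a6f" range, and plain ASCII text.
        System.out.println(stripNonAscii("caf\u00e9 \u0a66\u0a67 nutch42"));
    }
}
```

Note that dropping characters this way joins whatever surrounded them, so a
word containing one non-ASCII letter becomes a shorter ASCII word rather
than being split or removed.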

There are some Unicode entries in the TOKEN definition, including
"\u0a66"-"\u0a6f" in the digit section. I wonder what will happen if I
remove these non-ASCII characters from the definition.
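
For reference, the range "\u0a66"-"\u0a6f" holds the Gurmukhi digits 0
through 9. The small check below uses only the standard java.lang.Character
API to show how Java classifies them; the class name and the isAsciiDigit
helper are hypothetical. It says nothing about how the generated parser
would treat these characters once they are removed from the grammar, which
is the part I am unsure about.

```java
public class GurmukhiDigitCheck {

    // Hypothetical helper: true only for the ASCII digits '0'..'9'.
    public static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        // Walk the range quoted from the digit section of the grammar.
        for (char c = '\u0a66'; c <= '\u0a6f'; c++) {
            System.out.println(
                Integer.toHexString(c)
                + " block=" + Character.UnicodeBlock.of(c)
                + " isDigit=" + Character.isDigit(c)
                + " value=" + Character.getNumericValue(c)
                + " asciiDigit=" + isAsciiDigit(c));
        }
    }
}
```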

Thanks in advance for your help!

student_t

--
Sent from the Nutch - User mailing list archive at Nabble.com.
