Hi Nutch Experts,
I understand that the default analyzer used during indexing is
NutchDocumentAnalyzer. I would like more control over the term() rule in
the parser specification, e.g., to remove non-ASCII characters. Could you
please shed some light on how to achieve this?
I looked at the nonTerm() function, but I think it is used only by the
QueryParser, so the term() function is probably what I need to change. My
idea is to have the analyzer discard those non-ASCII characters, but I
don't know how to do that.
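To make the goal concrete, here is a minimal sketch in plain Java of the
behavior I am after. It is not Nutch or Lucene code: the class AsciiFilter
and the method stripNonAscii are hypothetical names made up for
illustration, and I am assuming "non-ASCII" means any character above
U+007F. What I would like is for the analyzer to do something equivalent
to this on each term.

```java
// Hypothetical helper for illustration only; not part of Nutch or Lucene.
final class AsciiFilter {
    private AsciiFilter() {}

    /** Returns the input with every character above U+007F dropped. */
    static String stripNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c <= 0x7F) {   // keep plain ASCII only (assumed definition)
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The accented letter and the two non-ASCII digits are dropped;
        // the ASCII letters, spaces, and digits are kept.
        System.out.println(stripNonAscii("caf\u00e9 \u0a66\u0a67 42"));
    }
}
```

Doing this as a separate step on the text is only one option; the open
question is whether the same effect can be had inside the analyzer itself.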
There are some Unicode entries in the TOKEN definition, including
"\u0a66"-"\u0a6f" in the digit section. I wonder what would happen if I
removed these non-ASCII characters from the definition.
Thanks in advance for your help!
student_t
--
Sent from the Nutch - User mailing list archive at Nabble.com.