Hi Nutch-user.
I've been crawling our internal sites for a while now and the index is
growing rapidly. I have filled the 100G partition I have been allotted
and am looking at finding a sustainable way to maintain the indexes at
around this size rather than continuously expanding them. I read in
some of the script examples that it is common practice to rebuild the
indexes from scratch about every 6 months. Is this still the case?
Also, have there been any developments in the recrawl process since Nutch 0.9?
Currently recrawling almost doubles the size of my data before merging
back down to a single set. I've seen some Java code examples of
recrawl functionality built into the Nutch classes rather than just in
a controlling script, but has this made its way into the dev branches?
One last question. Has anyone managed to get the stemming plugin
(written by Howie Wang for 0.7, and updated for 0.8 by Matthew Holt)
to work in nutch-1.0-dev? I'm eager to try it, but my Java skills are
not good enough to figure out why my modifications aren't working.
Thanks in advance for the information :)