Hi,
In our Nutch deployment, I need access to the cached versions of the
crawled documents so that I can run a post-processing task over all of
them.
I was wondering whether this is the right approach:
1. Use the getContent() method of the FetchedSegments class to get the
content of text/html documents.
2. Use the getParseText() method of the FetchedSegments class to get the
text of documents in other formats.
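To make the two steps concrete, here is a minimal sketch of the dispatch I have in mind. SegmentReader and its two methods are hypothetical stand-ins for the two FetchedSegments lookups, not the actual Nutch API, and decoding the raw bytes as UTF-8 is an assumption as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CachedDocPostProcessor {

    /** Hypothetical stand-in for the two FetchedSegments lookups; NOT the Nutch API. */
    interface SegmentReader {
        byte[] rawContent(String url);   // plays the role of getContent()
        String parsedText(String url);   // plays the role of getParseText()
    }

    /** Step 1 for text/html documents, step 2 for every other format. */
    static String textFor(String url, String contentType, SegmentReader reader) {
        boolean isHtml = contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
        if (isHtml) {
            // Assumption: the cached bytes are UTF-8 encoded.
            return new String(reader.rawContent(url), StandardCharsets.UTF_8);
        }
        return reader.parsedText(url);
    }

    public static void main(String[] args) {
        // In-memory fake so the sketch runs without a crawl.
        SegmentReader fake = new SegmentReader() {
            public byte[] rawContent(String url) {
                return "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);
            }
            public String parsedText(String url) {
                return "Hello from a PDF";
            }
        };
        System.out.println(textFor("http://example.com/a.html", "text/html; charset=utf-8", fake));
        System.out.println(textFor("http://example.com/b.pdf", "application/pdf", fake));
    }
}
```

The post-processing task would then call textFor() once per document, which is why I need a way to enumerate every crawled document rather than only search results.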
Since FetchedSegments is a class in the org.apache.nutch.searcher package,
would it only help in getting the documents returned by a search (i.e., the
hits), or is there a way to get all the indexed documents?
Or is there a simpler or better way to do this?
Regards,
Venkateshprasanna.
--
Sent from the Nutch - User mailing list archive at Nabble.com.