On Mon, Sep 29, 2008 at 9:19 PM, Kevin MacDonald <kevin@(protected)> wrote:
> Once I have done a crawl I have a need to pass all of the raw HTML and
> javascript that has been fetched through a custom parser. During a fetch
> does nutch store all of the raw content including HTML tags on disk?
Yes, if fetcher.store.content is set to true (which it is by default).
The raw content of each page is saved under the <segment>/content directory.
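If you want to set the property explicitly rather than rely on the default, the override might look roughly like this. This is only a sketch: it assumes the usual conf/nutch-site.xml override file, which is not mentioned above, and the description text is my own wording.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>Keep the raw fetched content in each segment.</description>
</property>
```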
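To retrieve the raw content for a particular URL, you can try this:
bin/nutch readseg -get <segment> <url> -noparse -noparsedata -nofetch \
    -nogenerate -noparsetext
The -no* options suppress the other parts of the segment, so only the
stored content is printed.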
> Thanks
>
> Kevin
>
--
Doğacan Güney