Hi,
I have been trying out Nutch for indexing Microsoft VSS shadow folders.
I wanted to ask about an issue on which I have not been able to make much progress.
Rgds,
Sridhar
******************************************************************************************
Error when crawling RTF documents:
Error parsing: <!-- URL SNIPPED -->:
failed(2,0): Can't be handled as Microsoft document.
java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
This happens even though I have added the following to nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|rtf|zip|msexcel|mspowerpoint|msword|pdf|rss|swf)|index-basic|index-more|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
I have also added the following rule to crawl-urlfilter.txt:
+\.(ppt|doc|pdf|rtf|zip)$
******************************************************************************************
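For what it is worth, the two numbers in the exception above can be turned back into the bytes they were read from. The sketch below is not Nutch code; it assumes the parser read the first 8 bytes of the file as a signed 64-bit little-endian integer, and the helper name signature_bytes is made up for illustration.

```python
import struct


def signature_bytes(value: int) -> bytes:
    """Return the 8 bytes that a signed 64-bit little-endian read turns into `value`.

    Assumption: the header check reads the first 8 bytes of the file as a
    little-endian long. This helper is hypothetical, not part of Nutch.
    """
    return struct.pack("<q", value)


# The two values reported in the exception message.
read = signature_bytes(7015536635646467195)
expected = signature_bytes(-2226271756974174256)

print(read.decode("ascii"))  # {\rtf1\a
print(expected.hex())        # d0cf11e0a1b11ae1
```

The bytes actually read spell out the start of an RTF file, while the expected bytes are the signature of the binary (OLE2) Microsoft Office format. That suggests the file is a genuine RTF document that is being handed to the Microsoft document parser rather than to parse-rtf, possibly because of its file extension or reported content type. I have not confirmed this against the Nutch source.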