Hi,
Has anyone else run into an "Invalid header signature" error with RTF docs?
It comes up frequently, yet the same RTF files open fine in WordPad etc.
Rgds,
Sridhar
***************************************
Error Crawling rtf documents
Error parsing: test.rtf :
failed(2,0): Can't be handled as Microsoft document.
java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
This happens even though I have added the following to nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|rtf|zip|msexcel|mspowerpoint|msword|pdf|rss|swf)|index-basic|index-more|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression.</description>
</property>
and I have included the following in crawl-urlfilter.txt:
( +\.(ppt|doc|pdf|rtf|zip)$ )
***********************************************************
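A note on reading the two numbers in the exception: they look like the first 8 bytes of the file interpreted as a little-endian signed 64-bit integer. The sketch below (plain Python, not part of Nutch; the function name signature_bytes is made up for illustration) turns both values back into bytes. The "expected" value decodes to the OLE2 compound-document magic used by binary .doc files, and the "read" value decodes to the start of an ordinary RTF file. That would suggest test.rtf itself is fine and is being handed to the MS Word parser instead of parse-rtf; why that happens is not established here.

```python
import struct

# Values copied from the exception message above.
READ = 7015536635646467195       # what the parser found at the start of test.rtf
EXPECTED = -2226271756974174256  # what the parser wanted to find


def signature_bytes(value: int) -> bytes:
    """Return the 8 bytes behind a little-endian signed 64-bit signature.

    Assumption: the parser reads the header as a little-endian long,
    which is consistent with both numbers in the message.
    """
    return struct.pack("<q", value)


if __name__ == "__main__":
    # OLE2 compound-document magic: d0 cf 11 e0 a1 b1 1a e1
    print(signature_bytes(EXPECTED).hex())
    # First 8 bytes of the RTF file as ASCII: {\rtf1\a
    print(signature_bytes(READ).decode("ascii"))
```

If the second line printed anything other than the usual RTF prefix, the file itself would be the suspect; here it is not.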