RTF Files - Java io exception - Invalid Header Signature

V Sridhar

2008-08-24


Hi,

I wanted to see whether others have run into an "Invalid header signature" error with RTF documents.
It comes up frequently, yet when I open the same RTF files in WordPad and similar programs, they open fine.


Regards,
Sridhar


***************************************
Error crawling RTF documents

Error parsing: test.rtf :
failed(2,0): Can't be handled as Microsoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256


This happens even though I have added the following to nutch-site.xml:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|rtf|zip|msexcel|mspowerpoint|msword|pdf|rss|swf)|index-basic|index-more|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression.</description>
</property>

and I have included this line in crawl-urlfilter.txt:
( +\.(ppt|doc|pdf|rtf|zip)$ )
***********************************************************
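
The two numbers in the error message above can be decoded to see what the parser actually found at the start of the file. The sketch below assumes the Microsoft document parser reads the first eight bytes of the file as a little-endian long; the class and method names (HeaderSignature, toLittleEndianBytes, toHex, toAscii) are made up for illustration and are not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes the "read" and "expected" values from the
// error message back into the eight header bytes they represent.
public class HeaderSignature {

    // Assumption: the signature was read as a little-endian 64-bit value.
    static byte[] toLittleEndianBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    // Formats bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Interprets the bytes as ASCII text.
    static String toAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] read = toLittleEndianBytes(7015536635646467195L);
        byte[] expected = toLittleEndianBytes(-2226271756974174256L);

        // prints: read:     7B 5C 72 74 66 31 5C 61  ("{\rtf1\a")
        System.out.println("read:     " + toHex(read) + "  (\"" + toAscii(read) + "\")");
        // prints: expected: D0 CF 11 E0 A1 B1 1A E1
        System.out.println("expected: " + toHex(expected));
    }
}
```

Under that assumption, the "read" value is the text {\rtf1\a, which is how an RTF file normally begins, and the "expected" value is the D0 CF 11 E0 A1 B1 1A E1 signature of the binary Microsoft Office container format. That would suggest test.rtf is a valid RTF file that is being handed to the Microsoft document parser rather than to parse-rtf. Why it is routed there (for example, the content type reported for the file) is not something the log above shows.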

