Java Mailing List Archive

http://www.java2.5341.com/


Error Crawling RTF Documents

V Sridhar

2008-08-22


Hi,

I have been trying out Nutch for indexing Microsoft VSS shadow folders.
I wanted to ask about a few issues on which I have not been able to make much progress.


Rgds,
Sridhar




******************************************************************************************
Error crawling RTF documents

Error parsing: <!-- URL SNIPPED -->:
failed(2,0): Can't be handled as Microsoft document. java.io.IOException:
Invalid header signature; read 7015536635646467195, expected
-2226271756974174256


This happens even though I have added the following to nutch-site.xml:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|rtf|zip|msexcel|mspowerpoint|msword|pdf|rss|swf)|index-basic|index-more|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>

I have also included the following in crawl-urlfilter.txt:

+\.(ppt|doc|pdf|rtf|zip)$
******************************************************************************************
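One thing that may help narrow this down is decoding the two numbers in the "Invalid header signature" message back into bytes. The sketch below assumes the parser reads the first 8 bytes of the file as a little-endian long (as Apache POI does for the OLE2 header). The class and method names are only illustrative and are not part of Nutch or POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class HeaderSignatureDecoder {

    // Assumption: the signature was read as a little-endian long, so writing it
    // back in little-endian order recovers the first 8 bytes of the file.
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        long read = 7015536635646467195L;      // "read" value from the error message
        long expected = -2226271756974174256L; // "expected" value from the error message

        byte[] readBytes = toFileBytes(read);
        System.out.println("read:     " + toHex(readBytes)
                + "  ascii=" + new String(readBytes, StandardCharsets.US_ASCII));
        System.out.println("expected: " + toHex(toFileBytes(expected)));
        // read:     7B 5C 72 74 66 31 5C 61  ascii={\rtf1\a
        // expected: D0 CF 11 E0 A1 B1 1A E1
    }
}
```

Under that assumption, the expected value is the OLE2 compound-document magic number, while the bytes actually read spell `{\rtf1\a`, the start of an RTF file. That would suggest an RTF file is being handed to the Microsoft document parser, for example a .doc file whose content is really RTF, but this is a guess from the numbers alone.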

