Hi,
I am trying to create an index from a few ODT files, but Nutch identifies the
ODT files as ZIP content (application/zip). Can someone help me find what is
wrong with my configuration XML?
Thanks,
Alexandre Haguiar
Error parsing: http://localhost/arquivos/testOO.sxw:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/testOO.sxw
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
Error parsing: http://localhost/arquivos/softwarelivre.odt:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/softwarelivre.odt
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
Error parsing: http://localhost/arquivos/ODF.odt:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/ODF.odt
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
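For context on why application/zip shows up at all: an ODT file is a ZIP package, so any detector that looks only at the leading "PK" bytes will report ZIP. The real document type is declared in a package entry named "mimetype". The sketch below (plain Python standard library, not Nutch code; the helper names odf_declared_mimetype and make_sample_package are made up for illustration) builds a tiny ODF-like package in memory and reads that entry back. It only illustrates the file format and does not reproduce how Nutch itself detects content types.

```python
import io
import zipfile

# Media type that an OpenDocument Text package declares in its "mimetype" entry.
ODT_MIME = "application/vnd.oasis.opendocument.text"


def odf_declared_mimetype(data: bytes):
    """Return the media type declared inside an ODF package, or None.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    if not data.startswith(b"PK"):  # ZIP local file header signature
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            if "mimetype" not in zf.namelist():
                return None
            return zf.read("mimetype").decode("ascii").strip()
    except zipfile.BadZipFile:
        return None


def make_sample_package(mimetype: str) -> bytes:
    """Build a minimal ODF-like ZIP in memory (a stand-in for a real .odt)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("mimetype", mimetype)  # stored uncompressed, written first
        zf.writestr("content.xml", "<office:document-content/>")
    return buf.getvalue()


sample = make_sample_package(ODT_MIME)
print(sample[:2])                     # the bytes a magic-based detector sees
print(odf_declared_mimetype(sample))  # the type the package itself declares
```

Running the same helper on the bytes of a real .odt file should return application/vnd.oasis.opendocument.text, while the older .sxw format declares application/vnd.sun.xml.writer.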
crawl-urlfilter.txt
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/arquivos
nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>SERPRO Busca</value>
<description>Sistema de Busca do SERPRO
</description>
</property>
<property>
<name>http.agent.description</name>
<value>SERPRO Spiderman</value>
<description>SERPRO spiderman
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://localhost/nutch</value>
<description>http://localhost/nutch
</description>
</property>
<property>
<name>http.agent.email</name>
<value>Email</value>
<description>alexandre.aguiar@(protected)
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|pdf|xml|msword|odt)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
</configuration>
--
Alexandre Haguiar