Hi,
I am running Nutch 0.9 and am trying to use it to index files on my local file system, without much luck. I believe I have configured things correctly; however, no files are being indexed and no errors are being reported. Note that I have looked through the various posts on this topic on the mailing list and tried several variations of the configuration.
I am providing details of my configuration and log files below. I would appreciate any insight people might have.
Best,
mw
Details:
OS: Windows Vista (note I have turned off defender and firewall)
Command: bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
The urls file contains only:
```````````````````````````````````````````````````
file:///C:/MyData/
```````````````````````````````````````````````````
nutch-site.xml
`````````````````````````````````````
<property>
<name>http.agent.url</name>
<value></value>
<description>none</description>
</property>
<property>
<name>http.agent.email</name>
<value>none</value>
<description></description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
</configuration>
```````````````````````````````````````````````````
crawl-urlfilters.txt
```````````````````````````````````````````````````
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
# -^(file|ftp|mailto):
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
# -.
# get everything else
+^file:///C:/MyData/*
-.*
```````````````````````````````````````````````````
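As a sanity check on the filter file itself, the active rules above can be replayed outside Nutch. The snippet below is only a rough sketch, not Nutch's own code: it assumes the rules are tried top to bottom, that a pattern counts as matching when it is found anywhere in the URL, and that an unmatched URL is ignored, as the comments in the file describe. The function name `accepts` and the sample file names are made up for illustration.

```python
import re

# Active (uncommented) rules copied from crawl-urlfilters.txt above.
RULES = [
    r"-^(http|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-[?*!@=]",
    r"+^file:///C:/MyData/*",
    r"-.*",
]


def accepts(url, rules=RULES):
    """Hypothetical helper: apply first-match-wins filtering to one URL."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Assumption: a rule matches if its pattern is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched: the URL is ignored.
    return False


if __name__ == "__main__":
    for url in [
        "file:///C:/MyData/",
        "file:///C:/MyData/notes.txt",   # made-up file name
        "file:///C:/MyData/report.xls",  # made-up file name
    ]:
        print(url, accepts(url))
```

Under these assumptions the seed URL is accepted, so the filter file alone may not explain the empty index. The sketch also suggests that .xls and .ppt files would be rejected by the suffix rule even though parse-msexcel and parse-mspowerpoint are listed in plugin.includes.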