Hi guys,
I am creating an application to search online content as well as local files.
I am using the configuration and procedure described below.
The problem is that, even though I can search the online content, the local files are
not getting indexed.
Can anyone spot a problem in these steps, or a step I missed?
Any comments are welcome.
Thanks in advance.
C:\nutch-0.9\ --> The nutch source
C:\apache-tomcat-6.0.16\ --> Tomcat
C:\cygwin --> Cygwin
C:\LocalSearch\localfiles --> Some sample html and text files
C:\nutch-0.9\crawl --> Folder automatically created for indexing
Now I did the following steps:
1) Created a folder called urls inside C:\nutch-0.9
2) Created a file, source.txt, with content:
http://www.apache.org
file:///c:/LocalSearch/localfiles/
3) Edited conf/crawl-urlfilter.txt and added the following entries:
# accept hosts in MY.DOMAIN.NAME
+^file:///c:/LocalSearch/localfiles/*
+^http://([a-z0-9]*\.)*apache.org/
4) Edited conf/nutch-site.xml and added the following entries inside the
<configuration> element:
<property>
<name>searcher.dir</name>
<value>C:\nutch-0.9\crawl</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
<property>
<name>http.agent.name</name>
<value>MySearch</value>
<description>My Search Engine</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>www.unfamiliarfacts.blogspot.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
5) Built the project with the 'ant' command
6) Created the WAR file with the 'ant war' command
7) From the Cygwin console:
bin/nutch crawl urls -dir crawl -depth 3 -topN 10
8) Copied the WAR file from C:\nutch-0.9\build to Tomcat's webapps folder.
9) Made sure that
C:\apache-tomcat-6.0.16\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml
contains:
<property>
<name>searcher.dir</name>
<value>C:\nutch-0.9\crawl</value>
</property>
10) Accessed the search tool at http://localhost:8080/nutch-0.9/
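
The two regex-driven settings in steps 3 and 4 can be sanity-checked outside Nutch. The Python sketch below is only an approximation under stated assumptions: it assumes the URL filter applies its rules in file order with the first matching rule deciding (and rejects a URL that matches no rule), and that the plugin.includes value must match a whole plugin id. The helper names accepts and plugin_enabled, the sample URLs, and the extra "-^(file|ftp|mailto):" rule are hypothetical and used only for illustration. Python's re module stands in for Java's regex engine, which may differ in edge cases.

```python
import re

# Filter rules from step 3, in file order. "+" accepts, "-" rejects.
RULES = [
    "+^file:///c:/LocalSearch/localfiles/*",
    r"+^http://([a-z0-9]*\.)*apache.org/",
]

# plugin.includes value from step 4, split across lines for readability.
PLUGIN_INCLUDES = (
    "protocol-file|protocol-httpclient|protocol-http|urlfilter-regex|"
    "parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|"
    "summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the
    URL decides the outcome; a URL matching no rule is rejected."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


def plugin_enabled(includes, plugin_id):
    """Assumed semantics: the includes regex must match the whole id."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    samples = [
        "file:///c:/LocalSearch/localfiles/",
        "http://www.apache.org/",
        "http://www.apache.org",   # no trailing slash
        "http://example.com/",
    ]
    for url in samples:
        print(url, "->", accepts(RULES, url))

    # Hypothetical: a reject rule placed ahead of the entries from step 3.
    shadowed = ["-^(file|ftp|mailto):"] + RULES
    print("with earlier reject rule ->",
          accepts(shadowed, "file:///c:/LocalSearch/localfiles/"))

    for plugin_id in ["protocol-file", "parse-pdf", "protocol-ftp"]:
        print(plugin_id, "->", plugin_enabled(PLUGIN_INCLUDES, plugin_id))
```

Under these assumptions, rule order matters: a reject rule that appears earlier in crawl-urlfilter.txt would take precedence over the "+^file:" entry, so it may be worth checking the whole file rather than only the added lines.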
--
Sent from the Nutch - User mailing list archive at Nabble.com.