Unable to search LOCAL FILES

convoyer

2008-08-25


Hi guys,
I am creating an application to search online content as well as local files.
I am using the configuration and procedure described below.

The problem is that, even though I can search online content, the local files are
not getting indexed.
Can anyone spot a problem in these steps, or a step I missed?
Any comments are welcome.
Thanks in advance.


C:\nutch-0.9\ --> The Nutch source
C:\apache-tomcat-6.0.16\ --> Tomcat
C:\cygwin     --> Cygwin
C:\LocalSearch\localfiles --> Some sample HTML and text files
C:\nutch-0.9\crawl --> Folder automatically created for indexing

Then I did the following steps:

1) Created a folder called urls inside C:\nutch-0.9
2) Created a file, source.txt, with content:
 http://www.apache.org
 file:///c:/LocalSearch/localfiles/
3) Edited conf/crawl-urlfilter.txt and added the following entries:
# accept hosts in MY.DOMAIN.NAME
 +^file:///c:/LocalSearch/localfiles/*
 +^http://([a-z0-9]*\.)*apache.org/
4) Edited conf/nutch-site.xml and added the following entries inside the
<configuration> element:
   
 <property>
   <name>searcher.dir</name>
   <value>C:\nutch-0.9\crawl</value>
 </property>

 <property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>

 <property>
  <name>file.content.limit</name>
  <value>-1</value>
 </property>

 <property>
  <name>http.agent.name</name>
  <value>MySearch</value>
  <description>My Search Engine</description>
 </property>

 <property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
 </property>

 <property>
  <name>http.agent.url</name>
  <value>www.unfamiliarfacts.blogspot.com</value>
  <description>A URL to advertise in the User-Agent header. This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
 </property>

5) Built the project with the 'ant' command
6) Created the war file with the 'ant war' command
7) From the Cygwin console, ran:
 bin/nutch crawl urls -dir crawl -depth 3 -topN 10
8) Copied the war file from C:\nutch-0.9\build to Tomcat's webapps folder.
9) Made sure that
C:\apache-tomcat-6.0.16\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml
contains:

<property>
 <name>searcher.dir</name>
 <value>C:\nutch-0.9\crawl</value>
</property>

10) Accessed the search tool at http://localhost:8080/nutch-0.9/
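
The filter entries from step 3 can be checked outside Nutch with a small standalone program. The following is only a rough sketch, not Nutch's own code: it assumes the regex URL filter reads rules top to bottom, lets the first rule whose pattern is found in the URL decide ('+' accepts, '-' rejects), and rejects URLs that match no rule. The class name UrlFilterSketch, the method accepts, and the sample file a.html are hypothetical. The rule -^(file|ftp|mailto): is included only as an assumption about what else the filter file might contain above the added entries; the post does not show the rest of that file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: mimics "first matching rule wins".
class UrlFilterSketch {

    // Returns true if the first rule whose regex is found in the URL starts with '+'.
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            String line = rule.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);       // '+' or '-'
            String regex = line.substring(1); // the pattern after the sign
            if (Pattern.compile(regex).matcher(url).find()) {
                return sign == '+';
            }
        }
        return false; // no rule matched: treat the URL as rejected
    }

    public static void main(String[] args) {
        // The entries from step 3, in the order given there.
        List<String> added = Arrays.asList(
            "# accept hosts in MY.DOMAIN.NAME",
            "+^file:///c:/LocalSearch/localfiles/*",
            "+^http://([a-z0-9]*\\.)*apache.org/");

        // Assumption: an earlier "skip" rule sits above the added entries.
        List<String> withSkipRule = Arrays.asList(
            "-^(file|ftp|mailto):",
            "+^file:///c:/LocalSearch/localfiles/*",
            "+^http://([a-z0-9]*\\.)*apache.org/");

        List<String> urls = Arrays.asList(
            "http://www.apache.org/",
            "file:///c:/LocalSearch/localfiles/",
            "file:///c:/LocalSearch/localfiles/a.html"); // a.html is a made-up file name

        for (String url : urls) {
            System.out.println(url
                + "  added only: " + accepts(added, url)
                + "  with skip rule first: " + accepts(withSkipRule, url));
        }
    }
}
```

If the file URLs are accepted by the added entries alone but rejected once a skip rule precedes them, the order of the lines in conf/crawl-urlfilter.txt would be worth checking before looking at the other steps.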

--
Sent from the Nutch - User mailing list archive at Nabble.com.
