Crawling local filesystem to provide search access from web

ivrokv

2008-05-03

Hello,

I am using nutch-0.9 to index HTML files that reside on the same server
(Server1) as nutch itself. I therefore use the protocol-file plugin for
fetching and subsequent indexing, and this works well.

My problem is this:

I place the HTML files in the public folder of my Apache server (Server1,
the same server used for crawling the local files) so that they can be
accessed at URLs such as http://mysite.com/page1.html

When I run a query on the nutch JSP search page, each search result has a
URL that is a local filesystem path, such as
file:/home/htmlfiles/page1.html


Is it possible to give nutch the local filesystem path
(/home/htmlfiles/page1.html) in the urls folder for crawling and indexing,
but at query time have the nutch JSP page present the web URL
(http://mysite.com/page1.html) to the search user?

Would this involve some kind of URL normalization in nutch?
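
The mapping being asked for can be sketched in plain Java as below. This is
only an illustration of the prefix rewrite, not a Nutch API: the class name
FileUrlRewriter, the method toWebUrl, and both constants are hypothetical,
and the two prefix values are simply taken from the example paths in this
post. Where such a helper would be called (for instance in the search JSP
just before a hit's URL is printed) is likewise an assumption.

```java
// Hypothetical helper, not part of Nutch: turns an indexed file: URL into
// the public web URL before it is shown to the search user.
final class FileUrlRewriter {

    // Assumed values, copied from the example paths in the post.
    static final String LOCAL_ROOT = "/home/htmlfiles/";
    static final String WEB_ROOT = "http://mysite.com/";

    private FileUrlRewriter() {
    }

    static String toWebUrl(String indexedUrl) {
        if (indexedUrl == null || !indexedUrl.startsWith("file:")) {
            return indexedUrl; // not a file: URL, leave it untouched
        }
        // Accept both file:/path and file:///path by collapsing the
        // leading slashes into a single one.
        String path = indexedUrl.substring("file:".length())
                                .replaceFirst("^/+", "/");
        if (!path.startsWith(LOCAL_ROOT)) {
            return indexedUrl; // outside the published folder
        }
        return WEB_ROOT + path.substring(LOCAL_ROOT.length());
    }
}
```

With these assumed prefixes, FileUrlRewriter.toWebUrl("file:/home/htmlfiles/page1.html")
yields http://mysite.com/page1.html, while URLs outside /home/htmlfiles/ are
returned unchanged. The index itself keeps the file: URLs, so crawling can
still go through the file protocol.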


Ideally I would prefer to crawl the files from the local filesystem rather
than from the website root folder. I have noticed that crawling is much
faster over the file protocol (since the files are local to nutch) than when
I crawl from mysite.com, even though in both cases the files are on the same
physical server.

One obvious solution is to have nutch fetch the HTML pages from the
mysite.com root folder, so that the URL shows up correctly as
mysite.com/page.html when a search is performed. I have tried this and it
works well, but fetching is very slow, and I would prefer to crawl the files
using the file protocol, which appears to be much faster.

Thank you for any advice and help.

Regards

taknev





--
Sent from the Nutch - User mailing list archive at Nabble.com.