nutch-user.lucene

Re: Indexing static html files

Winton Davies

2008-07-03

Ryan,

You can generate a file of file: URLs, e.g.:

file:///x/y/z/file1.html
file:///x/y/z/file2.html
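
A list like this can also be produced with a short script instead of shell tools. The following is a minimal Python sketch, not part of Nutch; the function names and the seed file path are hypothetical, and it assumes only files ending in .html should be listed.

```python
from pathlib import Path


def file_urls(base_dir):
    """Return sorted file:// URLs for every .html file under base_dir."""
    # resolve() makes the path absolute, which as_uri() requires.
    base = Path(base_dir).resolve()
    return sorted(p.as_uri() for p in base.rglob("*.html") if p.is_file())


def write_seed_list(base_dir, seed_file):
    """Write one URL per line to seed_file (hypothetical name) and return the URLs."""
    urls = file_urls(base_dir)
    Path(seed_file).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls
```

Calling write_seed_list("/x/y/z", "seeds.txt") would write one file:///x/y/z/... line per HTML file found; the resulting file then goes wherever the crawl expects its seed URLs.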

Use find and AWK to generate this. Put it in the url directory, set the
depth to 1, and change crawl-urlfilter.txt to admit file:///x/y/z/.
(Note: if you don't qualify the pattern with the full leading path, it
will apparently try to index directories above the base one by using ../
notation. I only read this; I haven't tried it.)
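
To see why the full leading path matters in the filter, here is a small Python sketch of how such a rule list might behave. It is an illustration only, not Nutch code: it assumes rules are tried in order, that '+' admits and '-' rejects, and that the first pattern found in the URL decides. The rule list and the accepts function are hypothetical.

```python
import re

# Rules in the style of a regex URL filter file: '+' admits, '-' rejects.
# Assumed for illustration; adjust the path to your own base directory.
RULES = [
    ("+", r"^file:///x/y/z/"),  # admit anything under the base directory
    ("-", r"."),                # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat as rejected
```

Under these assumptions, file:///x/y/z/file1.html is admitted, while the parent directory file:///x/y/ is rejected because it does not start with the full base path.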

Then just run the intranet crawl example.

NOTE: as far as I can see, no matter how much you tweak it, this will NOT
use anchor text or PageRank (the OPIC version) for any links in these
files. The only way to get that is to use a web server, as far as I can
tell. I don't understand the logic, but there you are. Note that if you
use a web server, you will have to disable the db.ignore.internal.links
setting in nutch-site.xml (you'll be messing around a lot in there).

Cheers,
Winton



At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>Is there a simple way to have nutch index a folder full of other folders and
>html files?
>
>I was hoping to avoid having to run apache to serve the html files, and then
>have nutch crawl the site on apache.
>
>Thank you,
>-Ryan
