List: nutch-user.lucene

Re: Indexing static html files

Ryan Smith

2008-07-05

Hello,
I tried what Winton suggested. I generated a file with all the
file:///x/y/z URLs, but Nutch won't inject any of them into the crawldb.
I even set crawl-urlfilter.txt to allow everything:
+.
It seems that ./bin/nutch crawl is reading the file, but it finds 0
URLs to fetch. I tested this with http:// links and they do get injected.
Is there a plugin or something I can modify to allow file URLs to be
injected into the crawldb?
Thank you.
-Ryan
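
One thing that may be worth checking (an assumption, not something confirmed
in this thread): the regex URL filter applies its rules in order, and the
first rule that matches a URL decides whether it is kept. The stock
crawl-urlfilter.txt ships with a rule that skips file:, ftp: and mailto:
URLs, so if that line is still present above "+.", file URLs would be dropped
before "+." is ever reached. The Python below is only a rough model of that
first-match behaviour, not Nutch's own code; parse_rules and accepts are
made-up names for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url.

    A URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical filter file that still has a "skip file:" rule above "+."
WITH_SKIP_RULE = "# skip file:, ftp:, & mailto: urls\n-^(file|ftp|mailto):\n+."
# Filter file that really contains only the allow-everything rule
ALLOW_ALL = "+."

if __name__ == "__main__":
    url = "file:///x/y/z/file1.html"
    print(accepts(parse_rules(WITH_SKIP_RULE), url))
    print(accepts(parse_rules(ALLOW_ALL), url))
```

If the first call reports a rejection and the second an acceptance, the order
of the rules in the filter file is the thing to look at.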

On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <wdavies@(protected)>
wrote:

> Ryan,
>
> You can generate a file of FILE urls (eg)
>
> file:///x/y/z/file1.html
> file:///x/y/z/file2.html
>
> Use find and awk to generate this. Put it in the url directory, set depth
> to 1, and change crawl-urlfilter.txt to admit file:///x/y/z/ (note: if you
> don't qualify it with the full leading path, it will apparently try to
> index directories above the base one, by using ../ notation; I have only
> read this, not tried it).
>
> Then just run the intranet crawl example.
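
The URL list described above can also be generated without find and awk. The
following is a minimal Python sketch, assuming the pages end in .html or .htm;
the function name file_urls and the suffix list are illustrative choices, not
anything Nutch requires.

```python
import os
from pathlib import Path


def file_urls(root, suffixes=(".html", ".htm")):
    """Return sorted file:// URLs for every matching file under root."""
    root_path = Path(root).resolve()  # as_uri() needs an absolute path
    urls = []
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            if name.lower().endswith(suffixes):
                # as_uri() yields file:///... and percent-encodes
                # characters such as spaces
                urls.append((Path(dirpath) / name).as_uri())
    return sorted(urls)
```

Calling file_urls("/x/y/z") and writing each returned URL on its own line to a
text file in the url directory gives a seed list of the form shown above.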
>
> NOTE: this will NOT (as far as I can see, no matter how much tweaking) use
> anchor text or PageRank (the OPIC version) for any links in these files.
> The only way to get that, as far as I can tell, is to use a web server. I
> don't understand the logic, but there you are. If you do use a web server,
> be aware that you will have to disable the ignore-internal-links setting
> (db.ignore.internal.links) in nutch-site.xml (you'll be messing around a
> lot in there).
>
> Cheers,
> Winton
>
> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>
>> Is there a simple way to have nutch index a folder full of other folders
>> and html files?
>>
>> I was hoping to avoid having to run apache to serve the html files, and
>> then have nutch crawl the site on apache.
>>
>> Thank you,
>> -Ryan
>>
>
>