Hey Ryan,
There's something else that needs to be set as well, sorry I forgot about it. plugin.includes (in nutch-site.xml) has to list protocol-file:
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
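
As a rough sanity check (my own sketch, not something from the Nutch docs): as far as I understand it, plugin.includes is treated as a regular expression that a plugin id has to match in full, so the value above can be tried against a few ids. The helper name is_included is hypothetical, and this assumes Python's regex syntax behaves like Java's for a pattern this simple.

```python
import re

# Value copied from the <value> element above.
PLUGIN_INCLUDES = (
    "protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Hypothetical helper: True if the whole plugin id matches the pattern."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # protocol-file should now be accepted; ids not listed should not be.
    for plugin_id in ("protocol-file", "protocol-http", "protocol-ftp", "parse-pdf"):
        print(plugin_id, is_included(plugin_id))
```

If protocol-file does not come back as included for the value you actually have in your config, that would explain file: URLs being dropped.
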
Hope this helps!
W
>Hello,
>I tried what Winton said. I generated a file with all the file:///x/y/z
>URLs, but Nutch won't inject any of them into the crawldb.
>I even set crawl-urlfilter.txt to allow everything:
>+.
>It seems like ./bin/nutch crawl is reading the file, but it's finding 0
>URLs to fetch. I tested this with http:// links and they do get injected.
>Is there a plugin or something I can modify to allow file URLs to be
>injected into the crawldb?
>Thank you.
>-Ryan
>
>On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <wdavies@(protected)> wrote:
>
>> Ryan,
>>
>> You can generate a file of FILE URLs, e.g.:
>>
>> file:///x/y/z/file1.html
>> file:///x/y/z/file2.html
>>
>> Use find and AWK accordingly to generate this. Put it in the url directory,
>> set depth to 1, and change crawl-urlfilter.txt to admit
>> file:///x/y/z/. (Note: if you don't head-qualify it, it will apparently try
>> to index directories above the base one by using ../ notation. I only read
>> this, haven't tried it.)
>>
>> Then just run the intranet crawl example.
>>
>> NOTE: this will NOT (as far as I can see, no matter how much tweaking) use
>> ANCHOR TEXT or PageRank (OPIC version) for any links in these files. The
>> ONLY way to do this is to use a webserver, as far as I can tell. I don't
>> understand the logic, but there you are. Note that if you use a webserver,
>> you will have to disable the IGNORE.INTERNAL setting
>> (db.ignore.internal.links) in nutch-site.xml (you'll be messing around a
>> lot in there).
>>
>> Cheers,
>> Winton
>>
>>
>>
>>
>> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>>
>>> Is there a simple way to have nutch index a folder full of other folders
>>> and html files?
>>>
>>> I was hoping to avoid having to run apache to serve the html files, and
>>> then have nutch crawl the site on apache.
>>>
>>> Thank you,
>>> -Ryan
>>>
>>
>>
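
On the "use find and AWK" step quoted above: in case it helps, here is a rough Python equivalent for building the list of file: URLs. It is only a sketch under some assumptions: the function names file_urls and write_seed_file, the seed file path, and the choice to pick up only .html/.htm files are all mine, not anything Nutch requires.

```python
from pathlib import Path


def file_urls(root, suffixes=(".html", ".htm")):
    """Return sorted file:// URLs for every matching file under root."""
    root_path = Path(root).resolve()  # as_uri() needs an absolute path
    return sorted(
        p.as_uri()
        for p in root_path.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


def write_seed_file(root, seed_file):
    """Write one URL per line to seed_file and return how many were written."""
    urls = file_urls(root)
    Path(seed_file).write_text("".join(u + "\n" for u in urls), encoding="utf-8")
    return len(urls)


if __name__ == "__main__":
    # Hypothetical paths: /x/y/z is the folder from the thread,
    # urls/seed.txt stands in for a file in the url directory
    # (the urls directory itself has to exist already).
    print(write_seed_file("/x/y/z", "urls/seed.txt"), "URLs written")
```

Path.as_uri() percent-encodes spaces and other special characters in file names, so the generated lines may not look exactly like the raw paths.
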