Hi Ryan,
I just used the regular intranet crawl; I didn't try to do the inject.
W
At 6:16 PM -0400 7/5/08, Ryan Smith wrote:
>Winton,
>I added the override property to nutch-site.xml (I saw the one in
>nutch-default.xml after your email), but still no URLs are being added to the
>crawldb.
>Can you verify this by trying to inject file urls into a test crawl db?
>Any other ideas?
>
>-Ryan
>
>On Sat, Jul 5, 2008 at 5:47 PM, Winton Davies <wdavies@(protected)>
>wrote:
>
>> Hey Ryan,
>>
>> There's something else that needs to be set as well. Sorry, I forgot about
>> it.
>>
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>>
>> Hope this helps!
>>
>> W
>>
>>
>>
>> Hello,
>>> I tried what Winton said. I generated a file with all the file:///x/y/z
>>> URLs, but Nutch won't inject any into the crawldb.
>>> I even set crawl-urlfilter.txt to allow everything:
>>> +.
>>> It seems like ./bin/nutch crawl is reading the file, but it's finding 0
>>> URLs to fetch. I tested this on http:// links and they get injected.
>>> Is there a plugin or something I can modify to allow file URLs to be
>>> injected into the crawldb?
>>> Thank you.
>>> -Ryan
>>>
>>> On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <wdavies@(protected)>
>>> wrote:
>>>
>>> Ryan,
>>>>
>>>> You can generate a file of FILE URLs (e.g.):
>>>
>>>>
>>>> file:///x/y/z/file1.html
>>>> file:///x/y/z/file2.html
>>>>
>>>> Use find and awk to generate this. Put it in the url directory,
>>>> set depth to 1, and change crawl-urlfilter.txt to admit
>>>> file:///x/y/z/ (note: if you don't head-qualify it, it will apparently
>>>> try to index directories above the base one by using ../ notation. I only
>>>> read this; I haven't tried it).
>>>>
>>>> Then just do the intranet crawl example.
>>>>
>>>> NOTE: this will NOT (as far as I can see, no matter how much tweaking)
>>>> use ANCHOR TEXT or PageRank (OPIC version) for any links in these files. The
>>>> ONLY way to do this is to use a webserver, as far as I can tell. I don't
>>>> understand the logic, but there you are. Note: if you use a webserver,
>>>> be aware you will have to disable the IGNORE.INTERNAL setting in nutch-site.xml
>>>> (you'll be messing around a lot in here).
>>>>
>>>> Cheers,
>>>> Winton
>>>>
>>>>
>>>>
>>>>
>>>> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>>>>
>>>>> Is there a simple way to have nutch index a folder full of other
>>>>> folders and html files?
>>>>>
>>>>> I was hoping to avoid having to run apache to serve the html files, and
>>>>> then have nutch crawl the site on apache.
>>>>>
>>>>> Thank you,
>>>>> -Ryan
>>>>>
>>>>>
>>>>
>>>>
>>
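
The "use find and awk" step from Winton's first reply could be sketched roughly as below. This is only an illustration, not something posted in the thread: the function name make_file_urls and the seed file name are hypothetical, the base directory is assumed to be an absolute path, and file names are assumed to contain no spaces or characters that would need URL-encoding.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print one file:// URL per
# .html file found under the given directory.
# Assumptions: $1 is an absolute path; file names need no URL-encoding.
make_file_urls() {
    base_dir="$1"
    # find prints absolute paths because base_dir is absolute;
    # prefixing "file://" to "/x/y/z/file1.html" yields "file:///x/y/z/file1.html".
    find "$base_dir" -type f -name '*.html' | sort | awk '{ print "file://" $0 }'
}
```

For example, `make_file_urls /x/y/z > urls/seed.txt` would write the list into a urls directory (seed.txt is an arbitrary name). The crawl itself would then presumably be started with something like `bin/nutch crawl urls -dir crawl -depth 1`, with protocol-file enabled in plugin.includes as described above.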