Re: Indexing static html files

Winton Davies

2008-07-05

Not without modifying the code. I don't think it respects <BASE>, for
example, if you crawl it as file:///.
Frankly, if you can, just serve it through DOCROOT - it will be less
painful in the end!
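
If the files do end up served from a web server, the existing file:/// seed list can be turned into an http:// seed list for a fresh crawl. A minimal sketch, assuming /x/y/z is the document root and http://somesite.com/ is the host serving it (both taken from the question below); the function name to_http_urls is made up for illustration. Note this only rewrites the seed URLs, it does not change the prefix stored in an already-built index.

```shell
#!/bin/sh
# Hypothetical helper: map file:/// seed URLs to the http:// URLs they would
# have once the same tree is served from a web server's document root.
# Assumption: /x/y/z is the document root and http://somesite.com/ the host.
to_http_urls() {
    # Reads URLs on stdin, one per line, and rewrites only the leading prefix.
    sed 's|^file:///x/y/z/|http://somesite.com/|'
}

# Example run on two seed URLs.
printf '%s\n' 'file:///x/y/z/file1.html' 'file:///x/y/z/sub/file2.html' | to_http_urls
```

Lines that do not start with file:///x/y/z/ pass through unchanged, so a mixed seed list is left intact.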

- Serving URL - You can change it if you know how to set up Tomcat.

Winton




>Hi Winton,
>I found my problem. I was only editing crawl-urlfilter.txt and not
>regex-urlfilter.txt.
>
>Thanks for the help.
>
>I have 2 questions:
>
>After I crawl my files, they will be indexed with file:///x/y/z/.......
>Is there any chance I can easily change the link prefix to
>http://somesite.com/ ?
>
>And I noticed from the tutorial that I only get one path for Nutch to serve
>searches from?
>http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
>
>d.    Set Your Searcher Directory
>
>Next, navigate to your nutch webapp folder then WEB-INF/classes. Edit the
>nutch-site.xml file and add the following to it (make sure you don't have
>two sets of <configuration></configuration> tags!):
>
><configuration>
>
>  <property>
>
>   <name>searcher.dir</name>
>
>   <value>your_crawl_folder_here</value>
>
>  </property>
>
></configuration>
>
>
>Can I have Nutch search multiple crawl folders?
>
>Thanks again,
>
>-Ryan
>
>On Sat, Jul 5, 2008 at 7:17 PM, Winton Davies <wdavies@(protected)>
>wrote:
>
>> Hi Ryan,
>>
>> I just used the regular intranet crawl; I didn't try to do the inject.
>>
>> W
>>
>>
>> At 6:16 PM -0400 7/5/08, Ryan Smith wrote:
>>
>>> Winton,
>>> I added the override property to nutch-site.xml ( i saw the one in
>>> nutch-default.xml after your email ) , still no urls being added to the
>>> crawldb.
>>> Can you verify this by trying to inject file urls into a test crawl db?
>>> Any other ideas?
>>>
>>> -Ryan
>>>
>>> On Sat, Jul 5, 2008 at 5:47 PM, Winton Davies <wdavies@(protected)>
>>> wrote:
>>>
>>>>  Hey Ryan,
>>>>
>>>>  There's something else that needs to be set as well - sorry I forgot about
>>>>  it.
>>>>
>>>>
>>>>  <property>
>>>>  <name>plugin.includes</name>
>>>>  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>  </property>
>>>>
>>>>
>>>>  Hope this helps!
>>>>
>>>>  W
>>>>
>>>>
>>>>
>>>>  Hello,
>>>>
>>>>>  I tried what Winton said. I generated a file with all the file:///x/y/z
>>>>>  urls, but nutch won't inject any into the crawldb.
>>>>>  I even set the crawl-urlfilter.txt to allow everything:
>>>>>  +.
>>>>>  It seems like ./bin/nutch crawl is reading the file, but it's finding 0
>>>>>  urls to fetch. I tested this on http:// links and they get injected.
>>>>>  Is there a plugin or something I can modify to allow file urls to be
>>>>>  injected into the crawldb?
>>>>>  Thank you.
>>>>>  -Ryan
>>>>>
>>>>>  On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <wdavies@(protected)>
>>>>>  wrote:
>>>>>
>>>>>   Ryan,
>>>>>
>>>>>>
>>>>>>  You can generate a file of file: URLs, e.g.:
>>>>>>
>>>>>>  file:///x/y/z/file1.html
>>>>>>  file:///x/y/z/file2.html
>>>>>>
>>>>>>  Use find and awk accordingly to generate this. Put it in the url
>>>>>>  directory, set depth to 1, and change crawl-urlfilter.txt to admit
>>>>>>  file:///x/y/z/. (Note: if you don't head qualify it, it will apparently
>>>>>>  try to index directories above the base one, by using ../ notation. I
>>>>>>  only read this, haven't tried it.)
>>>>>>
>>>>>>  Then just do the intranet crawl example.
>>>>>>
>>>>>>  NOTE: this will NOT (as far as I can see, no matter how much tweaking)
>>>>>>  use ANCHOR TEXT or PageRank (OPIC version) for any links in these files.
>>>>>>  The ONLY way to do this is to use a webserver, as far as I can tell. I
>>>>>>  don't understand the logic, but there you are. Note, if you use a
>>>>>>  webserver, be aware you will have to disable the IGNORE.INTERNAL setting
>>>>>>  in nutch-site.xml (you'll be messing around a lot in here).
>>>>>>
>>>>>>  Cheers,
>>>>>>  Winton
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>  At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>>>>>>
>>>>>>>  Is there a simple way to have nutch index a folder full of other
>>>>>>>  folders and html files?
>>>>>>>
>>>>>>>  I was hoping to avoid having to run apache to serve the html files,
>>>>>>>  and then have nutch crawl the site on apache.
>>>>>>>
>>>>>>>  Thank you,
>>>>>>>  -Ryan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>
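
The "use find and awk" step from the first reply in the thread could look roughly like the following. This is only a sketch under stated assumptions: the directory is passed as an absolute path, file names contain no newlines or characters that would need percent-encoding, and only *.html and *.htm files are wanted. The function name list_file_urls and the output file urls/seed.txt are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one file:// URL per HTML file under a directory.
# Assumption: $1 is an absolute path such as /x/y/z, so prefixing "file://"
# yields URLs of the form file:///x/y/z/file1.html.
list_file_urls() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) \
        | sort \
        | awk '{ print "file://" $0 }'
}

# Example (not executed here): write the seed list into the url directory.
# list_file_urls /x/y/z > urls/seed.txt
```

Because find is given an absolute path, every URL it produces starts with the base directory, which fits a urlfilter rule that only admits file:///x/y/z/.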
