Java Mailing List Archive


Re: Re: use crawl command to fetch arbitrary pages?

ywang

2008-04-23


Hi Hilkiah G. Lavinier,

  Your second point, "also, set db.ignore.external.links to false which allows nutch to fetch pages outside of initial injected list (i.e. domains)", exactly answers my question, which was how to use the ./bin/nutch crawl command to fetch pages that are not restricted to the initial domains. In other words, the spider can leave the initial domains listed in urls/xx.txt.

  Thanks, and have a good day

Yong


2008-04-24



ywang



From: Hilkiah Lavinier
Sent: 2008-04-23 21:40:05
To: nutch-user@(protected)
Cc:
Subject: Re: use crawl command to fetch arbitrary pages?

Ywang,
Not sure what you mean by arbitrary; maybe you need to be a bit more specific here.
However, if you are trying to do a web crawl, here's some advice:
- consider using urlfilter-suffix instead of one of the regex filters.
- also, set db.ignore.external.links to false, which allows Nutch to fetch pages outside of the initial injected list (i.e. domains)
- lastly, since you must have a starting point, create an inject list that allows you to fetch the desired pages
There is a crawl script available on the Nutch wiki which you can use instead of ./bin/nutch crawl.
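For reference, the db.ignore.external.links setting mentioned above could look roughly like the fragment below. This is only a sketch: it assumes the override goes in conf/nutch-site.xml (the usual place for local overrides), and the description text is my own paraphrase, not copied from the Nutch distribution.

```xml
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml (local overrides). -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <!-- false: outlinks that lead to other hosts are kept, so the crawl
         can move beyond the hosts in the injected seed list -->
    <value>false</value>
    <description>Paraphrased: when true, outlinks pointing to a different
    host than the page they were found on are dropped.</description>
  </property>
</configuration>
```

Note that the URL filters still apply on top of this setting, so a page on another host is only fetched if the active filter rules also accept its URL.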
Regards,

Hilkiah G. Lavinier MEng (Hons), ACGI
6 Winston Lane,
Goodwill,
Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: hilkiah@(protected)
Email: hilkiah.lavinier@(protected)
IM: Yahoo hilkiah / MSN hilkiahlavinier@(protected)
IM: ICQ #8978201 / AOL hilkiah21
----- Original Message ----
From: ywang <ywang@(protected)>
To: nutch-user@(protected)
Sent: Saturday, April 19, 2008 10:32:17 AM
Subject: use crawl command to fetch arbitrary pages?
Dear all,
How can I use the crawl command to fetch arbitrary pages, without being restricted to a domain defined in crawl-urlfilter.txt?
I tried deleting or commenting out that domain property, but the shell gives me an error like "No urls to fetch - check your seed list and URL filters."
Oh, and in addition, the crawl command works well when the domain property is set.
Cheers
Yong
2008-04-19
ywang