RE: crawl xml url using nutch-0.9

Chetan Patel

2008-09-27

Hi,

Thanks for the help.

I have already added parse-rss to plugin.includes, but I am still getting
only the root URL.

Regards,
Chetan Patel


Edward Quick wrote:
>
>
> Chetan,
>
> Try adding parse-rss in nutch-site.xml. Here's mine:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
>
>
> Ed.
>
>
>> Date: Sat, 27 Sep 2008 01:30:43 -0700
>> From: chetan@(protected)
>> To: nutch-user@(protected)
>> Subject: crawl xml url using nutch-0.9
>>
>>
>> Hi All,
>>
>> I tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml) with
>> depth 2.
>>
>> But it crawls only the root URL.
>>
>> Please help me crawl the root URL as well as all sub-URLs of the root URL.
>>
>> Thanks in advance.
>>
>> Regards,
>> Chetan Patel
>> --
>> View this message in context:
>> http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>

--
Sent from the Nutch - User mailing list archive at Nabble.com.
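
A quick way to rule out the plugin list as the cause is to check that the plugin.includes value quoted above really enables parse-rss. The sketch below is not part of Nutch. It assumes that Nutch treats plugin.includes as a regular expression that must match a plugin's whole id, and that Python's re module behaves like Java's regex engine for a pattern this simple. The names NUTCH_SITE_FRAGMENT, plugin_includes and is_enabled are made up for illustration, and the <configuration> wrapper is added only so the fragment parses on its own.

```python
import re
import xml.etree.ElementTree as ET

# Edward's property, wrapped in a <configuration> root (assumed here so the
# fragment parses as a standalone document).
NUTCH_SITE_FRAGMENT = """
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
"""


def plugin_includes(xml_text):
    """Return the plugin.includes value from the XML text, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None


def is_enabled(plugin_id, includes_regex):
    """True if the whole plugin id matches the pattern (assumed semantics)."""
    return re.fullmatch(includes_regex, plugin_id) is not None


if __name__ == "__main__":
    pattern = plugin_includes(NUTCH_SITE_FRAGMENT)
    for plugin in ("parse-rss", "parse-html", "protocol-httpclient", "protocol-http"):
        print(plugin, is_enabled(plugin, pattern))
```

If parse-rss is reported as enabled for the value in your own nutch-site.xml, the plugin list is probably not what keeps the crawl at the root URL, and the cause likely lies elsewhere in the configuration.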
