Hi,
I got the following message in the log file while crawling an XML URL:
2008-09-27 16:06:20,920 WARN parse.ParserFactory - ParserFactory:Plugin:
org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
parse-plugins.xml, but its plugin.xml file does not claim to support
contentType: text/xml
Please let me know if you have any idea what causes this.
-Chetan
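
For context, the warning above points at two files that disagree with each other. The fragments below are only a rough sketch of what the relevant entries might look like, assuming a stock Nutch 0.9 layout; the file locations and the contentType value shown are assumptions, so check them against your own installation.

```xml
<!-- Fragment 1, assumed location conf/parse-plugins.xml:
     maps a content type to the parser plugins that may handle it.
     Per the warning, text/xml is mapped to the RSS parser here. -->
<mimeType name="text/xml">
  <plugin id="parse-rss" />
</mimeType>

<!-- Fragment 2, assumed location plugins/parse-rss/plugin.xml:
     the content type the plugin itself claims to support.
     The value below is an assumption; per the warning, text/xml
     is not among the claimed types. -->
<implementation id="org.apache.nutch.parse.rss.RSSParser"
                class="org.apache.nutch.parse.rss.RSSParser">
  <parameter name="contentType" value="application/rss+xml"/>
</implementation>
```

Comparing these two entries in your own installation should show where the mismatch reported by ParserFactory comes from.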
Chetan Patel wrote:
>
> Hi,
>
> Thanks for the help.
>
> I have already added parse-rss to plugin.includes,
> but I am still getting only the root URL.
>
> Regards,
> Chetan Patel
>
>
> Edward Quick wrote:
>>
>>
>> Chetan,
>>
>> Try adding parse-rss to plugin.includes in nutch-site.xml. Here's mine:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>   <description></description>
>> </property>
>>
>>
>> Ed.
>>
>>
>>> Date: Sat, 27 Sep 2008 01:30:43 -0700
>>> From: chetan@(protected)
>>> To: nutch-user@(protected)
>>> Subject: crawl xml url using nutch-0.9
>>>
>>>
>>> Hi All,
>>>
>>> I have tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml)
>>> using depth 2.
>>>
>>> But it crawls only the root URL.
>>>
>>> Please help me crawl the root URL as well as all the sub-URLs of the root URL.
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Chetan Patel
>>> --
>>> View this message in context:
>>> http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>
>
>