I've been using Nutch to crawl a lot of news feeds, and I had to modify
my plugins file to handle a bunch of MIME types. Not many sites
follow the spec on which MIME type to use.
My parse-plugins.xml file has mime-type mappings for all of these:
text/html
text/plain
text/rss
text/xml
application/xml
application/rss+xml
application/atom+xml
application/xhtml+xml
application/octet-stream
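
For anyone wanting a starting point, here is a minimal sketch of what a
few of those mappings could look like in conf/parse-plugins.xml. The
parser chosen for each type is an assumption for illustration (the list
above only names the MIME types), and the alias block assumes the stock
parse-rss and parse-html plugins are enabled in plugin.includes:

```xml
<!-- Sketch only: the plugin chosen for each MIME type is an assumption. -->
<parse-plugins>

  <!-- Feeds are often served with generic XML types, so try parse-rss. -->
  <mimeType name="text/xml">
    <plugin id="parse-rss" />
  </mimeType>

  <mimeType name="application/xml">
    <plugin id="parse-rss" />
  </mimeType>

  <mimeType name="application/rss+xml">
    <plugin id="parse-rss" />
  </mimeType>

  <!-- Ordinary pages go to the HTML parser. -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Aliases tie each plugin id to the extension (parser class) id. -->
  <aliases>
    <alias name="parse-rss"
           extension-id="org.apache.nutch.parse.rss.RSSParser" />
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>

</parse-plugins>
```

The ParserFactory warning quoted below means the parse-rss plugin.xml
does not list text/xml in its contentType parameter. Adding the type
there should make the warning go away, though the exact parameter syntax
is worth checking against the plugin.xml shipped with your Nutch version.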
-dave
On Sep 27, 2008, at 6:44 AM, Chetan Patel wrote:
>
> Hi,
>
> I got the following message in the log file while crawling an XML URL.
>
> 2008-09-27 16:06:20,920 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
>
> Please help me if you have any idea.
>
> -Chetan
>
>
>
> Chetan Patel wrote:
>>
>> Hi,
>>
>> Thanks for help.
>>
>> I have already added this to plugin.includes, and I am still getting
>> only the root URL.
>>
>> Regards,
>> Chetan Patel
>>
>>
>> Edward Quick wrote:
>>>
>>>
>>> Chetan,
>>>
>>> Try adding parse-rss in nutch-site.xml. Here's mine:
>>>
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>   <description></description>
>>> </property>
>>>
>>>
>>> Ed.
>>>
>>>
>>>> Date: Sat, 27 Sep 2008 01:30:43 -0700
>>>> From: chetan@(protected)
>>>> To: nutch-user@(protected)
>>>> Subject: crawl xml url using nutch-0.9
>>>>
>>>>
>>>> Hi All,
>>>>
>>>> I have tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml)
>>>> using depth 2.
>>>>
>>>> But it crawls only the root URL.
>>>>
>>>> Please help me crawl the root URL as well as all the sub-URLs of
>>>> the root URL.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> Chetan Patel
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>
>>
>