RE: crawl xml url using nutch-0.9

Edward Quick

2008-09-27


>
>
> Hi,
>
> I got the following message in the log file while crawling the XML URL.
>
> 2008-09-27 16:06:20,920 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
>
> Please help me if you have any idea.

Possibly a problem with the content type. For RSS files, I think the content type is supposed to be application/rss+xml, not the text/xml shown in the warning.
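
A rough sketch of the two entries the warning is comparing, assuming the usual layout of conf/parse-plugins.xml and plugins/parse-rss/plugin.xml. The values shown are assumptions for illustration and have not been checked against a clean nutch-0.9 install, so compare them with your own copies:

```xml
<!-- Fragment of conf/parse-plugins.xml: maps a content type to the
     parse plugins to try for it. -->
<mimeType name="text/xml">
  <plugin id="parse-rss" />
</mimeType>

<!-- Fragment of plugins/parse-rss/plugin.xml: the content type the
     parser itself declares (assumed value; check your copy). -->
<implementation id="org.apache.nutch.parse.rss.RSSParser"
                class="org.apache.nutch.parse.rss.RSSParser">
  <parameter name="contentType" value="application/rss+xml"/>
</implementation>
```

The warning is logged when the type in the first fragment (text/xml here) is not the one the plugin declares in the second.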


>
> -Chetan
>
>
>
> Chetan Patel wrote:
> >
> > Hi,
> >
> > Thanks for help.
> >
> > I have already added this to plugin.includes,
> >
> > and I am still getting only the root URL.
> >
> > Regards,
> > Chetan Patel
> >
> >
> > Edward Quick wrote:
> >>
> >>
> >> Chetan,
> >>
> >> Try adding parse-rss in nutch-site.xml. Here's mine:
> >>
> >> <property>
> >>  <name>plugin.includes</name>
> >>
> >> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>  <description></description>
> >> </property>
> >>
> >>
> >> Ed.
> >>
> >>
> >>> Date: Sat, 27 Sep 2008 01:30:43 -0700
> >>> From: chetan@(protected)
> >>> To: nutch-user@(protected)
> >>> Subject: crawl xml url using nutch-0.9
> >>>
> >>>
> >>> Hi All,
> >>>
> >>> I have tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml)
> >>> using depth 2.
> >>>
> >>> But it crawls only the root URL.
> >>>
> >>> Please help me crawl the root URL as well as all the sub-URLs of the root URL.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Regards,
> >>> Chetan Patel
> >>> --
> >>> View this message in context:
> >>> http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
> >>> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>>
> >>
> >>
> >
> >
>
> --
> View this message in context: http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19701619.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
