nutch-user.lucene

Re: How do I crawl a site with a cookie for authentication?

Doğacan Güney

2008-10-01

On Wed, Oct 1, 2008 at 4:35 PM, Yoav Shapira <yoavs@(protected)> wrote:
> Hi,
>
> I would like to use Nutch to crawl and index an intranet web site for
> internal use. The site requires authentication, and stores the
> credentials in a cookie. I've got a valid login and I have the cookie
> saved, no problem. How do I tell Nutch to use it?
>
> I did some research online before asking, but unfortunately I couldn't
> find a step-by-step answer for a newbie like myself. I see there's an
> http-client plugin that can support some authentication. Is that what
> I should use for cookies? If so, how do I configure it?
>
> Or is there something else I should be doing? If the documentation /
> answer exists, sorry for the hassle and please just point me to it ;)
>

Unfortunately, Nutch doesn't have such a feature yet. (One of the problems
is that we do not have a place to store cookies in a distributed setup.)
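
For reference, here is a rough sketch of what "using a saved cookie" amounts
to at the HTTP level, written in plain Java outside of Nutch. It is not a
Nutch feature or a Nutch configuration. The class name, the cookie name
SESSIONID, its value, and the host intranet.example are all made up for
illustration; substitute whatever your site actually issues.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Nutch: attaches saved cookies to a request.
public class CookieFetchSketch {

    // Build the value of a "Cookie" request header: "name1=value1; name2=value2".
    static String cookieHeader(Map<String, String> cookies) {
        StringJoiner joined = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joined.add(entry.getKey() + "=" + entry.getValue());
        }
        return joined.toString();
    }

    // Prepare a connection that will carry the cookies.
    // Nothing is sent over the network until the caller reads the response.
    static HttpURLConnection openWithCookies(String url, Map<String, String> cookies)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Cookie", cookieHeader(cookies));
        return conn;
    }

    public static void main(String[] args) {
        // Assumed cookie name and value; a LinkedHashMap keeps insertion order.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("SESSIONID", "abc123");
        System.out.println("Cookie: " + cookieHeader(cookies));
        // To fetch a page (assumed host):
        //   openWithCookies("http://intranet.example/", cookies).getInputStream()
    }
}
```

A single process can do this easily because it holds the cookie in memory.
In a distributed crawl every fetcher task would need the same cookie, which
is the storage problem mentioned above.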

> --
> Thanks,
>
> Yoav
>



--
Doğacan Güney