Java Mailing List Archive

http://www.java2.5341.com/


RE: How do I crawl a site with a cookie for authentication?

Patrick Markiewicz

2008-10-01

Can you retrieve a resource using a URL of this form?
http://username:password@(protected)

If that works, you could temporarily create a "nutchuser" account on the site (with as few permissions as possible), crawl the intranet site, and then disable the account. Then edit the Nutch search page to strip the "nutchusername:nutchpassword@" prefix out of each result URL.
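To make the URL-stripping step concrete, here is a minimal sketch of a helper the search page could call on each result URL before displaying it. The class name UrlCredentials, the method name strip, and the host intranet.example are all made up for illustration; nothing here is part of Nutch itself, and how you hook it into the search page depends on your setup.

```java
// Hypothetical helper, not part of Nutch: removes "user:password@" from a URL.
final class UrlCredentials {
    private UrlCredentials() {}

    static String strip(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // not an absolute URL, leave it alone
        }
        int authorityStart = schemeEnd + 3;

        // The authority (userinfo@host:port) ends at the first '/', '?' or '#'.
        int authorityEnd = url.length();
        for (int i = authorityStart; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                authorityEnd = i;
                break;
            }
        }

        // Only an '@' inside the authority marks credentials;
        // an '@' later in the path or query is left untouched.
        int at = url.lastIndexOf('@', authorityEnd - 1);
        if (at < authorityStart) {
            return url;
        }
        return url.substring(0, authorityStart) + url.substring(at + 1);
    }
}
```

For example, UrlCredentials.strip("http://nutchusername:nutchpassword@intranet.example/docs/a.html") gives back "http://intranet.example/docs/a.html", while a URL without credentials comes back unchanged.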

I'm not sure what level of authority you have over the intranet site. You could do a similar trick by crawling the local filesystem of that site, and then having the search page rewrite each URL, replacing the filesystem path with a URL path that would work for a logged-in user.
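The path rewriting for the filesystem approach could look roughly like the sketch below. It assumes the crawled entries show up as URLs beginning with a known document root such as file:/var/www/intranet; that root, the host intranet.example, and the names PathToUrl and toSiteUrl are all placeholders I made up, so check what your index actually contains before relying on it.

```java
// Hypothetical helper, not part of Nutch: maps a crawled file URL to a site URL.
final class PathToUrl {
    private PathToUrl() {}

    static String toSiteUrl(String fileUrl, String docRoot, String siteBase) {
        // Normalise both prefixes to end in '/' so that "/var/www/intranet"
        // does not accidentally match "/var/www/intranet2".
        String root = docRoot.endsWith("/") ? docRoot : docRoot + "/";
        if (!fileUrl.startsWith(root)) {
            return fileUrl; // outside the document root, leave unchanged
        }
        String base = siteBase.endsWith("/") ? siteBase : siteBase + "/";
        return base + fileUrl.substring(root.length());
    }
}
```

With those placeholder values, PathToUrl.toSiteUrl("file:/var/www/intranet/docs/a.html", "file:/var/www/intranet", "http://intranet.example") returns "http://intranet.example/docs/a.html". This only works if the site's URL layout mirrors its directory layout; dynamically generated pages would not be covered.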

If you only have your own account and can't change anything else, then you might be able to use JMeter to add the cookie and have Nutch use JMeter as a proxy. I have never done this, so I don't actually know whether JMeter can add a cookie to a request made by an application that it proxies.

-----Original Message-----
From: Doğacan Güney [mailto:dogacan@(protected)]
Sent: Wednesday, October 01, 2008 10:08 AM
To: nutch-user@(protected)
Subject: Re: How do I crawl a site with a cookie for authentication?

On Wed, Oct 1, 2008 at 4:35 PM, Yoav Shapira <yoavs@(protected)> wrote:
> Hi,
>
> I would like to use Nutch to crawl and index an intranet web site for
> internal use. The site requires authentication, and stores the
> credentials in a cookie. I've got a valid login and I have the cookie
> saved, no problem. How do I tell Nutch to use it?
>
> I did some research online before asking, but unfortunately I couldn't
> find a step-by-step answer for a newbie like myself. I see there's an
> http-client plugin that can support some authentication. Is that what
> I should use for cookies? If so, how do I configure it?
>
> Or is there something else I should be doing? If the documentation /
> answer exists, sorry for the hassle and please just point me to it ;)
>

Unfortunately, nutch doesn't have such a feature yet. (One of the problems
is that we do not have a place to store cookies in a distributed setup)

> --
> Thanks,
>
> Yoav
>



--
Doğacan Güney