Re: How do I crawl a site with a cookie for authentication?

Yoav Shapira

2008-10-01

Hi Patrick,

Thanks for your help. I'll dig around a bit more, try the proxy
thing, maybe try the database approach, and see how it goes. Much
appreciated,

Yoav

On Wed, Oct 1, 2008 at 1:14 PM, Patrick Markiewicz
<pmarkiewicz@(protected)> wrote:
> Hi Yoav,
>     If the content is dynamic, presumably it is stored in a
> database? I was just thinking that it might be easier to use some
> database utilities to index the information.
>
>     Do you know how to use JMeter to record the requests that a web
> browser makes? The browser uses a particular port as a proxy. I know
> that the JMeter cookie manager can save the cookies that are gathered as
> part of the request.
>     I'm pretty sure that nutch can use a proxy.
> http://wiki.apache.org/nutch/SetupProxyForNutch
>
> According to this page here:
> http://jakarta.apache.org/jmeter/usermanual/component_reference.html#HTTP_Cookie_Manager
> you can manually add a cookie that will be used by all threads. I am
> guessing that if you set up JMeter to act as a proxy, the proxy thread
> would be one of the threads that sends the cookie.
>
> If the proxy thread cannot have cookies added manually, then this
> strategy won't work.
>
> Patrick
>
> -----Original Message-----
> From: yoavshapira@(protected)
> Yoav Shapira
> Sent: Wednesday, October 01, 2008 11:47 AM
> To: nutch-user@(protected)
> Subject: Re: How do I crawl a site with a cookie for authentication?
>
> Patrick,
> Thank you for the answers. More below:
>
> 2008/10/1 Patrick Markiewicz <pmarkiewicz@(protected)>:
>> Is it possible for you to retrieve a resource by using the url:
>> http://username:password@(protected)
>
> The system does not support HTTP Basic authentication at this time,
> unfortunately.
>
>> I'm not sure what level of authority you have with the intranet site.
>> You could do a similar trick by crawling the local filesystem of that
>> site, and then just having the search page edit
>
> The site is dynamically generated. There are no meaningful static
> files on the file system.
>
>> If you only have your own account, and can't change any other things,
>> then you might be able to use JMeter to add a cookie and have nutch use
>> JMeter as a proxy. I have never
>
> This is very intriguing. How would I get started on this? I've used
> JMeter in the past for simple test plans, but never as an HTTP proxy.
>
> Yoav
>



--
Thanks,

Yoav
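
For anyone following the thread, the two ideas discussed above (sending requests through an HTTP proxy, and attaching an authentication cookie by hand) can be sketched with nothing but the JDK. This is only an illustration under assumptions: it uses neither the Nutch nor the JMeter API, and the class name CookieProxyFetch, the cookie name JSESSIONID, and its value are hypothetical placeholders for whatever the real site issues after login.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CookieProxyFetch {

    /** Joins cookie name/value pairs into one Cookie request header value. */
    static String cookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            joiner.add(e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    /** Prepares (but does not yet send) a GET request routed through an HTTP proxy. */
    static HttpURLConnection prepare(String url, String proxyHost, int proxyPort,
                                     Map<String, String> cookies) throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                new InetSocketAddress(proxyHost, proxyPort));
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
        conn.setRequestMethod("GET");
        // Attach the session cookie manually, as a browser would after login.
        conn.setRequestProperty("Cookie", cookieHeader(cookies));
        return conn;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123"); // hypothetical session cookie
        System.out.println("Cookie: " + cookieHeader(cookies));

        // Only touch the network when a URL, proxy host, and proxy port are given.
        if (args.length == 3) {
            HttpURLConnection conn =
                    prepare(args[0], args[1], Integer.parseInt(args[2]), cookies);
            System.out.println("HTTP status: " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}
```

Run with three arguments (target URL, proxy host, proxy port) to send a real request; with no arguments it only prints the header it would send. Whether a JMeter proxy would add the cookie on the crawler's behalf, as guessed in the thread, is a separate question that this sketch does not answer.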