List: nutch-user.lucene

Ignoring robots.txt

Vijay Krishnan

2008-05-23

Hi all,

  I wish to crawl a fixed set of URLs to depth 1 (without any
deeper crawling) for further analysis. I find that Nutch does not
fetch URLs that robots.txt does not permit it to access.
Is there some way to stop Nutch from consulting robots.txt? That
would make my job much easier than saving the web pages some
other way and then passing them through Nutch.


Thanks
--
Vijay Krishnan
http://www.cs.stanford.edu/~vijayk