Java Mailing List Archive

http://www.java2.5341.com/

nutch-user.lucene

Just to save webpages (Newbie question)

wuwuengr@gmail.com

2008-10-08

I am new to Nutch.

My goal is to extract content (local listings) from a certain website. I have
already obtained the URLs of all the listings (only ~20K), and I have also
written a parser to pull out the contents (such as address and phone). All I
need now is to download the pages at those URLs.

But when I used a download tool to batch-download the URLs, I very quickly
started getting 404 responses in the downloaded pages.

Is there a way I can do this in Nutch? What is the risk of being blocked
again? I just want to fetch the URLs: no crawling, no indexing, just a plain
fetch that leaves the pages intact.
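
To make the "plain fetch" above concrete, here is a minimal sketch in Python
(not Nutch) of a sequential download that pauses between requests and saves
each page unmodified. Everything in it is an assumption for illustration: the
names fetch_url and fetch_all, the 5-second delay, the User-Agent string, and
the output file naming are all hypothetical, and a fixed delay may or may not
be enough to avoid being blocked by a given site.

```python
import time
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=30):
    """Download one URL and return the raw response body as bytes."""
    # The User-Agent value is a placeholder (an assumption); use one that
    # identifies you honestly to the site operator.
    req = urllib.request.Request(url, headers={"User-Agent": "listing-fetcher/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, out_dir, delay_seconds=5.0, fetch=fetch_url, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests.

    Each body is written unmodified to out_dir as NNNNN.html, where NNNNN is
    the position of the URL in the input list. Returns a list of
    (url, error message) pairs for requests that failed, so they can be
    retried later instead of being saved as bad pages.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        try:
            body = fetch(url)
        except Exception as exc:
            # urlopen raises an error on HTTP 404 and on network failures;
            # record the failure and move on to the next URL.
            failed.append((url, str(exc)))
            continue
        (out / f"{i:05d}.html").write_bytes(body)
    return failed
```

A call such as fetch_all(urls, "pages") would write the pages into a "pages"
directory and return the URLs that failed. The fetch and sleep parameters
exist only so that the loop can be exercised without touching the network.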