Java Mailing List Archive

http://www.java2.5341.com/

Re: Extensive web crawl

ogjunk-nutch

2008-10-20

Axel, how did this go? I'd love to know if you got to 1B.



Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Webmaster <webmaster@(protected)>
> To: nutch-user@(protected)
> Sent: Tuesday, October 7, 2008 1:13:29 AM
> Subject: Extensive web crawl
>
> Ok..
>
> So I want to index the web.. All of it..
>
> Any thoughts on how to automate this so I can just point the spider off on
> its merry way and have it return 20 billion pages?
>
> So far I've been injecting random portions of DMOZ mixed with other URLs
> like directory.yahoo.com and wiki.org. I was hoping this would give me a
> good return with an unrestricted URL filter where MY.DOMAIN.COM was replaced
> with *.* -- Perhaps this is my error and that should be left as is and the
> last line should be +. instead of -. ?
>
> Anyhow, after injecting 2000 URLs and a few of my own I still only get back
> minimal results in the range of 500 to 600k URLs.
>
> Right now I have a new crawl going with 1 million injected URLs from
> DMOZ. I'm thinking that this should return a 20 million page index at
> least.. No?
>
> Anyhow.. I have more HD space on the way and would like to get the indexing
> up to 1 billion by the end of the week..
>
> Any examples on how to set up the url-filter.txt and regex-filter.txt would
> be helpful..
>
> Thanks..
>
> Axel..
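
On the question above about whether the last filter line should be +. or -. : as far as I recall, Nutch's regex URL filter reads its rules top to bottom, the first rule whose pattern is found in the URL decides (+ accepts, - rejects), and a URL that matches no rule is dropped. The files Axel mentions are probably conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt. The Python below is only a rough model of that behaviour for experimenting with rule order, not Nutch's own code. The rule lines are assumptions modelled on the sample filter files, the function names are made up for this sketch, and Python's re is not identical to Java's regex engine.

```python
import re

def parse_rules(text):
    """Parse filter lines: '+regex' accepts, '-regex' rejects, '#' starts a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found anywhere in the URL decides.
    A URL that matches no rule is rejected (assumed to mirror Nutch)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# Assumed single-site style rules: only MY.DOMAIN.COM passes, "-." drops the rest.
SINGLE_SITE = r"""
-^(file|ftp|mailto):
-[?*!@=]
+^http://([a-z0-9]*\.)*MY.DOMAIN.COM/
-.
"""

# Assumed whole-web style rules: the host rule is removed and the last line is "+.".
WHOLE_WEB = r"""
-^(file|ftp|mailto):
-[?*!@=]
+.
"""

if __name__ == "__main__":
    for name, text in (("single-site", SINGLE_SITE), ("whole-web", WHOLE_WEB)):
        rules = parse_rules(text)
        for url in ("http://www.example.org/", "http://example.org/page?id=1"):
            print(name, url, accepts(rules, url))
```

In this model, replacing MY.DOMAIN.COM with *.* is not needed for an unrestricted crawl: *.* is a shell-style wildcard rather than a regular expression, and a final +. already accepts everything the earlier rules did not reject. Note that the -[?*!@=] line still drops URLs with query strings in both rule sets, which cuts the number of fetchable pages considerably.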

©2008 java2.5341.com - Jax Systems, LLC, U.S.A.