Home » nutch-user.lucene »

Re: Counting the links in the DB

Ronny

2008-10-05

This may be of use:
# bin/nutch readdb crawldir/crawldb -stats
It will give you the number of links in your webdb. I get approximately 2M links for 18GB ;-)
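To turn that into a links-per-GB figure, something along these lines might work. This is only a rough sketch: it assumes the -stats output contains a "TOTAL urls:" line, the sample text is made up for illustration from the 2M / 18GB numbers above, and the database size is typed in by hand rather than measured.

```shell
#!/bin/sh
# Hypothetical sample of `readdb -stats` output. The real output may
# differ between Nutch versions; paste or pipe in your own instead.
stats_output='Statistics for CrawlDb
TOTAL urls:	2000000
retry 0:	2000000'

# Size of the database on disk in GB (assumed value, entered by hand).
db_size_gb=18

# Take the last whitespace-separated field of the "TOTAL urls" line.
total_urls=$(printf '%s\n' "$stats_output" | awk '/TOTAL urls/ {print $NF}')

# Integer division is enough for a capacity estimate.
links_per_gb=$((total_urls / db_size_gb))

echo "total urls:   $total_urls"
echo "links per GB: $links_per_gb"
```

With these sample numbers it works out to about 111,111 links per GB, so 10TB (taken as 10,000GB) would hold very roughly 1.1 billion links, if the ratio stays the same as the crawl grows.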
Regards,
Ronny
Webmaster wrote:
> Ok..
>
> So I just had Nutch do a moderate crawl and created a 20GB link database.
>
> How can I count the number of pages in this DB?
>
> I have limited resources and the hadoop cluster I'm running on is quite
> small at the moment. I'd like to index the whole web eventually so I need
> to be able to calculate the average number of links per GB so I have an idea
> as to how to distribute the resources over the small clusters I intend to
> use for distributed searches.
>
> I've also found Nutch/Hadoop to work extremely well (Sept 29 nightly
> build)..
>
> Setup:
>
> 2x P3 800MHz 512MB RAM
> 1x Celeron 1.4GHz 512MB RAM
> Total disk space 750GB
>
> If this works well in my sandbox I hope to deploy 3 clusters of P4s with 6
> machines/cluster and 10TB of disk space.
>  


--
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
-> Ronald Muwonge
-> 'The M'
-> Africa's Search
-> www.mputa.com
-> www.africa.mputa.com
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
