Ok..
So I just had Nutch do a moderate crawl and created a 20GB link database.
How can I count the number of pages in this DB?
I have limited resources and the hadoop cluster I'm running on is quite
small at the moment. I'd like to index the whole web eventually so I need
to be able to calculate the average number of links per GB so I have an idea
as to how to distribute the resources over the small clusters I intend to
use for distributed searches.
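
Here is a rough sketch of the arithmetic I have in mind, not a definitive method. The 20GB figure is my actual database size; the link count and the target size are hypothetical placeholders until I know the real numbers, and the function names are made up for illustration.

```python
def links_per_gb(link_count: int, db_size_gb: float) -> float:
    """Average number of links stored per GB of database."""
    if db_size_gb <= 0:
        raise ValueError("db_size_gb must be positive")
    return link_count / db_size_gb


def gb_needed(target_links: int, density: float) -> float:
    """Disk space (GB) needed for target_links at a given links-per-GB density."""
    if density <= 0:
        raise ValueError("density must be positive")
    return target_links / density


def gb_per_cluster(total_gb: float, clusters: int) -> float:
    """Even split of the total storage across clusters."""
    if clusters <= 0:
        raise ValueError("clusters must be positive")
    return total_gb / clusters


if __name__ == "__main__":
    DB_SIZE_GB = 20.0               # from the crawl described above
    LINK_COUNT = 2_000_000          # hypothetical: the count I still need to obtain
    TARGET_LINKS = 1_000_000_000    # hypothetical target for a larger crawl

    density = links_per_gb(LINK_COUNT, DB_SIZE_GB)
    total = gb_needed(TARGET_LINKS, density)
    print("links per GB:", density)
    print("GB needed for target:", total)
    print("GB per cluster (3 clusters):", gb_per_cluster(total, 3))
```

This assumes the database grows linearly with the number of links, which may not hold once the crawl gets much larger.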
I've also found Nutch/Hadoop to work extremely well (Sept 29 nightly
build)..
Setup:
2x P3 800MHz 512MB RAM
1x Celeron 1.4GHz 512MB RAM
Total disk space 750GB
If this works well in my sandbox I hope to deploy 3 clusters of P4s with 6
machines/cluster and 10TB of disk space.