Re: Hardware Specifications

ogjunk-nutch

2008-06-12

Hm, hm.

I can't speak for Nutch's search (don't have it running at the moment), but I am looking at a cluster that is running a fetch job and a generate job concurrently and I see both cores on the dual-core server being utilized about equally.
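
One way to check a claim like this on your own hardware is to run a purely CPU-bound job in a plain JVM, first on one thread and then on one thread per core, and watch per-core utilization (for example in top) while it runs. The sketch below is only an illustration and has nothing to do with Nutch's own code; the class name CoreUsageCheck, the spin loop, and the iteration count are all made up for the example. It uses only the standard java.util.concurrent API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Nutch: runs CPU-bound work on N threads
// so you can watch whether the JVM spreads it over several cores.
class CoreUsageCheck {

    // Pure CPU work: no I/O, no locks, nothing shared between threads.
    static long spin(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += i % 7;
        }
        return acc;
    }

    // Runs spin() once on each of `threads` worker threads and sums the results.
    static long runOnThreads(int threads, long iterationsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                Callable<Long> task = () -> spin(iterationsPerThread);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        long n = 500_000_000L; // arbitrary; raise it if the run is too short to observe

        long t0 = System.nanoTime();
        runOnThreads(1, n);
        long oneThreadMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        runOnThreads(cores, n);
        long allCoresMs = (System.nanoTime() - t0) / 1_000_000;

        // If the second run does `cores` times the work in roughly the same
        // wall-clock time, the JVM and OS are using multiple cores.
        System.out.println("cores=" + cores
                + " oneThreadMs=" + oneThreadMs
                + " allCoresMs=" + allCoresMs);
    }
}
```

If the all-cores run takes about as long as the single-thread run, the JVM itself is scheduling threads across cores, and any single-core behaviour seen elsewhere would more likely come from how the particular job is structured than from the JVM or the OS.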


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Sean Dean <seandean@(protected)>
> To: nutch-user@(protected)
> Sent: Saturday, June 7, 2008 3:52:33 AM
> Subject: Re: Hardware Specifications
>
> Hey Otis,
>
> I will first disclose that the OS I'm using for my Nutch implementation is
> FreeBSD 7 (amd64) and may differ from a standard 64-bit Linux distribution. The
> JDK however is your standard Sun 1.5.0-14 64-bit package.
>
> I find that the JVM does not treat Nutch as something that's truly
> multithreaded. Whichever task you ask it to do, be it serve results, fetch,
> inject, update, etc., it will always peg one core and not use anything else
> (sometimes it will share processing with another core, but this is just the
> garbage collection thread inside the JVM).
>
> Having smaller indexes (15-20M) on multiple nutch instances (with 4GB or so of
> RAM) doesn't fix this limitation, but it does cheat in that each instance runs
> as its own independent JVM and as such the OS will execute operations on the
> core which has the lowest utilization via the scheduler (in my case FreeBSD's
> ULE) for each instance.
>
> When you think about it, this type of setup scales very well horizontally, much
> like Nutch/Hadoop itself. I find creating one huge index on the same machine and
> giving it everything it has in terms of resources has diminishing returns and,
> as my example points out, never uses it all anyway.
>
> One negative about this setup, though, is detailed in NUTCH-92. This issue alone
> kills any attempt to scale your search engine for "mainstream" commercial
> success (e.g. Google).
>
>
>
> ----- Original Message ----
> From: "ogjunk-nutch@(protected)"
> To: nutch-user@(protected)
> Sent: Friday, June 6, 2008 12:20:41 PM
> Subject: Re: Hardware Specifications
>
> Dan, you left out one important "bit" - this is a 64-bit machine?
>
> Sean, out of curiosity... is this really better than running a single JVM
> instance, single Nutch instance, on a multi-core 64-bit machine with 32GB of
> RAM and letting the OS switch between cores?
>
>
> As for fetching/indexing/searching - you probably don't want to do this on the
> same set of machines. Use a set of machines for fetching/indexing, and a set of
> machines for serving search requests.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
> > From: Sean Dean
> > To: nutch-user@(protected)
> > Sent: Thursday, June 5, 2008 3:45:41 PM
> > Subject: Re: Hardware Specifications
> >
> > Another idea is to set up 8 separate Nutch instances on the same server, each
> > with its own 20M index.
> >
> > The idea behind this is that one core per application will be used, although
> > it's not pegged, and the RAM is used in ~4GB chunks (JVM setting) for each
> > instance.
> >
> > This would be used for serving results only though; you would have to disable
> > part or all of this when in fetching mode, but it would give you 160M pages and
> > still very good speeds (about 4-5 per second or more as other factors come
> > into play). Keep in mind we use 8 hard drives, each associated with its own
> > instance on the server, but as long as the RAID FC setup you have is very fast
> > the results should be comparable (maybe even faster).
> >
> >
> > ----- Original Message ----
> > From: Dennis Kubes
> > To: nutch-user@(protected)
> > Sent: Thursday, June 5, 2008 2:38:04 PM
> > Subject: Re: Hardware Specifications
> >
> > In-memory index: 15M. On-disk index: slower but still doable where
> > response time isn't critical, ~350M pages, maybe more.
> >
> > Dennis
> >
> > Dan Segel wrote:
> > > We have a server that has 30TB of hard drive space connected through fiber,
> > > 2 quad-core 2.5GHz CPUs, and 32GB of RAM. If serving 5 searches per second, how
> > > many million indexed pages do you think we can achieve?
> > >
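
The figures in Sean's eight-instance proposal can be checked with a few lines of arithmetic. The sketch below only restates numbers from the thread (8 instances, 20M pages and ~4GB of JVM memory per instance, 32GB of RAM on Dan's server); the class and method names are made up for illustration, and treating the "~4GB chunk" as the full memory footprint of each JVM is an assumption.

```java
// Hypothetical back-of-the-envelope check of the multi-instance layout
// described in the thread; not part of Nutch.
class CapacityEstimate {

    // Total index size across all instances, in millions of pages.
    static long totalPagesMillions(int instances, long pagesPerInstanceMillions) {
        return instances * pagesPerInstanceMillions;
    }

    // Total JVM memory across all instances, in GB.
    // Assumption: each instance uses exactly its ~4GB setting and nothing more.
    static long totalJvmMemoryGb(int instances, long memoryPerInstanceGb) {
        return instances * memoryPerInstanceGb;
    }

    // RAM left for the OS and disk cache once every instance has its share.
    static long ramHeadroomGb(long machineRamGb, int instances, long memoryPerInstanceGb) {
        return machineRamGb - totalJvmMemoryGb(instances, memoryPerInstanceGb);
    }

    public static void main(String[] args) {
        int instances = 8;          // one per core on the 2 x quad-core server
        long pagesPerInstance = 20; // millions of pages per index
        long memoryPerInstance = 4; // GB per JVM
        long machineRam = 32;       // GB of RAM on the server

        System.out.println("pages (millions): "
                + totalPagesMillions(instances, pagesPerInstance));
        System.out.println("JVM memory (GB): "
                + totalJvmMemoryGb(instances, memoryPerInstance));
        System.out.println("headroom (GB): "
                + ramHeadroomGb(machineRam, instances, memoryPerInstance));
    }
}
```

This gives the 160M pages Sean quotes, and it also shows that eight 4GB JVMs add up to the machine's full 32GB, leaving no headroom for the OS or file cache under this assumption, so the per-instance memory setting would probably need to come down a little in practice.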
