Java Mailing List Archive

http://www.java2.5341.com/

List: nutch-user.lucene

CRAWLING USING HADOOP

kranthi reddy

2008-07-11

Hi,

I am trying to crawl a few sites using Nutch and Hadoop. I have a cluster
of 10 PCs and have given Nutch to Hadoop as a job file. I am able to
execute most commands, such as:

bin_temp/hadoop dfs -put xxx yyy (also ls, mkdir, etc.)

But when I try to run Nutch, I get the following error:

bin_temp/nutch crawl tempcrawl/urls -dir tempcrawl/crawl -depth 1

Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
    at org.apache.hadoop.ipc.Client.call(Client.java:473)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
    at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:247)
    at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:105)
    at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.initialize(DistributedFileSystem.java:67)
    at org.apache.hadoop.fs.FilterFileSystem.initialize(FilterFileSystem.java:57)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:160)
    at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:83)

Could someone please help me out?

When I remove the hadoop-env.sh, hadoop-site.xml, and masters files and
replace the contents of slaves with "localhost", I am able to crawl perfectly
well (but only on the master PC :(( ).
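
Since the trace ends in DFSClient.<init> timing out on getProtocolVersion, the
client appears to be getting no RPC answer from the namenode address it was
configured with. Below is a minimal sketch of a reachability check that could
be run from the machine where the crawl is started. The class name
NameNodeProbe, the host "master", and the port 9000 are assumptions for
illustration only; the real values would come from fs.default.name in
hadoop-site.xml. It only tests whether a TCP connection can be opened, not
whether the Hadoop RPC itself succeeds.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class NameNodeProbe {

    // Returns true if a TCP connection to host:port can be opened within
    // timeoutMs milliseconds, false on timeout, refusal, or unknown host.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical defaults; substitute the host and port from
        // fs.default.name in hadoop-site.xml.
        String host = args.length > 0 ? args[0] : "master";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9000;

        boolean ok = canConnect(host, port, 5000);
        System.out.println(host + ":" + port
                + (ok ? " is reachable" : " is NOT reachable"));
    }
}
```

If the probe fails from the machine running bin_temp/nutch but succeeds on the
master itself, the configured address or a firewall between the nodes would be
the first things to look at.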

Thank you in advance.
Kranthi reddy.B