nutch-user.lucene

Recover Nutch Crawl

Dan Plubell

2008-05-14


I'm using the org.apache.nutch.crawl.Crawl class (Nutch 0.9) on a single machine. The fetcher completed OK, but the LinkDb.invert step failed because the machine ran out of disk space.
Can I run LinkDb.invert and the remaining steps manually?
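As far as I can tell, the remaining steps correspond to the bin/nutch commands invertlinks, index, dedup and merge. Here is a rough sketch that only prints the command lines I would expect to run, assuming the default crawl/ layout that appears in the log. The class and method names are hypothetical, and the argument order is my assumption, so it should be checked against each tool's usage message before running anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: it only builds command strings and runs nothing.
final class RemainingSteps {

    // Builds the assumed bin/nutch command lines for the steps that
    // follow the CrawlDb update: invert links, index, dedup, merge.
    static List<String> commands(String crawlDir, List<String> segments) {
        String crawlDb = crawlDir + "/crawldb";
        String linkDb = crawlDir + "/linkdb";
        String indexes = crawlDir + "/indexes";

        List<String> cmds = new ArrayList<>();
        // Invert links over every segment under <crawlDir>/segments.
        cmds.add("bin/nutch invertlinks " + linkDb + " -dir " + crawlDir + "/segments");
        // Index the listed segments using the crawldb and linkdb.
        cmds.add("bin/nutch index " + indexes + " " + crawlDb + " " + linkDb
                + " " + String.join(" ", segments));
        // Remove duplicate documents from the part indexes.
        cmds.add("bin/nutch dedup " + indexes);
        // Merge the part indexes into a single index.
        cmds.add("bin/nutch merge " + crawlDir + "/index " + indexes);
        return cmds;
    }

    public static void main(String[] args) {
        // Segment name taken from the log; list the others the same way.
        List<String> segments = Arrays.asList("crawl/segments/20080504231054");
        for (String cmd : commands("crawl", segments)) {
            System.out.println(cmd);
        }
    }
}
```

Printing the commands rather than executing them leaves room to free disk space first and to run one step at a time.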
In the \mapred\local directory there are several \map_* directories. Do the invert, index, dedup, and merge steps need these directories? I need to recover some disk space, and I'm wondering whether they can be safely deleted.
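To see how much space is at stake before deleting anything, something like the following could total the map_* directories and report the free space on that volume. This is only a sketch: the class name MapDirUsage and the default path mapred/local are assumptions, and it says nothing about whether the directories are still needed.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper: it reads sizes only and deletes nothing.
final class MapDirUsage {

    // Total size in bytes of all regular files under dir (recursive).
    static long sizeOf(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    // Sum of the sizes of the map_* directories directly under localDir.
    static long mapDirBytes(Path localDir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(localDir, "map_*")) {
            for (Path d : dirs) {
                if (Files.isDirectory(d)) {
                    total += sizeOf(d);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Pass the real location as the first argument; the default is an assumption.
        Path local = Paths.get(args.length > 0 ? args[0] : "mapred/local");
        System.out.println("map_* bytes:  " + mapDirBytes(local));
        System.out.println("usable bytes: " + Files.getFileStore(local).getUsableSpace());
    }
}
```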
Here are the entries from the log file:
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080504231054]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080421114841
LinkDb: adding segment: crawl/segments/20080421115732
LinkDb: adding segment: crawl/segments/20080421130158
LinkDb: adding segment: crawl/segments/20080421144524
LinkDb: adding segment: crawl/segments/20080421214809
LinkDb: adding segment: crawl/segments/20080422042411
LinkDb: adding segment: crawl/segments/20080422114958
LinkDb: adding segment: crawl/segments/20080424063149
LinkDb: adding segment: crawl/segments/20080430101435
LinkDb: adding segment: crawl/segments/20080504231054
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
 at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:136)
Thanks,
Dan