Mailing list: nutch-user.lucene

problem in crawling

Mohammad Monirul Hoque

2008-08-03


Hi,

I am using Nutch 0.9 on Ubuntu, on a single machine in pseudo-distributed mode.
When I execute the following command:

bin/nutch crawl urls -dir crawled -depth 10

this is what I get in the Hadoop log:

2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
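
The timestamps above are easier to compare once each phase's start and end lines are paired up. The following is a minimal Python sketch of that idea, not part of Nutch or Hadoop: `parse_line` and `phase_durations` are hypothetical helper names, and it assumes every line follows the `timestamp LEVEL component - message` layout seen in this log. The sample embeds only the six Fetcher start/done lines copied from the log.

```python
from datetime import datetime

# Fetcher start/done lines copied from the log above.
LOG = """\
2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
"""


def parse_line(line):
    """Split one log line into (timestamp, component, message)."""
    # The timestamp occupies the first 23 characters: "YYYY-MM-DD HH:MM:SS,mmm".
    ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    # The remainder looks like "INFO fetcher.Fetcher - Fetcher: starting".
    _level, component, _dash, message = line[24:].split(" ", 3)
    return ts, component, message


def phase_durations(lines):
    """Pair each '<phase>: starting' line with the next '<phase>: done' line."""
    started = {}
    durations = []
    for line in lines:
        if not line.strip():
            continue
        ts, _component, message = parse_line(line)
        phase, _, status = message.rpartition(": ")
        status = status.rstrip(".")  # the Generator logs "done." with a period
        if status == "starting":
            started[phase] = ts
        elif status == "done" and phase in started:
            durations.append((phase, ts - started.pop(phase)))
    return durations


durations = phase_durations(LOG.splitlines())
for phase, elapsed in durations:
    print(f"{phase}: {elapsed.total_seconds():.0f} s")
```

For these lines, the three fetch rounds come out at roughly 49 seconds, 7 minutes, and 62 minutes. The log excerpt ends at the third `Fetcher: done` line, so there is no later phase for the script to pair.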

This is what I found when I ran the following commands:
$ bin/hadoop dfs -ls
Found 2 items
/user/nutch/crawled   <dir>
/user/nutch/urls     <dir>
$ bin/hadoop dfs -ls crawled
Found 2 items
/user/nutch/crawled/crawldb   <dir>
/user/nutch/crawled/segments   <dir>

Where are linkdb, indexes, and index? Please tell me what the error might be.

Here is my hadoop-site.xml:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>sysmonitor:9000</value>
<description>
  The name of the default file system. Either the literal string
  "local" or a host:port for NDFS.
</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>sysmonitor:9001</value>
<description>
  The host and port that the MapReduce job tracker runs at. If
  "local", then jobs are run in-process as a single map and
  reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>
  The maximum number of tasks that will be run simultaneously by
  a task tracker. This should be adjusted according to the heap size
  per task, the amount of RAM available, and CPU consumption of each task.
</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx200m</value>
<description>
  You can specify other Java options for each map or reduce task here,
  but most likely you will want to adjust the heap size.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/nutch/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/nutch/filesystem/data</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>/nutch/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/nutch/filesystem/mapreduce/local</value>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
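
As a quick sanity check on a file like this, the name/value pairs can be listed without the descriptions. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; `read_properties` is a hypothetical helper name, and the embedded XML is a shortened copy of three properties from the configuration above rather than the whole file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration above (descriptions omitted).
SITE_XML = """\
<configuration>
<property>
<name>fs.default.name</name>
<value>sysmonitor:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>sysmonitor:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping each <property> name to its value."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


props = read_properties(SITE_XML)
for name, value in sorted(props.items()):
    print(f"{name} = {value}")
```

To check the real file, the same function could be given the contents of hadoop-site.xml read from disk.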


My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it seems that very little crawling was done.


regards
--monirul


   