Java Mailing List Archive

http://www.java2.5341.com/

nutch-user.lucene

Unable to crawl all links

Amitabha Banerjee

2008-09-10

Hi folks,
I am unable to crawl all the links on my website. For some reason, Nutch
picks up only one or two links.

Here is the website I am trying to index: http://www.knowmydestination.com

All links on this website are internal.

My crawl-urlfilter does not block any kind of internal link. The relevant
rules are:

# accept hosts in MY.DOMAIN.NAME
+^http://www.knowmydestination.com/

# skip everything else
-.
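
For readers who want to check individual URLs against these two rules, here is a minimal Python sketch. It assumes the filter behaves the way Nutch's regex URL filter is generally described: rules are tried top to bottom, the first pattern found anywhere in the URL decides, '+' accepts, '-' rejects, and a URL that matches no rule is rejected. It is an approximation for experimenting, not Nutch's own code, and the helper names parse_rules and accepts are made up for this example.

```python
import re

# Rules copied from the post. '+' accepts, '-' rejects, '#' starts a comment.
RULES = """
# accept hosts in MY.DOMAIN.NAME
+^http://www.knowmydestination.com/

# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter text into an ordered list of (accept, compiled_pattern)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat as rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.knowmydestination.com/articles/cheapfares.html",
        "http://knowmydestination.com/",       # same site, no "www."
        "https://www.knowmydestination.com/",  # different scheme
    ):
        print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

Under these assumptions, any link written without the "www." prefix or with a different scheme would be rejected by the rules above. The post does not say what the site's links look like, so this is only one thing worth checking.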

My seed URL file (urls) contains: http://www.knowmydestination.com/

When I run:
bin/nutch crawl urls -dir crawl.kmd -depth 3 -topN 100

Nutch crawls only one link:
http://www.knowmydestination.com/articles/cheapfares.html
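
One way to narrow a problem like this down is to list the absolute URLs that a plain HTML parser finds in the seed page's anchor tags and compare them with what the crawler fetched. The sketch below uses only the Python standard library. It is a rough diagnostic under stated assumptions, not a reproduction of Nutch's parser: it looks only at <a href="..."> attributes, so links generated by JavaScript, forms, or image maps will not show up. The sample HTML and the hotels.html path are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links the way a browser would.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs of all anchor links found in the HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Hypothetical page content, for illustration only.
    sample_html = """
    <a href="/articles/cheapfares.html">Cheap fares</a>
    <a href="http://knowmydestination.com/hotels.html">Hotels</a>
    """
    for link in extract_links("http://www.knowmydestination.com/", sample_html):
        print(link)
```

To run this against the real page, the HTML could be fetched with urllib.request.urlopen and passed to extract_links. If the list is much longer than what was crawled, the URL filter is a likely suspect. If it is about as short, the links are probably not plain anchors.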

Can anyone help me figure this out?

/Amitab