Java Mailing List Archive: nutch-user.lucene

infinite loop-problem

Felix Zimmermann

2008-06-16

Hi,



While crawling the website http://www.bmj.de, I suspect Nutch has been caught
in an infinite loop. It has been fetching for two days now and there seems to
be no end.



I need every linked document from this website.



My configuration:



A. The crawl-urlfilter.txt and crawl options:



1. I removed the line that is meant to break loops in the case of 3+ slashes.
I think this is OK in my case and is not the cause of my problem.

2. URLFilter is +^http://www.bmj.de/

3. Command-line options: "nutch crawl .. -depth 10 -topN 10000"
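
To make the effect of the filter in point 2 concrete, here is a minimal Python sketch of first-match-wins filtering with "+" (accept) and "-" (reject) rules. It is not Nutch code: the function name `accepts`, the shortened sample URL, and the assumption that an unmatched URL is dropped are illustrative only. Only the "+" rule itself comes from the post.

```python
import re

# '+' accepts, '-' rejects. Only this rule is taken from the post.
RULES = ["+^http://www.bmj.de/"]

# Shortened, hypothetical stand-in for the long session URL quoted in the post.
SAMPLE = ("http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,0"
          "/Pressestelle/Pressemitteilungen_58.html")


def accepts(url, rules=RULES):
    """Apply the first rule whose regex is found in the URL (assumed semantics)."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


print(accepts(SAMPLE))                     # session URL on the same host
print(accepts("http://www.example.org/"))  # different host
```

Under these assumptions every URL on the host passes, including each session-specific variant of the same page, so the filter alone does nothing to limit them.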



B. I configured Nutch to fetch first and parse afterwards in order to
increase fetching speed.





Is it because of the session IDs and navigation strings in the URLs? They
look like this:

http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,1489d6706d635f6964092d0935313835093a0979656172092d0932303038093a096d6f6e7468092d093035093a095f7472636964092d0935313835/Pressestelle/Pressemitteilungen_58.html
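
A small Python sketch of why such URLs can look endless to a crawler: if the segment after /enid/ changes per session, the same page shows up under many distinct URLs. The sketch assumes that this segment has the form "<hex>,<hex>" and carries only session and navigation state. That is a guess from the URL above, not something confirmed by the site. The name `dedupe_key` is hypothetical, and the result is meant as a comparison key only, not as a URL that can be fetched.

```python
import re

# Assumption: the path segment right after /enid/ is "<hex>,<hex>" and holds
# only session and navigation state, not the identity of the page.
SESSION_SEGMENT = re.compile(r"(?<=/enid/)[0-9a-f]+(?:,[0-9a-f]+)*/")


def dedupe_key(url):
    """Drop the assumed session segment so that variants of a page compare equal."""
    return SESSION_SEGMENT.sub("", url, count=1)


# Two hypothetical variants of the same page with different session segments.
a = "http://www.bmj.de/enid/0a1b,2c3d/Pressestelle/Pressemitteilungen_58.html"
b = "http://www.bmj.de/enid/ffff,0000/Pressestelle/Pressemitteilungen_58.html"

print(dedupe_key(a))
print(dedupe_key(a) == dedupe_key(b))
```

If the assumption holds, counting distinct keys instead of distinct URLs in the crawl data would show how much of the crawl consists of repeats of the same pages.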





How can I deal with this?



I'm running Nutch/Solr as proposed by Doğacan Güney et al. in NUTCH-442
(see https://issues.apache.org/jira/browse/NUTCH-442), with Tomcat 6 and
Ubuntu 8.04.





Thanks

Felix.
