Java Mailing List Archive


Nutch searcher keeps reading CVS directories

afan0804

2008-09-05


Hi All,

My problem occurs when this code is called:
Summary[] summaries = nbean.getSummary(details, query);
where nbean is a NutchBean, query is a Query object, and details is a
HitDetails[].

I get this message:
[9/5/08 16:37:07:203 MDT] 00000034 SystemErr   R 08/09/05 16:37:07 FATAL
searcher.FetchedSegments: java.io.FileNotFoundException: C:/[path to crawl
folder]/segments/20080828123423/parse_text/CVS/data

Since this code is checked into CVS, each directory level contains an
auto-generated CVS directory. My guess is that Nutch treats those CVS
directories as part of the segment and looks for a "data" file inside each
one, which does not exist in a CVS directory.
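
One way to check this guess is to list which children of a segment sub-directory have no "data" file. The following is a minimal sketch using only the Java standard library, not Nutch or Hadoop code; the class and method names are hypothetical, and it assumes the segment lives on the local file system.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MissingDataFinder {

    // Hypothetical helper: returns the names of the sub-directories of dir
    // that do not contain a regular file called "data".
    static List<String> childrenWithoutData(Path dir) throws IOException {
        List<String> missing = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path child : stream) {
                boolean hasData = Files.isRegularFile(child.resolve("data"));
                if (Files.isDirectory(child) && !hasData) {
                    missing.add(child.getFileName().toString());
                }
            }
        }
        Collections.sort(missing); // stable, readable output order
        return missing;
    }
}
```

If only "CVS" is reported for a directory such as parse_text, that is consistent with the guess above.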

I wish to ignore those CVS directories rather than remove them (since they
are needed for CVS).

It seems that the path to the segment sub-directory is processed in:
org.apache.nutch.searcher.FetchedSegments
  private MapFile.Reader[] getReaders(String subDir) throws IOException {
    return MapFileOutputFormat.getReaders(fs, new Path(segmentDir, subDir), this.conf);
  }

I have tried passing in C:/[path to crawl
folder]/segments/20080828123423/parse_text/part-00000, but then the error
becomes:
[9/5/08 14:34:08:453 MDT] 0000002a SystemErr   R 08/09/05 14:34:08 FATAL
searcher.FetchedSegments: java.io.FileNotFoundException: C:/[path to crawl
folder]/segments/20080828123423/parse_text/part-00000/CVS/data

Any ideas? Is it possible to get Hadoop to ignore directories named "CVS"?
Or is there a way I can point directly to the data file?
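
The filtering idea can be sketched as follows. This is a stand-alone illustration using java.nio, not the Nutch or Hadoop implementation; the class name and the listReaderDirs helper are hypothetical. In Hadoop, the analogous approach would presumably be to list the directory with FileSystem.listStatus(Path, PathFilter) and open a MapFile.Reader only for the accepted paths, but whether that fits the Hadoop version bundled with a given Nutch release is an assumption that would need checking.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CvsSkippingLister {

    // Hypothetical helper: returns the sub-directories of dir that a reader
    // should open, skipping version-control directories named "CVS".
    static List<Path> listReaderDirs(Path dir) throws IOException {
        DirectoryStream.Filter<Path> notCvs = p ->
                Files.isDirectory(p)
                        && !p.getFileName().toString().equals("CVS");
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, notCvs)) {
            for (Path p : stream) {
                result.add(p);
            }
        }
        Collections.sort(result); // keep part-00000, part-00001, ... in order
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listReaderDirs(dir)) {
            System.out.println(p);
        }
    }
}
```

Matching on the exact name "CVS" keeps the filter narrow, so real part-NNNNN directories are never skipped by accident.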

Thank you very much,
Angela Fan
--
Sent from the Nutch - User mailing list archive at Nabble.com.