is it possible?
well, in Eclipse it succeeded. I added some encoding code to Content.java using HtmlParser (a plugin). it works successfully in Eclipse (I have tested it using SegmentReader only, not with any unit tests though).
but when compiling using ant I get compile errors.
here is the modification in Content.java in nutch-0.9.tar.gz release version (not trunk)
I have replaced the line:
buffer.append(new String(content)); // try default encoding
with
// parse the content with the parse-html plugin to find its original encoding
Configuration conf = NutchConfiguration.create();
HtmlParser parser = new HtmlParser();
parser.setConf(conf);
Parse parse = parser.getParse(this);
String encoding = parse.getData().getParseMeta().get("OriginalCharEncoding");
// placeholder text that gets appended if decoding fails
String localEncodedString = "java incompatible encoding";
try {
  localEncodedString = new String(content, encoding);
} catch (Exception e) {
  e.printStackTrace();
}
buffer.append(localEncodedString);
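For reference, the decoding step on its own does not need any plugin classes. Below is a minimal sketch of just that step, assuming the encoding name has already been obtained in some way. DecodeSketch and decode are made-up names for illustration, not part of Nutch. It falls back to the platform-default charset when the name is null or not supported by the JVM:

```java
import java.nio.charset.Charset;

class DecodeSketch {

    // Hypothetical helper (not Nutch API): decode raw bytes with the given
    // charset name, or with the platform default if the name is unusable.
    static String decode(byte[] content, String encoding) {
        Charset charset = Charset.defaultCharset();
        if (encoding != null) {
            try {
                charset = Charset.forName(encoding);
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names;
                // keep the platform default in that case.
            }
        }
        return new String(content, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };   // U+00E9 encoded as ISO-8859-1
        System.out.println(decode(latin1, "ISO-8859-1"));
        System.out.println(decode(new byte[] { 'a', 'b', 'c' }, null));
    }
}
```

Unlike the fragment above, this never appends a placeholder string; whether a silent fallback is acceptable for -get output is a separate decision.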
here are the compile errors:
compile-core:
[javac] Compiling 165 source files to /home/onur/nutch-0.9/build/classes
[javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:39: package org.apache.nutch.parse.html does not exist
[javac] import org.apache.nutch.parse.html.HtmlParser;
[javac] ^
[javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:240: cannot find symbol
[javac] symbol : class HtmlParser
[javac] location: class org.apache.nutch.protocol.Content
[javac] HtmlParser parser = new HtmlParser();
[javac] ^
[javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:240: cannot find symbol
[javac] symbol : class HtmlParser
[javac] location: class org.apache.nutch.protocol.Content
[javac] HtmlParser parser = new HtmlParser();
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors
BUILD FAILED
/home/onur/nutch-0.9/build.xml:106: Compile failed; see the compiler error output for details.
do I need to change any other configuration to fix it? (parse-html is already listed in the plugin.includes property in nutch-default.xml. I also tried adding it in nutch-site.xml, but that did not work.)
or are plugins not intended to be used in core code?
any ideas?
(by the way, what I'm trying to do here is to enable encoding in the -get functionality. it normally gives the content in the platform-default encoding (utf-8).)
thanks
onur deniz