nutch-user.lucene mailing list archive

getting content from url - encoding problem

Onur Deniz

2008-09-01


 hi,

 I am using Nutch only to crawl some web sites; I am not using the search facility.
 I drive Nutch entirely through its command-line options. I have not changed any source code, only configuration files (like url-filter).
 I call the command-line options from scripts and execute those scripts from Java using Runtime.getRuntime().exec(...). (It is a slightly longer route, but at first it seemed easier than running from Eclipse.)
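
 For illustration, here is a minimal sketch of that kind of setup: launching an external command from Java and keeping its standard output as raw bytes, so that no charset is applied before the caller chooses one. The class name RawCommandOutput and its methods are hypothetical, and the command itself is supplied by the caller; nothing here is taken from Nutch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: runs a command and returns
// whatever it writes to standard output as undecoded bytes.
public class RawCommandOutput {

    // Drains a stream into a byte array without applying any charset.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Starts the command, captures stdout as bytes, and waits for it to exit.
    public static byte[] run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let stderr go to the parent's stderr so it cannot fill up and block.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        byte[] output;
        try (InputStream stdout = process.getInputStream()) {
            output = readAll(stdout);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("command exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java RawCommandOutput <command> [args...]");
            return;
        }
        byte[] output = run(Arrays.asList(args));
        System.out.println(output.length + " bytes captured");
    }
}
```

 Keeping the output as bytes postpones the decoding decision; wrapping the process stream in a Reader without naming a charset would instead decode it with the platform default.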

 I know how to get the content/parsetext of a URL from the command line ( bin/nutch readseg -get .... ).

 Getting the parsetext works fine because Nutch handles the encoding of the site. But when I try to get the content of the page using the command (bin/nutch readseg -get), I run into an encoding problem:
 the page is in windows-1254, but I think the command returns the content in UTF-8, because some special characters (ş, ç, ğ, ü, ı) are displayed as the replacement character ( <?> ).
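
 The symptom described above can be reproduced in isolation. The sketch below is only an illustration of a charset mismatch, not a statement about what readseg does internally: it encodes the five Turkish letters as windows-1254 bytes and decodes them once as UTF-8 and once as windows-1254. The class name EncodingMismatchDemo is hypothetical, and the code assumes the JVM ships the windows-1254 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how windows-1254 bytes turn into
// replacement characters (U+FFFD) when they are decoded as UTF-8.
public class EncodingMismatchDemo {

    // Decodes raw bytes with an explicitly chosen charset.
    // Malformed input is replaced with U+FFFD rather than raising an error.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JVM.
        Charset cp1254 = Charset.forName("windows-1254");

        // The letters from the post, written as escapes so the source
        // file's own encoding does not matter: s-cedilla, c-cedilla,
        // g-breve, u-diaeresis, dotless i.
        String original = "\u015F\u00E7\u011F\u00FC\u0131";
        byte[] raw = original.getBytes(cp1254);

        String decodedAsUtf8 = decode(raw, StandardCharsets.UTF_8);
        String decodedAsCp1254 = decode(raw, cp1254);

        System.out.println("windows-1254 round trip intact: "
                + decodedAsCp1254.equals(original));
        System.out.println("UTF-8 decoding contains U+FFFD: "
                + (decodedAsUtf8.indexOf('\uFFFD') >= 0));
    }
}
```

 If the raw bytes of the page are available, decoding them with the page's declared charset, as in the second call, is the usual way to avoid the replacement characters.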
 So, my questions are:
 How does the command (bin/nutch readseg -get ... -nofetch -nogenerate -noparse -noparsedata -noparsetext) return the content of the page? I mean, does it decode the content according to the page's encoding, or does it return the content in UTF-8 by default?
 
 any suggestions? any solutions?

 thanks all.


 onur deniz


