Let me ask the question in a different way; hopefully you guys can shed some
light.
The two commands I used, "readseg -get" and "readseg -dump", gave three different
texts:
1) The Chinese text in the "parsetext" section is all correct (via -get)
2) The Chinese text in the html is all messed up (via -get)
3) The Chinese text in the html is largely correct, but messed up especially
near Roman punctuation and braces (via -dump)
I guess I need to know whether this is a fetch problem or a readseg problem before
plunging into the source (as a newcomer).
Thanks in advance!
---------- Forwarded message ----------
From: wuwuengr@(protected)
Date: 2008/10/14
Subject: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
To: nutch-user@(protected)
And it became weirder when I used "readseg -get".
The Chinese text in the "parsetext" section is all correct, while the main html
page is totally messed up; both differ from what I got with "readseg
-dump".
Does anybody have a clue? It seems to be a SegmentReader problem, which for some
reason uses a shaky encoding/conversion when pulling text from segments?
By the way, all the Chinese characters are in three-byte UTF-8.
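To illustrate why the three-byte detail might matter, here is a minimal Python sketch. It assumes, purely as a hypothesis, that somewhere the UTF-8 bytes get decoded with a double-byte charset and written back; GBK is only an example choice and nothing here is confirmed about Nutch or readseg. Under that assumption, a run of three-byte characters with an odd byte count leaves a stray byte next to the following ASCII character, so damage tends to show up near Roman punctuation.

```python
# Hypothetical illustration: GBK as the mistaken charset is an assumption,
# not something confirmed about Nutch or readseg.
original = "韶山冲(徐汇店)"
utf8_bytes = original.encode("utf-8")

# Each of these Chinese characters occupies three bytes in UTF-8.
assert len("冲".encode("utf-8")) == 3


def round_trip(data: bytes, charset: str) -> str:
    """Decode bytes with `charset`, re-encode with it, then read the result as UTF-8."""
    misread = data.decode(charset, errors="replace")
    rewritten = misread.encode(charset, errors="replace")
    return rewritten.decode("utf-8", errors="replace")


damaged = round_trip(utf8_bytes, "gbk")    # wrong charset in the middle
intact = round_trip(utf8_bytes, "utf-8")   # correct charset throughout

print(repr(damaged))        # compare by eye with the original string
print(intact == original)   # the all-UTF-8 path is lossless
```

If the damaged string resembles what "readseg -dump" produces, that would point at a conversion step rather than at the fetch itself.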
---------- Forwarded message ----------
From: wuwuengr@(protected)
Date: 2008/10/13
Subject: Fetch/Dump problem: Some Chinese characters incorrect.
To: nutch-user@(protected)
I obtained some Chinese language webpages via "nutch fetch". But some
Chinese characters do not come out right after I dumped the segment back to
html pages. For instance:
http://www.dianping.com/shop/501079/
has title portion:
<head><title>
韶山冲(徐汇店)(图)_上海_大众点评网
</title>
However, I got this after dumping:
<head><title>
韶山�1¤7(徐汇庄1¤7)(�1¤7)_上海_大众点评罄1¤7
</title>
The charset specified in the page is "UTF-8". I included the following
in "nutch-site.xml":
<name>parser.character.encoding.default</name>
<value>UTF-8</value>
It makes no difference.
What could be the problem?
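One way to narrow this down is to check the raw bytes of the dumped output. The sketch below is only a diagnostic aid under my own assumptions; the helper names are made up for illustration and are not part of Nutch. It reports the offset of the first byte sequence that is not valid UTF-8, and counts U+FFFD replacement characters in text that has already been decoded.

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if all valid."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None


def count_replacement_chars(text: str) -> int:
    """Count U+FFFD characters, which mark bytes some decoder could not interpret."""
    return text.count("\ufffd")


# Hypothetical samples; in practice, read the dump file in binary mode instead.
sample_ok = "韶山冲(徐汇店)".encode("utf-8")
sample_bad = b"ab\xe5\x86("  # three-byte sequence cut short before "("

print(first_invalid_utf8(sample_ok))
print(first_invalid_utf8(sample_bad))
```

If the dump is valid UTF-8 but already contains U+FFFD, the characters were presumably lost before the file was written; if it contains invalid byte sequences, the bytes themselves were altered somewhere along the way.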