character encoding and charsets 2007-05-03 - By Justin Warren
Hi guys,
I have an interesting problem. I am using POI to extract text from a Word document (usually Word 2000/2003) that is written in Chinese. When I write the extracted text to a plain-text file, I get garbled characters instead of the Chinese text. So I want to be able to convert the text to UTF-8. Is there any way to determine the charset so I can decode it?
In Eclipse, I am calling WordExtractor.getParagraphs(), and if I set a breakpoint I can see the Chinese characters. I also noticed a property in HWPFDocument called field_27_cChFtnEdn. Is that possibly what I should be looking at?
Thanks
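
For reference, here is a minimal sketch of the write step described above. It assumes the strings coming back from the extractor are already correct Unicode (the debugger showing Chinese characters suggests this), in which case the garbling would come from writing the file with the platform default encoding rather than from the document itself. The class name Utf8TextWriter and the method writeParagraphs are hypothetical, the paragraph array stands in for whatever the extractor returns, and StandardCharsets requires Java 7 or later.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: not part of POI.
public class Utf8TextWriter {

    // Writes each paragraph to the target file, always encoding as UTF-8
    // instead of relying on the platform default charset.
    public static void writeParagraphs(String[] paragraphs, Path target) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String paragraph : paragraphs) {
                out.write(paragraph);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extracted text (assumption: already valid Unicode strings).
        String[] paragraphs = { "\u4f60\u597d\n", "\u4e16\u754c\n" };
        Path target = Files.createTempFile("extracted", ".txt");
        writeParagraphs(paragraphs, target);

        // Read the bytes back as UTF-8 and check that the text survived the round trip.
        String readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(readBack.equals("\u4f60\u597d\n\u4e16\u754c\n"));
    }
}
```

If the output still looks wrong with an explicit charset, the viewer used to open the file may be assuming a different encoding, which is worth ruling out before looking at the document internals.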