(Offtopic) The unicode name for a character
2004-12-22 - By Peter Pimley
Hi everyone,
The Question: In Java generally, is there an easy way to get the Unicode name of a character? (e.g. "LATIN SMALL LETTER A" from 'a')
The Reasoning (for those who are interested): The documents I'm indexing have quite a lot of characters that are basically variations on the basic A-Z ones. In my analysis step, I'd like to convert these to their closest equivalent in the basic A-Z set.
For some letters, this is easy. An example is the e-acute character (00E9 LATIN SMALL LETTER E WITH ACUTE). I'd like to turn that into plain 'e'. I can do that by using the IBM ICU4J tools to decompose the single character into two: 'e' and 0301 COMBINING ACUTE ACCENT. Then I can strip all characters that fail Character.isLetterOrDigit. That works fine.
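A minimal sketch of that decompose-and-strip step. Note the assumption: it uses java.text.Normalizer (part of the JDK from Java 6 onwards) in place of the ICU4J normalizer, and the class and method names are made up for illustration.

```java
import java.text.Normalizer;

class AccentStripper {
    // Decompose to NFD, then drop every code point that fails
    // Character.isLetterOrDigit (this removes the combining marks,
    // and also any spaces or punctuation, so apply it per token).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 decomposes to 'e' + U+0301; the accent is then stripped.
        System.out.println(stripAccents("caf\u00E9"));
    }
}
```

Characters with no canonical decomposition pass through this step unchanged.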
Some characters however do not decompose. An example is the character 01A4 LATIN CAPITAL LETTER P WITH HOOK. I'd like to replace that with 'P', but it does not decompose into P + something.
I'm considering taking the Unicode name for each character I encounter and matching it against a regex, something like: ^LATIN .* LETTER (.) WITH .*$ ... to try and extract the single A-Z|a-z character.
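A rough sketch of the name-matching idea. It assumes Character.getName(int), which the JDK only provides from Java 7 onwards; ICU4J's UCharacter.getName(int) should serve the same purpose on older runtimes. The class name, method name, and the slightly tightened regex are illustrative choices, not an established recipe. Because the Unicode name always spells the letter in upper case, the SMALL/CAPITAL part of the name is used to pick the case of the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameFolder {
    // Group 1 is the case (SMALL or CAPITAL), group 2 the base letter.
    private static final Pattern LATIN_WITH =
        Pattern.compile("^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH .*$");

    // Returns the plain A-Z/a-z letter named in the character's Unicode
    // name, or the code point unchanged if the name does not match.
    static int baseLetter(int codePoint) {
        String name = Character.getName(codePoint); // null if unassigned
        if (name == null) {
            return codePoint;
        }
        Matcher m = LATIN_WITH.matcher(name);
        if (!m.matches()) {
            return codePoint;
        }
        char letter = m.group(2).charAt(0);
        return m.group(1).equals("SMALL") ? Character.toLowerCase(letter) : letter;
    }

    public static void main(String[] args) {
        // U+01A4 is LATIN CAPITAL LETTER P WITH HOOK.
        System.out.println((char) baseLetter(0x01A4));
    }
}
```

Names that put a multi-letter token after LETTER (ligatures and the like) do not match the single-letter group and are left alone, so they would still need separate handling.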