(Offtopic) The unicode name for a character

2004-12-22       - By Peter Pimley



Hi everyone,

The Question:
In Java generally, is there an easy way to get the Unicode name of a
character?  (e.g. "LATIN SMALL LETTER A" from 'a')
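
For anyone reading this later: a minimal sketch of the kind of lookup being asked about, assuming a JDK of version 7 or newer, where Character.getName(int) is available. ICU4J offers a comparable call, UCharacter.getName(int), for older runtimes. The class and method names below are made up for illustration.

```java
// Sketch only: relies on Character.getName(int), which exists from Java 7 onward.
class UnicodeNameLookup {

    /** Returns the Unicode name of a code point, or null if it is unassigned. */
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(nameOf('a'));     // LATIN SMALL LETTER A
        System.out.println(nameOf(0x00E9));  // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(nameOf(0x01A4));  // LATIN CAPITAL LETTER P WITH HOOK
    }
}
```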


The Reasoning (for those who are interested):
The documents I'm indexing have quite a lot of characters that are
basically variations on the basic A-Z ones.  In my analysis step, I'd
like to convert these to their closest equivalent in the basic A-Z set.

For some letters, this is easy.  An example is the e-acute character
(00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into
plain 'e'.  I can do that by using the IBM ICU4J tools to decompose the
single character into two; 'e' and 0301 COMBINING ACUTE ACCENT.  Then I
can strip all characters that fail Character.isLetterOrDigit.  That
works fine.
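
The decompose-then-strip step can be sketched roughly as follows. Note that this is not the ICU4J code described above: it swaps in the JDK's own java.text.Normalizer (Java 6 and later) with canonical decomposition (NFD), on the assumption that it behaves the same for this purpose. AccentStripper and stripAccents are hypothetical names.

```java
import java.text.Normalizer;

// Sketch only: uses java.text.Normalizer in place of the ICU4J decomposition.
class AccentStripper {

    /**
     * Decomposes the input (NFD), then keeps only code points that pass
     * Character.isLetterOrDigit. Combining accents are dropped, and so are
     * spaces and punctuation, so this is meant to run on a single token.
     */
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("caf\u00E9"));  // cafe
    }
}
```

A character with no decomposition, such as 01A4, is itself a letter and passes through this step unchanged.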

Some characters however do not decompose.  An example is the character
01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace that with
'P', but it does not decompose into P + something.

I'm considering taking the Unicode name for each character I encounter
and matching it against a regex something like:
^LATIN .* LETTER (.) WITH .*$
... to try and extract the single A-Z|a-z character.
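
A rough sketch of that idea, again assuming Java 7 or later for Character.getName(int). The pattern is the one above with one addition of my own: an extra capture group around the CAPITAL/SMALL part, so the extracted letter can be given the right case. LatinBaseLetter and baseLetter are hypothetical names, and characters whose names do not fit the pattern are returned unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: maps e.g. U+01A4 to 'P' by parsing the Unicode character name.
class LatinBaseLetter {

    // Group 1 is the case part (assumed addition), group 2 is the base letter.
    private static final Pattern NAME =
            Pattern.compile("^LATIN (.*) LETTER (.) WITH .*$");

    /** Returns the plain A-Z or a-z letter for the code point, or the code point itself. */
    static int baseLetter(int codePoint) {
        String name = Character.getName(codePoint);
        if (name == null) {
            return codePoint;  // unassigned code point
        }
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            return codePoint;  // name does not look like "LATIN ... LETTER X WITH ..."
        }
        char base = m.group(2).charAt(0);
        return m.group(1).contains("SMALL") ? Character.toLowerCase(base) : base;
    }

    public static void main(String[] args) {
        System.out.println((char) baseLetter(0x01A4));  // P
        System.out.println((char) baseLetter(0x00E9));  // e
    }
}
```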

