I have text that comes through as Latin-1 ("ISO-8859-1") and I want to simplify the encoding to ASCII ("US-ASCII"). ISO-8859-1 is an 8-bit encoding covering 256 code points (including accented letters for several languages), while US-ASCII is a 7-bit encoding covering only the first 128 of those, primarily English text.
I found one way to do it using the String constructor, namely:
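The exact snippet is not reproduced here. As a minimal sketch of that kind of conversion, assuming the text is already held in a Java `String`, an encode-and-decode round trip through US-ASCII behaves as described. The class and method names are made up for illustration and are not the original code.

```java
import java.nio.charset.StandardCharsets;

public class AsciiRoundTrip {

    // Hypothetical helper: encode to US-ASCII bytes, then rebuild a String
    // from those bytes with the String constructor.
    static String toAsciiLossy(String text) {
        // getBytes(Charset) substitutes the charset's replacement byte
        // for every character US-ASCII cannot represent; for US-ASCII that is '?'.
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(asciiBytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café" written with a Unicode escape.
        System.out.println(toAsciiLossy("caf\u00e9")); // prints "caf?"
    }
}
```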
The process converts characters 128-255 to the "?" character. Is there a smarter conversion available in the Java APIs? For example, it would be useful if the various accented "e" vowels (upper and lower case) were converted to their unaccented counterparts rather than to "?". Is there anything like that available in Java?
Alternatively, I could write a parser that reads the characters one at a time and uses a look-up table for each letter, but that seems like re-inventing the wheel to me, so I thought I'd check here first.
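For comparison, the look-up table idea could be sketched roughly as follows. This is only an illustration: the class name, the method name, and the handful of table entries are assumptions, and a real table would need an entry for every Latin-1 character you care about.

```java
import java.util.HashMap;
import java.util.Map;

public class LookupTransliterator {

    // Hypothetical, deliberately incomplete table: only the accented "e"
    // variants and the German sharp s are mapped here.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00e8', "e"); // e with grave
        TABLE.put('\u00e9', "e"); // e with acute
        TABLE.put('\u00ea', "e"); // e with circumflex
        TABLE.put('\u00eb', "e"); // e with diaeresis
        TABLE.put('\u00c8', "E");
        TABLE.put('\u00c9', "E");
        TABLE.put('\u00ca', "E");
        TABLE.put('\u00cb', "E");
        TABLE.put('\u00df', "ss"); // sharp s expands to two letters
    }

    static String transliterate(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c); // already ASCII, keep as-is
            } else {
                // Unmapped characters fall back to "?"
                out.append(TABLE.getOrDefault(c, "?"));
            }
        }
        return out.toString();
    }
}
```

One thing a table allows that a plain accent-stripping pass does not is multi-letter replacements, such as "ss" for the sharp s.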
You might want to read this previous discussion. Short version: try java.text.Normalizer. Obligatory disclaimer: I haven't used it myself. Nor do I know anyone who has. As far as I know. But it seems like it's designed for what you're doing, so I say give it a try. And then please tell us if it works. Good luck...
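In case it helps, here is a minimal, hedged sketch of the `java.text.Normalizer` approach. The idea is to decompose each accented letter into a base letter plus combining marks (form NFD), then delete the marks. The class and method names are invented for the example. Characters with no decomposition, such as the sharp s or the slashed o, are not simplified by this and still end up as "?" in the final step.

```java
import java.text.Normalizer;

public class AccentStripper {

    // Decompose accented letters (NFD) and drop the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Strip accents first, then replace anything still outside ASCII with "?".
    static String toAscii(String text) {
        return stripAccents(text).replaceAll("[^\\p{ASCII}]", "?");
    }

    public static void main(String[] args) {
        // "\u00c9t\u00e9" is "Été" written with Unicode escapes.
        System.out.println(toAscii("\u00c9t\u00e9"));   // prints "Ete"
        // "Stra\u00dfe" is "Straße"; the sharp s has no decomposition.
        System.out.println(toAscii("Stra\u00dfe"));     // prints "Stra?e"
    }
}
```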