• Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

Smarter Charset Conversion

 
Scott Selikoff
author
Saloon Keeper
Posts: 4020
18
Eclipse IDE Flex Google Web Toolkit
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have text that comes through as Latin ("ISO-8859-1") and I want to simplify the encoding to ASCII ("US-ASCII"). Both are 8-bit encoding, the first supports all 256 characters (including accented letters for multiple languages) while the latter supports the first 128 characters, primarily English text.

I found one way to do it using the String constructor, namely:



The process converts characters 128-255 to the "?" character. Is there a smarter conversion available in the Java APIs? For example, it would be useful if the multiple "e" vowels with accents (upper and lower case) were converted to their non-accented counter-parts, rather than a "?". Is there anything like that available in Java?

Alternatively, I could write a parser that reads the characters one at a time and uses a look-up table for each letter, but it seems like re-inventing the wheel to me, so I thought I'd checked here first.
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You might want to read this previous discussion. Short version: try java.text.Normalizer. Obligatory disclaimer: I haven't used it myself. Nor do I know anyone who has. As far as I know. But it seems like it's designed for what you're doing, so I say give it a try. And then please tell us if it works. Good luck...
 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic