
Convert string?

 
Meyer Florian
Ranch Hand
Posts: 62
Hello

Using the Character class, I am able to change letters to upper case and determine if they are letters, digits or special characters.

Unfortunately, accented characters, together with their uppercase forms (and many more), are recognized as letters. Is there a way in Java to convert an accented A to a plain A, an accented E to a plain E, and so on?

Müller must result in MULLER and not MLLER or MUELLER...

Thanks for any help!
Florian
 
Grant Gainey
Ranch Hand
Posts: 65
Originally posted by Meyer Florian:

Unfortunately, accented characters, together with their uppercase forms (and many more), are recognized as letters. Is there a way in Java to convert an accented A to a plain A, an accented E to a plain E, and so on?

Müller must result in MULLER and not MLLER or MUELLER...

Ummm...why would you want to do that? If someone's name is Müller, they're going to be unhappy if it's shown as MULLER - that's not their name. All those characters up there aren't just "aeiou with funny marks" - they're different characters, just as if they were x's and z's.

The only reason I can think of for doing what you're attempting is to store the names as 7-bit ASCII, which is a really US-centric view of data.

At any rate - assuming you're really stuck on this path, the only thing I can think of would be to have a mapping table somewhere of "Weird non-US characters that Those Durn Furriners shouldn't be using" to "The Five Vowels The Computer Gods Intended".

But be prepared for your users to complain bitterly about you changing their names...

Good luck,
Grant
 
Meyer Florian
Ranch Hand
Posts: 62
The names will not be changed. In our databases, the names will be stored as "Müller" and - for internal search and sort purposes - also stored as "MULLER". We can't change this situation because there's a legacy system that must still work with these names.
 
Paul Clapham
Sheriff
Posts: 21416
The "decomposition" part of this Unicode report should get you started.
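
One way to apply the decomposition idea is sketched below. It assumes Java 6 or later, where java.text.Normalizer is available, and it is only an illustration: the class and method names are made up here and do not come from the thread. The name is first put into decomposed form (NFD), so that a precomposed letter becomes a base letter followed by a combining mark, and the combining marks are then removed.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not from the thread: builds the upper-case,
// accent-free search key that the legacy system expects.
final class AccentStripper {

    static String toSearchKey(String name) {
        // NFD splits precomposed letters: U+00FC (u with diaeresis)
        // becomes 'u' followed by the combining mark U+0308.
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // \p{M} matches combining marks, so only base characters remain.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Locale.ROOT avoids locale-specific case rules (e.g. Turkish i).
        return stripped.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "M\u00FCller" is Mueller's name spelled with u-diaeresis.
        System.out.println(toSearchKey("M\u00FCller")); // prints MULLER
    }
}
```

Letters that have no canonical decomposition (for example o with stroke, or sharp s) pass through this step unchanged apart from upper-casing, so a small hand-written table may still be needed for them.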
 
Grant Gainey
Ranch Hand
Posts: 65
Originally posted by Paul Clapham:
The "decomposition" part of this Unicode report should get you started.

Now that is very cool - will need to read in detail tonight.

Meyer - ahh, I understand the requirement now. Apologies if I sounded snippy - I've seen too many systems implemented where the designer was trying to "get rid of all these stupid marks", because they had no concept of anything other than ASCII.

Grant
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
You may also be interested in java.text.Collator and related classes (like java.text.CollationKey). I've never really gotten around to using them, but they're apparently designed with this sort of thing in mind.
[ April 21, 2006: Message edited by: Jim Yingst ]
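
For reference, a minimal sketch of the Collator approach follows. It assumes an English-locale collator is acceptable for the data; the class and method names are hypothetical. At PRIMARY strength a Collator treats accent and case differences as insignificant, which helps with comparing and sorting but does not produce a stripped string for a legacy system.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not from the thread.
final class AccentInsensitiveSort {

    static Collator primaryCollator(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // PRIMARY strength: only base letters count; accents and case are ignored.
        collator.setStrength(Collator.PRIMARY);
        // Treat precomposed and decomposed spellings of a letter alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    public static void main(String[] args) {
        Collator collator = primaryCollator(Locale.ENGLISH);
        // "M\u00FCller" is the name spelled with u-diaeresis.
        System.out.println(collator.compare("M\u00FCller", "MULLER") == 0);

        String[] names = { "Muller", "Meyer", "M\u00FCller", "Mahler" };
        // Collator implements Comparator, so it can drive the sort directly.
        Arrays.sort(names, collator);
        System.out.println(Arrays.toString(names));
    }
}
```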
 
Alan Moore
Ranch Hand
Posts: 262
Originally posted by Jim Yingst:
You may also be interested in java.text.Collator and related classes (like java.text.CollationKey). I've never really gotten around to using them, but they're apparently designed with this sort of thing in mind.

You can tell a Collator to ignore accents when sorting, but it sounds like the OP needs to strip the accents so he can feed the names to a legacy system. The CollationElementIterator class could be of some use, but it would still leave you a lot of hand coding to do (I know this because I've just spent several hours fighting with it myself). I think you're better off doing as Grant said and writing up your own mapping table. If you're just converting accented letters to their unaccented equivalents, a simple switch block would do it.

But what if you receive the name as "Mueller"? Are you supposed to drop the 'e'?
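
The switch-block mapping table could look roughly like the sketch below. The set of characters is an assumption: it covers only a hand-picked handful of Latin-1 vowels (grave, acute, circumflex and diaeresis forms), since the exact characters listed in the original post are not recoverable, and a real table would have to be extended for the data at hand. The class and method names are hypothetical.

```java
// Hypothetical hand-written mapping table, not from the thread.
final class AccentTable {

    // Maps one character to its upper-case, unaccented form.
    static char fold(char c) {
        switch (c) {
            case '\u00E0': case '\u00E1': case '\u00E2': case '\u00E4': // accented a
            case '\u00C0': case '\u00C1': case '\u00C2': case '\u00C4': // accented A
                return 'A';
            case '\u00E8': case '\u00E9': case '\u00EA': case '\u00EB': // accented e
            case '\u00C8': case '\u00C9': case '\u00CA': case '\u00CB': // accented E
                return 'E';
            case '\u00EC': case '\u00ED': case '\u00EE': case '\u00EF': // accented i
            case '\u00CC': case '\u00CD': case '\u00CE': case '\u00CF': // accented I
                return 'I';
            case '\u00F2': case '\u00F3': case '\u00F4': case '\u00F6': // accented o
            case '\u00D2': case '\u00D3': case '\u00D4': case '\u00D6': // accented O
                return 'O';
            case '\u00F9': case '\u00FA': case '\u00FB': case '\u00FC': // accented u
            case '\u00D9': case '\u00DA': case '\u00DB': case '\u00DC': // accented U
                return 'U';
            default:
                // Anything not in the table is only upper-cased.
                return Character.toUpperCase(c);
        }
    }

    static String toSearchKey(String name) {
        StringBuilder key = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            key.append(fold(name.charAt(i)));
        }
        return key.toString();
    }
}
```

A per-character table like this has no way of knowing that "Mueller" might be a transliteration of the same name; it simply yields MUELLER, so that question has to be settled by the requirements rather than by the code.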
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I agree. It's unfortunate that there's no getCanonicalForm() or getSimplifiedForm() on Collator or CollationKey, to return the simplest string that's considered equivalent by a given Collator. Seems like they have all the necessary tables and info buried within the class, but decline to expose it in a form that would be useful to legacy systems. Hmff. At least Collator may be useful for testing. Not that it would necessarily be more correct than a hand-customized table, but comparing the results of a Collator-based sort with other techniques could well be useful in identifying anomalies that might otherwise be difficult to detect.
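
The testing idea could be sketched as follows, assuming Java 8 or later for the Function parameter. Everything here is hypothetical naming: keyFn stands for whatever stripping routine is under test (a hand-made table, for instance). Each pair of names is checked two ways, and any pair where a PRIMARY-strength Collator and the stripped keys disagree about equality is reported for manual review.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical test helper, not from the thread.
final class KeyAnomalyCheck {

    // Returns "a / b" for every pair where the collator and the key
    // function disagree on whether the two names are equivalent.
    static List<String> findDisagreements(List<String> names,
                                          Function<String, String> keyFn,
                                          Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // ignore accents and case
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        List<String> disagreements = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                String a = names.get(i);
                String b = names.get(j);
                boolean collatorEqual = collator.compare(a, b) == 0;
                boolean keyEqual = keyFn.apply(a).equals(keyFn.apply(b));
                if (collatorEqual != keyEqual) {
                    disagreements.add(a + " / " + b);
                }
            }
        }
        return disagreements;
    }
}
```

A reported pair is not necessarily a bug in the table; it only marks a spot where the two techniques differ and a human should decide which result the legacy system needs.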
 