Need alternative to .toUpperCase(), messes up some characters

 
Ryan Stille
Greenhorn
Posts: 10
I need to uppercase my part numbers before sending them to SAP. We were using .toUpperCase(), but recently ran into a part that contained a μ (Greek mu, used as the micro sign) character. The .toUpperCase() method turns this into 'M'! I know this is kind of technically correct, but I doubt it's ever what the developer wants.

Anyway, does anyone have a good alternative idea? In Perl I would just say "$foo =~ tr/a-z/A-Z/". Is there anything like that in Java?

I'm thinking about just looping over the string and upper-casing each letter as long as it's in the normal ASCII range.

Thanks.
 
Ryan Stille
Greenhorn
Posts: 10
Turns out looping over the string works pretty well. I get the ASCII code of each character, and if it's between 97 and 122 I toUpperCase() it. Benchmarking shows it takes 0ms even with a 50-character string.

I'm not sure this will work perfectly, though. If someone pastes in a 'd', for example, could that 'd' be Unicode and end up not matching my 97 to 122 check? From looking at the Unicode/ASCII charts, it looks like there is no way 'd' can be represented with a higher number, so it should work OK?
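
For readers following along, here is a minimal sketch of the loop described in this post. The class and method names (AsciiUpper, toUpperAscii) are made up for illustration and are not the poster's actual code. It works on a char array and only touches code units 97 to 122, so anything else, including μ, passes through unchanged.

```java
final class AsciiUpper {
    private AsciiUpper() {}

    /** Upper-cases only 'a'..'z' (code units 97..122); every other char is left untouched. */
    static String toUpperAscii(String input) {
        char[] chars = input.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= 'a' && c <= 'z') {
                // 'a' - 'A' is 32, the fixed distance between the two ASCII ranges.
                chars[i] = (char) (c - ('a' - 'A'));
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // U+03BC is Greek small letter mu; it is outside 97..122, so it stays as-is.
        System.out.println(toUpperAscii("ab-10\u03BCf"));
    }
}
```

Because Java strings are UTF-16, a pasted 'd' is always the code unit 100, and surrogate halves of rarer characters never fall inside the 97 to 122 range, so the check cannot misfire on them.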
 
Greg Charles
Sheriff
Posts: 3064
Yes, that should work fine. String.toUpperCase() makes letters from all languages uppercase. (M is an uppercase mu in Greek.) There's a version of toUpperCase() that takes a Locale, but I just took a look at the source code, and it seems like the only thing the Locale is used for is to handle a special case for Turkish. The looping should be nice and speedy, and I can pretty much guarantee that's what Perl is doing under the covers anyway. You might want to benchmark 10,000 50-character strings. Getting sub-millisecond times for a single String is no great shakes!

By the way, I assume you build up a character array in your loop and create a String from that. You want to be careful that you aren't building up a bunch of intermediate Strings.
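
To make the two behaviours mentioned in this post concrete, here is a small sketch. The class and helper names are hypothetical; the calls themselves (String.toUpperCase(Locale), Locale.ROOT, Locale.forLanguageTag) are standard JDK API. It shows that mu is mapped to the Greek capital (U+039C), which only looks like a Latin M, and that a Turkish locale changes how 'i' is upper-cased.

```java
import java.util.Locale;

final class UpperCaseLocaleDemo {
    private UpperCaseLocaleDemo() {}

    /** Locale-neutral upper-casing. */
    static String upperRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    /** Upper-casing under Turkish rules, where 'i' maps to dotted capital I (U+0130). */
    static String upperTurkish(String s) {
        return s.toUpperCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        String upperMu = upperRoot("\u03BC"); // Greek small letter mu
        // Prints the code point in hex: Greek capital mu, not Latin capital M (U+004D).
        System.out.println(Integer.toHexString(upperMu.charAt(0)));

        // Same input, two locales, two different results.
        System.out.println(Integer.toHexString(upperRoot("i").charAt(0)));
        System.out.println(Integer.toHexString(upperTurkish("i").charAt(0)));
    }
}
```

If locale-dependent surprises are a concern, passing an explicit Locale such as Locale.ROOT is one way to keep results the same on every machine.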
 
Hauke Ingmar Schmidt
Rancher
Posts: 436
This will only work for the simplest use case; ASCII and the letter range a-z are not even sufficient to write all English words. Please see this article about encodings. In short: relying on ASCII only is a no-go.

If you have certain characters that should be treated specially, like not turning a Greek lowercase m (µ) into the uppercase version (Μ - that is a Greek capital M even if it looks like a Latin capital M! It has a different Unicode code point.), then you have to define your exceptions and e.g. split the string around them. The one use case I can think of is mathematical terms, which should not be touched by toUpperCase. They should be "marked" or masked in your source string (e.g. by XML tags) so you can split them out and spare them.

0 ms is quite fast, even for a string as short as 50 characters. This is a sign of a faulty benchmark. The method per se may be fast enough for your needs though.
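
A rough sketch of a less misleading measurement, assuming the ASCII-only loop discussed earlier in the thread: time many conversions with System.nanoTime() instead of one conversion with millisecond resolution. All names here are hypothetical, and this is still only a crude timing loop; a harness such as JMH would be the more careful choice.

```java
final class UpperCaseTiming {
    private UpperCaseTiming() {}

    /** ASCII-only upper-casing, as discussed in the thread. */
    static String toUpperAscii(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 'a' && chars[i] <= 'z') {
                chars[i] = (char) (chars[i] - 32);
            }
        }
        return new String(chars);
    }

    /** Returns the elapsed nanoseconds for converting a 50-character string `count` times. */
    static long timeConversions(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5; i++) {
            sb.append("abcde-1234"); // 10 characters per chunk
        }
        String sample = sb.toString();

        long checksum = 0; // consume the results so the work cannot be optimized away
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            checksum += toUpperAscii(sample).length();
        }
        long elapsed = System.nanoTime() - start;
        if (checksum != 50L * count) {
            throw new IllegalStateException("unexpected result length");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        timeConversions(100_000); // warm-up so the JIT has compiled the loop
        long nanos = timeConversions(10_000);
        System.out.println("10,000 conversions took " + nanos + " ns");
    }
}
```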
 
Ryan Stille
Greenhorn
Posts: 10
I have read that article before, but it was a good refresher to read again anyway, thanks.

Even if the user pastes in some UTF-8 text, I'm pretty sure my code that returns the code for the character will return 99 (hex 63) for 'c', not 'U+0063'. Even if I am misunderstanding how that works, in the end I'm only talking about doing this to part numbers, so I think this will be pretty safe.
 
Greg Charles
Sheriff
Posts: 3064
In theory, a user could paste a lower-case Omicron into your text thinking it's the same as a lower-case O, and then you'd miss it in your conversion. I think Hauke was more concerned about non-ASCII characters that you might still want to capitalize ... like c with a cedilla, n with a tilde, and a with an acute accent, for example. He may be right that excluding exceptions is the way to go. I suppose the μ in your part number stands for micro. What other special characters are there, which are also letters? If there are just a few, you could skip them while looping through the string and calling Character.toUpperCase().
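
The "skip a few exceptions" idea could look something like the sketch below. The exception list (the micro sign U+00B5 and Greek small mu U+03BC) is an assumption for illustration, as are the class and method names; a real list would come from looking at actual part numbers.

```java
final class UpperWithExceptions {
    private UpperWithExceptions() {}

    // Hypothetical exception list: micro sign (U+00B5) and Greek small letter mu (U+03BC).
    private static final String KEEP_AS_IS = "\u00B5\u03BC";

    /** Upper-cases every char with Character.toUpperCase(), except those in KEEP_AS_IS. */
    static String toUpperExcept(String input) {
        char[] chars = input.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (KEEP_AS_IS.indexOf(chars[i]) < 0) {
                chars[i] = Character.toUpperCase(chars[i]);
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // n with tilde (U+00F1) is capitalized; mu is left alone.
        System.out.println(toUpperExcept("a\u00F1\u03BCb"));
    }
}
```

Note that this works char by char, so it ignores characters outside the Basic Multilingual Plane and mappings that change length; for part numbers that may well be acceptable.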
 
Ryan Stille
Greenhorn
Posts: 10
Yes, that would be another way to go about it. Certainly, if I run into issues this way, I could compile an exceptions list and capitalize everything except what's in the list. But I think my users will run into fewer issues if I do it this way (capitalize only a-z). After all, this is for part numbers, not a paragraph of text. If there is an n with a tilde in the part number, it probably needs to stay that way - NOT be capitalized. That is the issue I'm running into with this part number with a micro/Mu in it, anyway.
 
Hauke Ingmar Schmidt
Rancher
Posts: 436

Ryan Stille wrote:After all, this is for part numbers, not a paragraph of text.



Oh, part numbers, my bad, I didn't pay too much attention to that (no blush emoticon here?). Sure, for a part number you may need specific rules for changing case.

But I am a little concerned that part numbers allow arbitrary input. The important part here is not technical; it is letting users know why some characters change and others do not ("Maßgüldner" -> "MAßGüLDNER" instead of "MASSGÜLDNER").
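
The difference in that example can be reproduced with a short sketch. The class and method names are hypothetical; the point is that an ASCII-only loop leaves the sharp s and the umlaut alone, while String.toUpperCase() expands the sharp s to "SS" and therefore changes the string length.

```java
import java.util.Locale;

final class SharpSDemo {
    private SharpSDemo() {}

    /** Upper-cases only the ASCII letters a-z. */
    static String asciiOnly(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 'a' && chars[i] <= 'z') {
                chars[i] = (char) (chars[i] - 32);
            }
        }
        return new String(chars);
    }

    /** Full Unicode upper-casing with locale-neutral rules. */
    static String full(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // U+00DF is the sharp s, U+00FC is u with diaeresis.
        String name = "Ma\u00DFg\u00FCldner";
        System.out.println(asciiOnly(name) + " (" + asciiOnly(name).length() + " chars)");
        System.out.println(full(name) + " (" + full(name).length() + " chars)");
    }
}
```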

Ryan Stille wrote:Even if the user pastes in some UTF-8 text, I'm pretty sure my code that returns the code for the character will return 99 (hex 63) for 'c', not 'U+0063'.



Sure, you get the numeric value, not the literal. And for the basic Latin letters the codes are the same in ASCII as in Unicode and most other common encodings.
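
A quick way to check the numeric values in question, as a sketch with made-up names: 'c' is 99 in decimal, which is 63 in hexadecimal (hence U+0063), and a look-alike such as Greek omicron has a much larger value that a 97 to 122 check will not match.

```java
final class CodeValues {
    private CodeValues() {}

    /** True only for the ASCII lower-case letters, i.e. code units 97..122. */
    static boolean isAsciiLower(char c) {
        return c >= 97 && c <= 122;
    }

    public static void main(String[] args) {
        System.out.println((int) 'c');                 // prints 99
        System.out.println(Integer.toHexString('c'));  // prints 63

        char omicron = '\u03BF'; // Greek small letter omicron, looks like Latin 'o'
        System.out.println((int) omicron);             // prints 959
        System.out.println(isAsciiLower('o'));         // prints true
        System.out.println(isAsciiLower(omicron));     // prints false
    }
}
```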
 
Mike Simmons
Rancher
Posts: 5087
Hmmm, my feeling is that whoever decided it was OK to put a "μ" into a field called "part number" needs to be beaten with a tire iron. Actually I'm tempted to apply that to anyone who puts non-numeric characters into a field called a "number" of some sort - but that's far too common, and I would promptly be arrested for homicide at my current job. Might be worth it, though.

Perhaps the most useful thing for you to do here would be to analyze the "part numbers" you actually have. Write a program to look at all the "part numbers" you can find, from wherever your input comes from. Have the program report all instances of non-US-ASCII values that it finds. Then you and/or your users look at those examples and figure out how those exceptions need to be handled. (Do not disregard the tire iron approach here; it may yet apply.) Which is more common: using toUpperCase(), or not? Either way, you will want to create a general policy (use toUpperCase(), or don't) and then create a list of exceptions to that policy.
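
A sketch of such an analysis program, under the assumption that the part numbers are already available as a list of strings (where they actually come from is not stated in the thread). The class name, method name, and report format are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class NonAsciiReport {
    private NonAsciiReport() {}

    /** Returns one report line per non-US-ASCII code point found in the given part numbers. */
    static List<String> findNonAscii(List<String> partNumbers) {
        List<String> findings = new ArrayList<>();
        for (String part : partNumbers) {
            for (int i = 0; i < part.length(); ) {
                int cp = part.codePointAt(i);
                if (cp > 0x7F) { // anything above 127 is outside US-ASCII
                    findings.add(String.format(Locale.ROOT, "%s: U+%04X at index %d", part, cp, i));
                }
                i += Character.charCount(cp); // step over surrogate pairs correctly
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Made-up sample data; real input would be the actual part numbers.
        List<String> sample = Arrays.asList("ABC-123", "10\u03BCF", "XY-9");
        for (String line : findNonAscii(sample)) {
            System.out.println(line);
        }
    }
}
```

The resulting list is the raw material for the policy discussion: each reported character is a candidate for the exceptions list.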
 
Campbell Ritchie
Marshal
Posts: 80085
Fascinating discussion, particularly what Mike Simmons says, but I think too advanced for "beginning". Moving thread.
 
Greg Charles
Sheriff
Posts: 3064
I think Mike needs to write an article, "There are no silver bullets, but tire irons are plentiful". Maybe he already has!

I like his idea about cataloging the special characters in a broad sample of part "numbers" and presenting the findings to the users. Often developers need to drive requirements this way. Wasn't it Java Ranch's own Kathy Sierra who said not to just give the users what they ask for, but to give them what they actually want?
 
Pat Farrell
Rancher
Posts: 4804

Mike Simmons wrote: Actually I'm tempted to apply that to anyone who puts non-numeric characters into a field called a "number" of some sort - but that's far too common, and I would promptly be arrested for homicide at my current job.


I completely agree with @mike, and have held that belief for decades. But more than 30 years ago, I learned that the ISBN, International Standard Book Number has an X in it. Sigh.
 
Greg Charles
Sheriff
Posts: 3064

Pat Farrell wrote:But more than 30 years ago, I learned that the ISBN, International Standard Book Number has an X in it. Sigh.



Hey, X is a number in Roman numerals!

Actually, I'm not totally joking there. The last character of an ISBN is a check digit. They apply a formula to the other digits, take the result modulo 11, and represent 10 as X. (Prime moduli generally work better for check digits.) So, X is a base 11 number the same way CAFEBABE is a base 16 number.
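
For the curious, the 10-digit ISBN check character can be computed as in the sketch below (the newer 13-digit ISBNs use a different, modulo-10 scheme without X). The class and method names are made up; the weighting of 10 down to 2 and the modulo 11 are the standard ISBN-10 rule.

```java
final class Isbn10 {
    private Isbn10() {}

    /** Computes the check character for the first nine digits of an ISBN-10. */
    static char checkCharacter(String firstNineDigits) {
        if (firstNineDigits.length() != 9) {
            throw new IllegalArgumentException("expected exactly 9 digits");
        }
        int sum = 0;
        for (int i = 0; i < 9; i++) {
            char c = firstNineDigits.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a digit: " + c);
            }
            sum += (c - '0') * (10 - i); // weights run from 10 down to 2
        }
        // The check value makes the full weighted sum divisible by 11.
        int check = (11 - sum % 11) % 11;
        return check == 10 ? 'X' : (char) ('0' + check);
    }

    public static void main(String[] args) {
        System.out.println(checkCharacter("030640615")); // prints 2
        System.out.println(checkCharacter("080442957")); // prints X
    }
}
```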
 
Hauke Ingmar Schmidt
Rancher
Posts: 436
So we are nearly back on topic. a..z are not all letters, 0..9 not all digits (and not all symbols needed to write Java numeric literals).
 
Pat Farrell
Rancher
Posts: 4804

Hauke Ingmar Schmidt wrote: a..z are not all letters, 0..9 not all digits (and not all symbols needed to write Java numeric literals).



Yeah, I mean, what about A..Z?
Even American's sometimes use capital letters.
 
Greg Charles
Sheriff
Posts: 3064

Pat Farrell wrote:
Yeah, I mean, what about A..Z?
Even American's sometimes use capital letters.



Don't you mean letter's?

Americans don't need to capitalize capital letters, generally speaking of course. x2
 
Mike Simmons
Rancher
Posts: 5087

Greg Charles wrote:Americans don't need to capitalize capital letters, generally speaking of course. x2


Some Texans might, I suppose.
 
Campbell Ritchie
Marshal
Posts: 80085
I'm sure there's lots of capital to be made from this thread.
 
Hauke Ingmar Schmidt
Rancher
Posts: 436

Pat Farrell wrote:

Hauke Ingmar Schmidt wrote: a..z are not all letters, 0..9 not all digits (and not all symbols needed to write Java numeric literals).



Yeah, I mean, what about A..Z?



Hm... in my view "a" and "A" are the same letter, expressed by different symbols with different contextual semantics. Majuscule and minuscule versions are just different representations of the letter.

But I digress.
 
Campbell Ritchie
Marshal
Posts: 80085

Hauke Ingmar Schmidt wrote: . . . "a" and "A" are the same letter . . .

. . . until somebody calls you HAuke IngmAr Schmidt
 
Hauke Ingmar Schmidt
Rancher
Posts: 436

Campbell Ritchie wrote:

Hauke Ingmar Schmidt wrote: . . . "a" and "A" are the same letter . . .

. . . until somebody calls you HAuke IngmAr Schmidt



 