Can you compare strings from two different character sets?
Carey Brown wrote:Can you compare strings from two different character sets?
You already figured out what was wrong and it wasn't a problem that had anything to do with character encodings, but I'd like to explain a little bit more about this.
Java internally stores characters using the UTF-16 character encoding: a Java char is a 16-bit value containing a UTF-16 code unit.
When you read text from a source, for example a file, and you specify the correct encoding of the source, then the characters will be translated from the source encoding to UTF-16. Nothing will be lost in that conversion; all Unicode characters can be encoded using UTF-16, so no matter what the source encoding is, the characters will be translated completely to Java UTF-16 characters. (Note that some characters require more than 16 bits in UTF-16; in that case, two chars, called a surrogate pair, are used to represent that one character.)
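To make the point about two chars per character concrete, here is a small sketch. The class name SurrogateDemo and the choice of U+1F600 as the sample character are only for illustration.

```java
class SurrogateDemo {
    // U+1F600 lies outside the 16-bit range, so UTF-16 needs two chars for it.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    // Number of 16-bit chars (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(GRINNING_FACE));      // 2
        System.out.println(codePointCount(GRINNING_FACE)); // 1
    }
}
```

So length() counts chars, not characters; use the code point methods on String when the difference matters.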
If you read two strings from two different files, that contain the same characters but with a different encoding, then you will end up with two Java strings that are exactly the same.
So yes, you can compare strings that came from sources with different character encodings.
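A minimal sketch of that idea, using in-memory byte arrays in place of two files. The class and method names are made up for the example, and the sample text is assumed to be representable in both encodings.

```java
import java.nio.charset.StandardCharsets;

class EncodingCompare {
    // Encodes the text in two different encodings, decodes each with the
    // matching charset, and reports whether the resulting strings are equal.
    static boolean sameAfterDecoding(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);

        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        String fromLatin1 = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return fromUtf8.equals(fromLatin1);
    }

    // Size in bytes of the UTF-8 form, to show the raw bytes really differ.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        System.out.println(utf8Length(text));        // 5, versus 4 bytes in ISO-8859-1
        System.out.println(sameAfterDecoding(text)); // true
    }
}
```

The bytes on disk differ, but once each source is decoded with its own correct charset, equals() sees identical strings.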
Many developers are confused about character encodings. You always have to be careful that you know exactly what you are doing when reading or writing text from or to a file or other source or destination. One thing to especially watch out for in Java is that a number of methods from the standard library implicitly use the default character encoding of the system; this can lead to unexpected differences when running on Windows vs Linux, for example, which have different default character encodings. Always specify the encoding explicitly to avoid such problems.
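As a rough illustration of specifying the encoding explicitly, the sketch below writes and reads a temporary file with an explicit charset. The class name, the temp file, and the choice of UTF-8 are assumptions for the example; use whatever encoding your data actually has.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class ExplicitCharset {
    // Writes the lines to a temp file and reads them back, naming the
    // charset on both sides instead of relying on the platform default.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("names", ".txt");
            try {
                Files.write(file, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Calls such as new String(bytes) or text.getBytes() without a
        // charset argument fall back to the default charset instead.
        System.out.println(roundTrip(List.of("Jos\u00e9", "Zo\u00eb")));
    }
}
```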
Jesper de Jong wrote:You already figured out what was wrong and it wasn't a problem that had anything to do with character encodings, but I'd like to explain a little bit more about this...
Thanks for the info, I was wondering how that worked. I did find one special character that I had to handle as a special case because it didn't translate properly: the ellipsis.
And thanks, Campbell, for the links; I will be looking at them.
Are the Excel'isms documented? I had
Firstname "Nickname" Lastname
which Excel output as
"Firstname ""Nickname"" Lastname"
That was the thing that tripped me up: as far as Excel was concerned the two were identical, but obviously my code did not see them that way. I had to write a little helper method to deal with the quotes.
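The helper itself is not shown in the thread; a minimal sketch of what such a method might look like follows. The names CsvQuotes and unquote are hypothetical, and it assumes the convention seen above (also described in RFC 4180): a field wrapped in double quotes, with each embedded quote doubled.

```java
class CsvQuotes {
    // Hypothetical helper: strips the surrounding quotes from a quoted CSV
    // field and collapses each doubled quote back into a single quote.
    // Fields that are not wrapped in quotes are returned unchanged.
    static String unquote(String field) {
        boolean quoted = field.length() >= 2
                && field.startsWith("\"")
                && field.endsWith("\"");
        if (!quoted) {
            return field;
        }
        String inner = field.substring(1, field.length() - 1);
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        String fromExcel = "\"Firstname \"\"Nickname\"\" Lastname\"";
        System.out.println(unquote(fromExcel)); // Firstname "Nickname" Lastname
    }
}
```

This only handles a single, already-split field. Splitting a whole line on commas is a separate problem, because quoted fields may themselves contain commas.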
Jesper de Jong wrote:If you read two strings from two different files, that contain the same characters but with a different encoding, then you will end up with two Java strings that are exactly the same.
This is misleading. When are two characters the same? The grapheme cluster À can be represented using a single code point (U+00C0), or using a code point for the letter (U+0041) followed by a code point for the combining accent (U+0300). When it's represented one way in one file, and another way in another file, then they will not be the same according to Java's internal UTF-16 representation.
To decompose, order and compare characters correctly, you should probably use the Collator class. Here's an example:
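The original example did not survive in this copy of the thread, so what follows is only a sketch of the idea, not the poster's code. The class and method names are made up. It compares the two forms of À using String.equals, using a Collator with canonical decomposition switched on, and using java.text.Normalizer as an alternative.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

class GraphemeCompare {
    static final String PRECOMPOSED = "\u00C0"; // À as one code point
    static final String DECOMPOSED = "A\u0300"; // A + combining grave accent

    // Compares with a Collator that decomposes canonical variants first.
    static boolean equalByCollator(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    // Alternative: bring both strings to the same normal form, then equals().
    static boolean equalAfterNormalizing(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));                 // false
        System.out.println(equalByCollator(PRECOMPOSED, DECOMPOSED));       // true
        System.out.println(equalAfterNormalizing(PRECOMPOSED, DECOMPOSED)); // true
    }
}
```

A Collator is locale-sensitive and also suited to sorting, while normalizing is enough if you only need an equality check that ignores the composed/decomposed difference.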