
Character set and comparing characters

 
Carey Brown
Saloon Keeper
Posts: 3309
46
I have two files that I'm reading into my program, one encoded as UTF-8 and the other as ISO-8859-1, depending on the original source. It works for 99.9% of my data, but in a few instances String.compareTo() between a string from each of the two files returns a non-zero result even though the two strings look identical.



Can you compare strings from two different character sets?
 
Carey Brown
Saloon Keeper
Posts: 3309
46
I figured it out. I was looking at the results from a tab-delimited file in Excel. In my raw data one string had double quotes around it and the other did not. Excel happily dropped the quotes, so the two strings looked the same on screen even though the data comparison showed that they were different.
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16057
88
Carey Brown wrote:Can you compare strings from two different character sets?

You already figured out what was wrong and it wasn't a problem that had anything to do with character encodings, but I'd like to explain a little bit more about this.

Java internally represents text using the UTF-16 character encoding: a Java char is a 16-bit value holding one UTF-16 code unit.

When you read text from a source, for example a file, and you specify the correct encoding of the source, then the characters will be translated from the source encoding to UTF-16. Nothing will be lost in that conversion; all Unicode characters can be encoded using UTF-16, so no matter what the source encoding is, the characters will be translated completely to Java UTF-16 characters. (Note that some characters require more than 16 bits in UTF-16 - in that case, two chars will be used to represent that one character).
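The note about characters that need more than 16 bits can be checked directly. This is a minimal sketch; the class and method names are made up for illustration, and the G clef (U+1D11E) is just one convenient character outside the 16-bit range:

```java
class SurrogatePairDemo {
    /** Number of UTF-16 code units, i.e. Java chars, in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) fits in a single char.
        String eAcute = "\u00E9";
        // U+1D11E (musical symbol G clef) needs two chars, a surrogate pair.
        String gClef = new String(Character.toChars(0x1D11E));

        System.out.println(charCount(eAcute) + " char, " + codePointCount(eAcute) + " code point");
        System.out.println(charCount(gClef) + " chars, " + codePointCount(gClef) + " code point");
    }
}
```

So String.length() counts chars, not characters as a reader would count them.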

If you read two strings from two different files, that contain the same characters but with a different encoding, then you will end up with two Java strings that are exactly the same.
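As a small illustration of that point, the sketch below decodes the word "café" from its UTF-8 bytes and from its ISO-8859-1 bytes and compares the results. The class and method names are invented for the example:

```java
import java.nio.charset.StandardCharsets;

class SameTextTwoEncodingsDemo {
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café": in UTF-8 the last letter takes two bytes, in ISO-8859-1 one byte.
        byte[] utf8 = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};

        String a = decodeUtf8(utf8);
        String b = decodeLatin1(latin1);
        System.out.println(a.compareTo(b) == 0);   // same text once decoded correctly

        // Decoding with the wrong charset is what produces mismatches.
        String wrong = decodeLatin1(utf8);
        System.out.println(a.equals(wrong));
    }
}
```

The byte arrays differ, but the decoded strings are equal as long as each source is read with its own encoding.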

So yes, you can compare strings that came from sources with different character encodings.

Many developers are confused about character encodings. You always have to be careful that you know exactly what you are doing when reading or writing text from or to a file or other source or destination. One thing to especially watch out for in Java is that a number of methods from the standard library implicitly use the default character encoding of the system; this can lead to unexpected differences when running on Windows vs Linux, for example, which have different default character encodings. Always specify the encoding explicitly to avoid such problems.
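One way to follow that advice when reading files is to pass the charset to the reading method rather than rely on the platform default. The sketch below is only an illustration: the helper names and the temporary files are invented, and it writes its own sample data so that it runs on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class ExplicitCharsetDemo {
    /** Reads all lines, decoding with the charset the caller names explicitly. */
    static List<String> readLines(Path file, Charset charset) {
        try {
            return Files.readAllLines(file, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test fixture: writes the text to a temporary file in the given charset. */
    static Path writeTemp(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("names", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, text.getBytes(charset));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Ren\u00E9e\tM\u00FCller\n";
        Path utf8File = writeTemp(text, StandardCharsets.UTF_8);
        Path latin1File = writeTemp(text, StandardCharsets.ISO_8859_1);

        // Each file is decoded with its own encoding, never the system default.
        String a = readLines(utf8File, StandardCharsets.UTF_8).get(0);
        String b = readLines(latin1File, StandardCharsets.ISO_8859_1).get(0);
        System.out.println(a.compareTo(b) == 0);
    }
}
```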
 
Campbell Ritchie
Marshal
Posts: 56521
172
For further reading, there are links galore about encodings; here are a couple. 1: Joel Spolsky 2: David Zentgraf
 
Carey Brown
Saloon Keeper
Posts: 3309
46
Jesper de Jong wrote:You already figured out what was wrong and it wasn't a problem that had anything to do with character encodings, but I'd like to explain a little bit more about this...

Thanks for the info, I was wondering how that worked. I did find one special character that I had to handle as a special case because it didn't translate properly: the ellipsis.
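The thread does not say why the ellipsis failed to translate. One possible explanation, offered here only as an assumption, is that the file was really Windows-1252 rather than ISO-8859-1: the ellipsis (U+2026) does not exist in ISO-8859-1, while Windows-1252 stores it as the byte 0x85. The sketch below shows what each charset makes of that byte; the class and method names are invented, and it assumes the JDK provides the windows-1252 charset, as mainstream JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EllipsisDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x85};

        // ISO-8859-1 maps every byte to the code point with the same value,
        // so 0x85 becomes the control character U+0085.
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // Windows-1252 maps 0x85 to U+2026, the horizontal ellipsis.
        String asCp1252 = decode(raw, Charset.forName("windows-1252"));

        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0));
    }
}
```

If that is what happened, reading the file as windows-1252 would remove the need for a special case.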

And thanks Campbell for the links I will be looking at them.

Are the Excel'isms documented? I had
Firstname "Nickname" Lastname

which Excel output as
"Firstname ""Nickname"" Lastname"

That was the thing that tripped me up, because as far as Excel was concerned they were identical, but obviously my code disagreed. I had to write a little helper method to deal with the quotes.
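The helper itself was not posted. A minimal sketch of what such a method might look like, assuming the only transformation to undo is the one shown above (the whole field wrapped in double quotes and each embedded quote doubled), with a made-up name:

```java
class ExcelQuoteDemo {
    /**
     * Hypothetical helper: if the field is wrapped in double quotes,
     * strip the wrapper and turn each doubled quote back into a single one.
     * Fields that are not wrapped are returned unchanged.
     */
    static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            String inner = field.substring(1, field.length() - 1);
            return inner.replace("\"\"", "\"");
        }
        return field;
    }

    public static void main(String[] args) {
        String fromExcel = "\"Firstname \"\"Nickname\"\" Lastname\"";
        String original = "Firstname \"Nickname\" Lastname";
        System.out.println(unquote(fromExcel).equals(original));
    }
}
```

Applying unquote to the fields from the Excel export before calling compareTo() would make the two sources line up.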
 
Campbell Ritchie
Marshal
Posts: 56521
172
Carey Brown wrote:. . . And thanks Campbell for the links I will be looking at them.
That's a pleasure. One of the two links quotes the other.
Carey Brown wrote:. . . Are the Excel'isms documented? . . .
No idea. Sorry.
 
Dave Tolls
Ranch Foreman
Posts: 3056
37
What exactly are you trying to read?
Is it a CSV?
Or is it an Excel file?
Or are you comparing (in code) one against the other?
 
Carey Brown
Saloon Keeper
Posts: 3309
46
I was reading a tab delimited file "save-as" from Excel.
 
Dave Tolls
Ranch Foreman
Posts: 3056
37
Oh right.
At least then it's all text.

I've had to compare them before now.
That's when I first encountered the surname True.
 
Stephan van Hulst
Saloon Keeper
Posts: 7962
143
Jesper de Jong wrote:If you read two strings from two different files, that contain the same characters but with a different encoding, then you will end up with two Java strings that are exactly the same.

This is misleading. When are two characters the same? The grapheme cluster À can be represented using a single code point (U+00C0), or using a code point for the letter (U+0041) and a code point for the accent (U+0300, the combining grave accent). When it's represented one way in one file, and another way in another file, then they will not be the same according to Java's internal UTF-16 representation.
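The mismatch is easy to reproduce. The sketch below also shows one way to reconcile the two forms, java.text.Normalizer, which is not mentioned in the thread and is offered only as an option; the class and method names are invented.

```java
import java.text.Normalizer;

class NormalizeDemo {
    /** Compares two strings after converting both to composed form (NFC). */
    static boolean equalAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00C0";      // single code point
        String decomposed = "A\u0300";   // letter A plus combining grave accent

        // Both render as the same glyph, yet the char sequences differ.
        System.out.println(composed.equals(decomposed));
        System.out.println(composed.length() + " vs " + decomposed.length());

        // After normalization the two strings are equal.
        System.out.println(equalAfterNfc(composed, decomposed));
    }
}
```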

To decompose, order and compare characters correctly, you should probably use the Collator class. Here's an example:
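The code that followed in the original post is missing from this copy of the thread. What follows is a minimal sketch of the idea, not Stephan's own code: the class and method names are made up, and the choice of Locale.ROOT, canonical decomposition, and tertiary strength are assumptions.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorDemo {
    /** Builds a collator that decomposes characters before comparing them. */
    static Collator decomposingCollator() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        // TERTIARY still distinguishes accents and case.
        collator.setStrength(Collator.TERTIARY);
        return collator;
    }

    /** True if the collator considers the two strings the same text. */
    static boolean sameText(String a, String b) {
        return decomposingCollator().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String composed = "\u00C0";      // single code point
        String decomposed = "A\u0300";   // letter A plus combining grave accent

        System.out.println(composed.compareTo(decomposed) == 0);  // plain char comparison
        System.out.println(sameText(composed, decomposed));       // collator comparison
        System.out.println(sameText(composed, "A"));              // accent still matters
    }
}
```

A Collator is locale-sensitive, so when it is used for sorting, the locale should match the language of the data.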
 