Got word back from Sun about this. Max Habibi was right: there was a typo in the instructions and it should read "7-bit US ASCII" instead of "8-bit US ASCII".
posted October 11, 2002 10:20 AM
Two years ago this was considered a typo.
Originally posted by Marlene Miller:
The String API says "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."
All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII.
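For illustration, the field format described in the instructions could be read roughly as follows. This is only a sketch: the class name, the method name, and the sample record layout are hypothetical, and it assumes a modern JDK (Java 7 or later) for `StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;

final class FieldReader {
    // Decodes one fixed-width field, stopping at the first null byte (if any).
    // The offset and length come from the caller; they are not taken from any real schema.
    static String readField(byte[] record, int offset, int length) {
        int end = offset;
        while (end < offset + length && record[end] != 0) {
            end++;
        }
        return new String(record, offset, end - offset, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Hypothetical record: an 8-byte field padded with nulls, then a full 4-byte field.
        byte[] record = {'P', 'a', 'l', 'a', 'c', 'e', 0, 0, 'S', 'm', 'a', 'l'};
        System.out.println(readField(record, 0, 8)); // Palace
        System.out.println(readField(record, 8, 4)); // Smal
    }
}
```

A field that fills its whole width has no terminator, which is why the loop is bounded by the field length as well as by the null byte.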
Regards, Richard
Originally posted by Marlene Miller:
*If* the data in the file has the high-order bit set to 1, neither "US-ASCII" nor "UTF-8" can be relied on to convert characters correctly.
*If* the data in the file has the high-order bit set to 1, the encoding is not UTF-8, because UTF-8 would use 16 bits, not 8 bits.
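One way to check the point about bytes with the high-order bit set is to decode strictly and see which charsets reject the data. The sketch below assumes a modern JDK; the class and method names are made up for the example, and the sample byte 0xE9 is just one arbitrary 8-bit value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecode {
    // Returns true only if every byte sequence is valid in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "caf" followed by a single byte with the high-order bit set.
        byte[] data = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(decodesCleanly(data, StandardCharsets.US_ASCII));   // false
        System.out.println(decodesCleanly(data, StandardCharsets.UTF_8));      // false
        System.out.println(decodesCleanly(data, StandardCharsets.ISO_8859_1)); // true
    }
}
```

ISO-8859-1 is used here only as an example of a single-byte 8-bit charset; nothing in the thread says it is the encoding of the actual data file.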
Originally posted by Marlene Miller:
*If* I were going to handle 8-bit extended US ASCII, the user would set the name of the desired character set in the System properties file, because I don't know the name of the character set.
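A configurable charset along those lines might look like the sketch below. The property key `db.charset` and the class name are hypothetical, and falling back to US-ASCII when the key is absent is an assumption made for the example, not something the instructions require.

```java
import java.nio.charset.Charset;
import java.util.Properties;

final class ConfiguredCharset {
    // Hypothetical property key; any name agreed with the user would do.
    static final String KEY = "db.charset";

    // Looks up the charset name in the given properties, defaulting to US-ASCII.
    static Charset from(Properties props) {
        return Charset.forName(props.getProperty(KEY, "US-ASCII"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(from(props).name()); // US-ASCII
        props.setProperty(KEY, "ISO-8859-1");
        System.out.println(from(props).name()); // ISO-8859-1
    }
}
```

`Charset.forName` throws an unchecked exception for an unknown or illegal name, so a real application would probably want to report that to the user at startup.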
Michal:
>> According to Max's and Andrew's previous posts, which charset should we use: "US-ASCII" or "UTF-8"?
Marlene:
I don't know.
*If* the data in the file always has the high-order bit of a byte set to 0, either "US-ASCII" or "UTF-8" can be used with String to correctly convert between 8-bit bytes and 16-bit Unicode characters.
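The 7-bit case can be checked with a few lines of Java. This is a minimal sketch assuming a modern JDK; the class name, method names, and sample text are invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class AsciiSubset {
    // True if no byte has its high-order bit set (Java bytes are signed, so such bytes are negative).
    static boolean isSevenBit(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // True if US-ASCII and UTF-8 decode the bytes to the same String.
    static boolean sameInBoth(byte[] bytes) {
        String ascii = new String(bytes, StandardCharsets.US_ASCII);
        String utf8 = new String(bytes, StandardCharsets.UTF_8);
        return ascii.equals(utf8);
    }

    public static void main(String[] args) {
        byte[] data = "Hotel 42".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isSevenBit(data)); // true
        System.out.println(sameInBoth(data)); // true
    }
}
```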
Andrew's point about UTF-8 using all eight bits is convincing, isn't it?
Originally posted by Michal Charemza:
If it was considered a typo two years ago, does that mean it's a typo now?
Originally posted by Philippe Maquet:
For once, some point of Andrew that didn't convince me...
As UTF-8 encodes characters on one or more bytes depending on the value of the character to be encoded, using UTF-8 is just a good way of taking the risk of very easily corrupting the file:
Put any character - falling outside the range of characters UTF-8 encodes on one byte - in some String field value at full length (i.e. 10 characters if the field's length is 10).
Save it in the file.
Guaranteed result: you either corrupt the next field's value or the first field's value of the next record, but in any case you corrupt the file.
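The overflow described above is easy to reproduce. The sketch below assumes a hypothetical field width of 10 bytes and a made-up sample value; only the standard `String.getBytes(Charset)` call is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FieldOverflow {
    // Number of bytes the value occupies once encoded in the given charset.
    static int encodedLength(String value, Charset charset) {
        return value.getBytes(charset).length;
    }

    public static void main(String[] args) {
        int fieldLength = 10;                // hypothetical field width in bytes
        String value = "Caf\u00e9 Royal";    // 10 characters, one of them outside US-ASCII
        System.out.println(value.length());                                     // 10
        System.out.println(encodedLength(value, StandardCharsets.UTF_8));       // 11
        System.out.println(encodedLength(value, StandardCharsets.ISO_8859_1));  // 10
        System.out.println(encodedLength(value, StandardCharsets.UTF_8) > fieldLength); // true
    }
}
```

Checking the encoded byte length, rather than the character count, before writing a field is one way to avoid spilling into the next field.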
The Sun Certified Java Developer Exam with J2SE 5: paper version from Amazon, PDF from Apress, Online reference: Books 24x7 Personal blog
Originally posted by Andrew Monkhouse:
But UTF-8 will always be 8 bits or 1 byte.
Nope - UTF-8 is 8 bit only. If it was UTF-16 you would be correct: you could be using 1 or 2 bytes and run the risk of corrupting your data. But UTF-8 will always be 8 bits or 1 byte.
By the way: UTF-8 "uses all bits of an octet, but has the quality of preserving the full US-ASCII range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for an US-ASCII character, and nothing else."
UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of octets, where the number of octets, and the value of each, depend on the integer value assigned to the character in ISO/IEC 10646. This transformation format has the following characteristics (all values are in hexadecimal)
Originally posted by Andrew Monkhouse:
Hi everyone,
Sorry - I was wrong, and Phil (and others) are correct. UTF-8 can be multi byte.
Sorry for any confusion.
Regards, Andrew