RTFJD (the JavaDocs are your friends!) If you haven't read them in a long time, then RRTFJD (they might have changed!)
Jesse Silverman wrote:So what every Reader does is to translate some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding) to Java chars, which happen to be Unicode (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) -- this is DECODING.
What every Writer does is to translate Java chars (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) to some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding).
Mike Simmons wrote:
I would say that not every Writer or Reader does the encoding/decoding translation themselves. It's specifically OutputStreamWriter and InputStreamReader that perform encoding and decoding, respectively. Other classes like PrintWriter end up wrapping an OutputStreamWriter (in some constructors) to handle that functionality. And others let you provide that yourself.
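A minimal sketch of that division of labour, assuming in-memory byte streams stand in for a file or socket. The class and method names (CodecDemo, encode, decode) are made up for illustration; only the java.io and java.nio.charset calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class CodecDemo {
    // Encoding: the OutputStreamWriter translates Java chars into UTF-8 bytes.
    static byte[] encode(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decoding: the InputStreamReader translates UTF-8 bytes back into Java chars.
    static String decode(byte[] encoded) {
        StringBuilder chars = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(encoded), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                chars.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = encode("h\u00E9llo");
        System.out.println(encoded.length);                        // 6: the e-acute takes two bytes
        System.out.println(decode(encoded).equals("h\u00E9llo"));  // true
    }
}
```

Any other Writer or Reader layered on top (a BufferedWriter, say) would only pass chars along; the charset work happens in these two classes.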
Jesse Silverman wrote:I continued thinking of "character set" as short for "coded character set", as I had for many years: the first part, but NOT the encoding, at least when I was speaking about Unicode.
(More than a decade of my professional experience was in the "bad old days" of code pages, which I mistakenly thought I had left behind for an "All Unicode" world, but apparently have not yet).
Java chars, which happen to be Unicode (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me)
a byte stream in UTF-8 or some other Charset, which includes the encoding).
I was totally choking on this terminology for a while. Now when we say "A reader reads chars" and "A writer writes chars" for short, I properly understand what's going on.
Stephan van Hulst wrote:
It's not good to think of strings as an encoded character stream. It represents a sequence of abstract characters. Even if it stores those abstract characters as an encoded character stream internally, that's an implementation detail.
The char primitive is a whole different matter though. You absolutely MUST remember that a char really represents a UTF-16 code unit, and NOT an abstract character.
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
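To make the JavaDoc passage concrete, here is a small sketch using one supplementary character (U+1F600, written as its surrogate pair). The class name CodeUnitDemo and the helper codePointLength are invented for this example.

```java
final class CodeUnitDemo {
    // Number of Unicode code points, as opposed to char code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it occupies two chars (a surrogate pair).
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                               // 4 code units
        System.out.println(codePointLength(s));                       // 3 code points
        System.out.println(Character.isHighSurrogate(s.charAt(1)));   // true: half a character
        System.out.println(Integer.toHexString(s.codePointAt(1)));    // 1f600
    }
}
```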
Jesse Silverman wrote:I want to think of my String values as a sequence of "code points", which is what abstract characters mean to me in a Unicode world, but at least for the extremely important special case of Unicode, that just doesn't work anymore, I constantly need to remember to think in terms of "code units" or else...
Because Java was born in a world where "code unit" was equal to "code point" in size
there is a whole field of API landmines one might step on.
Everyone seems to be able to work fine if every character is below 0x80, most seem to work pretty well with values less than 0x10000, but tons of people get confused when dealing with Strings that contain characters outside the BMP.
(Contradicts your point: if and when we are forced to use any of these, we must think in terms of char/code unit instead of character/code point)
So the hashcode would be different for the same String depending on representation...
A CharSequence is a readable sequence of char values. This interface provides uniform, read-only access to many different kinds of char sequences. A char value represents a character in the Basic Multilingual Plane (BMP) or a surrogate. Refer to Unicode Character Representation for details.
Again, better be thinking in char, not character, or else.
All in all, there seem to be many places where if you think of your String of length len as being a String of characters, len characters long, your code will work until it doesn't.
To be safe
Jesse Silverman wrote:So much more briefly, I wish I could interpret that as saying "Your String is just a String of abstract Unicode characters. You don't need to worry about whether that is UTF-8, UTF-16, or something else entirely."
But we can't, because there are dozens of places where the abstraction is leaky, and we need to know whether each character is one code unit (a Java char) or two (a surrogate pair of Java chars representing a single abstract character).
The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996. It is fully specified in RFC 2781, published in 2000 by the IETF.
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.
Yes, but why are you thinking of your string as having a certain length in the first place? Why is that interesting at all?
To be safe, stop writing code that needs to know about the length of strings.
Even if two apparently equal strings use the same encoding internally, they can still have different hash codes because strings may not be normalized. I can form the abstract character sequence "é" by using the code point "Latin Small Letter E with Acute", or as a combination of the code points "Latin Small Letter E" and "Combining Acute Accent". What about the abstract character sequence "Spaß"? This sequence can have a length of 4 or 5, depending on whether a case mapping or similar transformation has expanded "ß" to "ss", with differing hash codes to go along with it.
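The "é" case can be sketched with java.text.Normalizer. The class name NormalizationDemo and the helper sameText are hypothetical, and comparing after NFC is just one reasonable choice of normalization form.

```java
import java.text.Normalizer;

final class NormalizationDemo {
    // Compares two strings after bringing both to the same normalization form (NFC here).
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301";  // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

        System.out.println(composed.length());                            // 1
        System.out.println(decomposed.length());                          // 2
        System.out.println(composed.equals(decomposed));                  // false
        System.out.println(composed.hashCode() == decomposed.hashCode()); // false
        System.out.println(sameText(composed, decomposed));               // true
    }
}
```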
Sometimes the only way things ever got fixed is because people became uncomfortable.
Tim Holloway wrote:Just a general note. It has long been a habit for people to use the terms "byte" and "character" interchangeably. That was never a good idea, but languages like C were really sloppy about making the distinction. A character is a single code point value, no matter how many bits are in it. A byte is either the smallest uniquely-addressable element of memory or a fixed-length subdivision thereof. Historically the term "word" referenced singly-addressable units, but IBM, Motorola and others used "word" in reference to register sizes, so most machines today are byte-addressable.
Tim Holloway wrote:
A string, then, can have two dimensional values which may or may not be the same value. One is the number of characters in the string, which is usually referred to as string length. The other is the number of memory units (e.g., bytes) required to hold the string. This is the string's "size".
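A rough illustration of length versus size in Java terms, assuming "size" means the byte count after encoding in a given charset. LengthVsSize and sizeInBytes are names invented here; UTF_16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LengthVsSize {
    // "Size": how many bytes the string occupies once encoded in the given charset.
    static int sizeInBytes(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo";  // "héllo"
        System.out.println(s.length());                                   // 5 (chars)
        System.out.println(sizeInBytes(s, StandardCharsets.ISO_8859_1));  // 5
        System.out.println(sizeInBytes(s, StandardCharsets.UTF_8));       // 6
        System.out.println(sizeInBytes(s, StandardCharsets.UTF_16BE));    // 10
    }
}
```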
Stephan van Hulst wrote:
Jesse wrote:Everyone seems to be able to work fine if every character is below 0x80, most seem to work pretty well with values less than 0x10000, but tons of people get confused when dealing with Strings that contain characters outside the BMP.
This is because char is not an abstract character, but rather a UTF-16 code unit. Let's take a look at which methods of the String class will lead to confusion if the string consists of surrogate pairs. You already discussed many.
[summarization of Jesse's list elided]
This seems like a lot, but if we make the argument that char is the cause of all evil and train our developers to think "Here be dragons" whenever they see or use char, we can remove all methods that take or return a char. We're left with:
codePointBefore(int index) codePointCount(int beginIndex, int endIndex) indexOf(String str) indexOf(String str, int fromIndex) length() regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len) regionMatches(int toffset, String other, int ooffset, int len) startsWith(String prefix, int toffset) subSequence(int beginIndex, int endIndex) substring(int beginIndex) substring(int beginIndex, int endIndex)
...
We can eliminate all methods that you usually only use in conjunction with other method calls on the same string object.
To me, that leaves length(), which is sometimes used in debugging or I/O, and which is probably the primary method that causes confusion for English/American programmers. This method by its very nature is confusing, regardless of what "string atom" you've chosen for your language: Is it a number of code points? A number of visual glyphs? Something else? A string method with this name will always be ambiguous.
Only once you start working with indices and chars retrieved from the CharSequence using its methods. In general, you should think of CharSequence as an abstract character sequence, just like String. For instance, a StringBuilder in itself isn't dangerous. It becomes dangerous when you start appending chars individually, without treating surrogate pairs as atoms. Even the reverse() method of StringBuilder does things properly.
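A small sketch of the reverse() point, contrasting it with a naive char-by-char reversal. ReverseDemo and naiveReverse are hypothetical names; the test string again uses U+1F600 as its surrogate pair.

```java
final class ReverseDemo {
    // Naive reversal that treats every char as an independent unit,
    // which swaps the two halves of any surrogate pair.
    static String naiveReverse(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";

        // StringBuilder.reverse() keeps the surrogate pair in its original order.
        String good = new StringBuilder(s).reverse().toString();
        System.out.println(good.equals("b\uD83D\uDE00a"));            // true

        // The naive version leaves a low surrogate in front of a high one.
        String bad = naiveReverse(s);
        System.out.println(Character.isLowSurrogate(bad.charAt(1)));  // true
        System.out.println(bad.codePointCount(0, bad.length()));      // 4: two unpaired surrogates
    }
}
```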
Tim Holloway wrote:Whoops.
You've confused text rendering for text encoding.
Text is rendered by the graphics engine that runs your display terminal and printer (there's a lot of shared code there). How things line up will depend on whether you use logical or physical tabs (or tabs at all) and whether you use a monospaced font or a proportional font.
Back in DOS days using "glass TTY" terminals or equivalents - or typewriter devices - there often wasn't a choice of fonts. The font was hardware-encoded into the terminal itself in some cases. And it was monospaced.
Monospaced fonts are generally not as pretty or as readable as proportional fonts. In fact, my unofficial name for the Courier New font was "Courier ugly". Most displays these days are set to use proportional fonts. As are most of the fonts we use on the Ranch.
In a proportional font, a space, an "a", an "A", and an "m" are often different widths. So just spacing stuff out often disappoints. Hardware tabs can get around that (or CSS properties on web pages), but not every rendering medium supports hard tabs and not all of them allow setting tabs to whatever spacing you want. For Unix/DOS, the traditional hardware tab was supposed to be equivalent to 8 spaces, but in programs like Microsoft Word you could set them any way you liked.
There's a lot more that could be said, but basically, it's up to what font your display device/window is using.
When writing in HTML, the <tt> tag was used to designate inline teletype text. It was intended to style text as it would appear on a fixed-width display, using the browser's default monotype font.
Jesse Silverman wrote:I was also thinking about how an API built around UTF-16 may have some implementation simplifications because every code unit is exactly two bytes, regardless of data, making collections of code units "RandomAccess" essentially, but with complications in use around the fact that some "code points" are two "code units" and others are one.
UTF-8 on the other hand, has no consistency of "code unit" size, as they can be 1, 2, 3 or 4 bytes each
I am guessing that lengths in embedded format string specifiers [...] refers to Java char counts and not to "abstract characters" or bytes, right?
Jesse Silverman wrote:So, as you can see, I am having trouble getting sh*t to line up with printf().
Stephan van Hulst wrote:The ligature "fi" consists of two abstract characters, but in many fonts is displayed as a single glyph.
Be careful. UTF-8 DOES have a fixed code unit size: 1 byte. We say that some Unicode code points are encoded by up to four UTF-8 code units, but each individual code unit is always exactly 1 byte.
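The distinction can be checked by counting code units per code point in both encodings. This is only a sketch: Utf8UnitsDemo, utf8Units and utf16Units are made-up names, and the four sample code points are arbitrary picks from each UTF-8 length class.

```java
import java.nio.charset.StandardCharsets;

final class Utf8UnitsDemo {
    // Number of 1-byte UTF-8 code units needed to encode one code point.
    static int utf8Units(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 2-byte UTF-16 code units (Java chars) needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};  // A, e-acute, euro sign, emoji
        for (int cp : samples) {
            System.out.printf("U+%04X: %d UTF-8 unit(s), %d UTF-16 unit(s)%n",
                    cp, utf8Units(cp), utf16Units(cp));
        }
        // UTF-8 counts are 1, 2, 3, 4; UTF-16 counts are 1, 1, 1, 2.
    }
}
```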
Tim Holloway wrote:Oh, and no, Java doesn't do print formatting differently than C or any other purely character-oriented system, whether Unicode, EBCDIC, ASCII, DBCS or whatever. It's the display rendering (GUI) that's giving you the most pain.
Tim Holloway wrote:Don't use the Ranch webpage to prove your point. Those mis-aligned examples are not all using the same font. And in fact, they're not even using monospaced fonts! The code tag was not the final authority used for rendering there.
Tim Holloway wrote:
"printf" deals with characters and things it thinks are characters. ... All DBCS, regardless, even as you type in 8-bit ASCII source but Java String literals are not ASCII, they're Unicode.
If C/C++ has extended itself beyond DBCS to variable-length characters, that falls outside the time I monitored every change in the language. Initially it did not, however, and that's the model that Java is using, since how Java stores a char internally doesn't count.
Tim Holloway wrote:Python is a whole different bucket of snakes. It seems to be schizophrenic about whether it wants to use the OS native character set or Unicode for text operations, and I've gotten into fights with it more than once. Perl is based on old-time C. For the sake of sound sleeping, I refuse to look at what JavaScript is doing until I must.
Stephan van Hulst wrote:Don't expect an old holdover method from C++ to hide this complexity for you. If you want to do advanced formatting, you might want to use a software package like LaTeX.
Tim Holloway wrote:...
But also, don't forget that the reason the emoji weren't rendering at the same width as the characters was that the Ranch renderer was employing a different font for the emoji than for the text. It's not that monospaced fonts have magical "monospace-free" zones in them!
Windows Terminal provides a tab-based UI that supports font fallback and complex scripts, which is a significant improvement over the builtin terminal that conhost.exe provides.