
Java Terminology For Charset

 
Jesse Silverman
As I was studying IOStreams both for certification and for general use, I became quite confused about Reader and Writer classes a number of times.

One source of the confusion is that Java (not always consistently) uses the term Charset to encompass both the repertoire of characters and their names (Java calls that part a 'coded character set') and also the encoding used.

I kept thinking of "character set" as short for "coded character set", as I had for many years: the first part, but NOT the encoding, at least when I was speaking about Unicode.  Charset implies ENCODING in Java!

(More than a decade of my professional experience was in the "bad old days" of code pages, which I mistakenly thought I had left behind for an "All Unicode" world, but apparently have not yet).

So reading this is beyond mandatory; if I had a time machine, I'd go back and force myself to read the terminology section here before reading anything else:
https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/nio/charset/Charset.html

So what every Reader does is to translate some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding) to Java chars, which happen to be Unicode (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) -- this is DECODING.

What every Writer does is to translate Java chars  (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) to some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding).
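In code, a minimal sketch of the two directions (file names made up, charset spelled out; assumes java.io.* and java.nio.charset.StandardCharsets are imported):

// DECODING: a byte stream in some Charset --> Java chars
Reader reader = new InputStreamReader(new FileInputStream("in.txt"), StandardCharsets.UTF_8);

// ENCODING: Java chars --> a byte stream in some Charset
Writer writer = new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8);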

I was totally choking on this terminology for a while.  Now when we say "A reader reads chars" and "A writer writes chars" for short, I properly understand what's going on.
 
Mike Simmons

Jesse Silverman wrote:So what every Reader does is to translate some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding) to Java chars, which happen to be Unicode (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) -- this is DECODING.

What every Writer does is to translate Java chars  (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) to some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding).



I would say that not every Writer or Reader does the encoding/decoding translation themselves.  It's specifically OutputStreamWriter and InputStreamReader that perform encoding and decoding, respectively.  Other classes like PrintWriter end up wrapping an OutputStreamWriter (in some constructors) to handle that functionality.  And others let you provide that yourself.
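For instance (a sketch; the file name is made up):

// This PrintWriter doesn't encode anything itself; the constructor below
// wraps the OutputStreamWriter we hand it, and THAT does the encoding.
PrintWriter pw = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8));
pw.println("Spaß");  // chars in, UTF-8 bytes out
pw.close();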
 
Jesse Silverman

Mike Simmons wrote:

Jesse Silverman wrote:So what every Reader does is to translate some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding) to Java chars, which happen to be Unicode (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) -- this is DECODING.

What every Writer does is to translate Java chars  (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me) to some ENCODED CHARACTER STREAM (a byte stream in UTF-8 or some other Charset, which includes the encoding).



I would say that not every Writer or Reader does the encoding/decoding translation themselves.  It's specifically OutputStreamWriter and InputStreamReader that perform encoding and decoding, respectively.  Other classes like PrintWriter end up wrapping an OutputStreamWriter (in some constructors) to handle that functionality.  And others let you provide that yourself.



Interesting, so every CALL to the Writer methods will result in encoding from (possibly something else to possibly String and then) chars to bytes in some Charset encoding, and every CALL to the Reader methods will result in decoding from bytes in some Charset encoding to chars (to possibly String and possibly something else on top).  But the class you are actually invoking the method on is NOT necessarily doing that itself; it may be delegating the encoding to something it wraps [or lets you provide yourself, which I hadn't seen].  I think specifically PrintWriter and PrintStream go down to char level, but the char-to-byte encoding may be delegated?
 
Stephan van Hulst
As you say, a Charset is not really a character set, but rather an encoding scheme. Java really buggered up the naming there.

I would go as far as to say that almost no writer encodes characters directly before they write them away. As Mike hinted at, many readers/writers decorate a more primitive reader/writer with extra functionality, at the heart of which are the InputStreamReader and OutputStreamWriter classes.

Not even these classes perform encoding directly. They delegate encoding to the Charset, which in turn delegates encoding to a CharsetEncoder.
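Spelled out, roughly (a sketch of that delegation chain, using the public API):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncoderDemo {
    public static void main(String[] args) throws Exception {
        // What OutputStreamWriter arranges for you behind the scenes:
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer bytes = encoder.encode(CharBuffer.wrap("Spaß"));
        System.out.println(bytes.remaining()); // 5 -- 'ß' takes two bytes in UTF-8
    }
}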

Jesse Silverman wrote:I continued thinking "character set" short for "coded character set" as I had for many years, the first part, but NOT the encoding when I was speaking about Unicode, at least.



For Unicode at least it's correct to think "character set", because Unicode IS a character set, and not an encoding. UTF-8 and UTF-16 are examples of encodings that encode Unicode. ASCII is an example of an encoding that encodes a subset of Unicode.

(More than a decade of my professional experience was in the "bad old days" of code pages, which I mistakenly thought I had left behind for an "All Unicode" world, but apparently have not yet).


Most default encodings used by systems that Java is installed on are code pages, and it doesn't look like they will switch to UTF-8 any time soon. Code pages aren't necessarily bad, it's just that many are so similar that people tend to forget to think about them as completely different encodings that encode different character sets.

Java chars, which happen to be Unicode (actually still of course encoded in UTF-16 or possibly UTF-8 internally, which confused me)


It's not good to think of a string as an encoded character stream. It represents a sequence of abstract characters. Even if it stores those abstract characters as an encoded character stream internally, that's an implementation detail.

The char primitive is a whole different matter though. You absolutely MUST remember that a char really represents a UTF-16 code unit, and NOT an abstract character.
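A quick demonstration (the poop emoji is U+1F4A9, outside the BMP):

String poo = "💩";
System.out.println(poo.length());                             // 2 -- code units, not characters
System.out.println(poo.codePointCount(0, poo.length()));      // 1 -- one abstract character
System.out.println(Character.isHighSurrogate(poo.charAt(0))); // true -- charAt() hands you half a character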

a byte stream in UTF-8 or some other Charset, which includes the encoding).


A Charset does not include an encoding. A Charset IS an encoding. As I said earlier, poor naming by the Java team.

I was totally choking on this terminology for a while.  Now when we say "A reader reads chars" and "A writer writes chars" for short, I properly understand what's going on.


Great.
 
Jesse Silverman

Stephan van Hulst wrote:

It's not good to think of a string as an encoded character stream. It represents a sequence of abstract characters. Even if it stores those abstract characters as an encoded character stream internally, that's an implementation detail.

The char primitive is a whole different matter though. You absolutely MUST remember that a char really represents a UTF-16 code unit, and NOT an abstract character.



I won't revisit the queasy, sleazy feeling I get whenever I seem to be disagreeing with you.  Oops, I just did.
If all you meant is that you don't need to think about whether String objects are encoded in UTF-16LE or UTF-16BE, well, fine, but that isn't saying much, because...
I want to think of my String values as a sequence of "code points", which is what abstract characters mean to me in a Unicode world, but at least for the extremely important special case of Unicode, that just doesn't work anymore, I constantly need to remember to think in terms of "code units" or else...

The best way to think of String is however people will make the fewest mistakes in API usage, I think.  Because Java was born in a world where "code unit" was equal to "code point" in size, and that is no longer true, there is a whole field of API landmines one might step on.  Everyone seems to be able to work fine if every character is below 0x80, most seem to work pretty well with values less than 0x10000, but tons of people get confused when dealing with Strings that contain characters outside the BMP.  Now that these dang emojis have become so popular, it no longer just applies to the people in the East who were already good at such things!

So I was trying to stick to the way you say to think of it, BUT here are the places we need to think in Java char/code unit versus beautiful, platonic ideal "code points":

From the docs:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.


(Contradicts your point: if and when we are forced to use any of these, we must think in terms of char/code unit instead of character/code point)
[Notice, the String API seems to pretty consistently use char to mean "code unit" and character to mean "code point" AAAAAAAAARRRRRRRRRRRRRRRRRGGGGGGGGGHHHHH!!!]

The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).
(Reinforces your point, but only goes so far before we need one of the others and are back to thinking in code units...)

String methods for which you'd better think of the String as being composed of "code units" to understand their behavior:


codePointAt(int index): note you get a nice neat "code point" out, but your index must be in code units, not number of code points, where that makes any difference.  So, indexing here is all about "code units".

Same here with codePointCount(int beginIndex, int endIndex); indexes are in "code units", the count out is in "code points":

Returns the number of Unicode code points in the specified text range of this String. The text range begins at the specified beginIndex and extends to the char at index endIndex - 1. Thus the length (in chars) of the text range is endIndex-beginIndex. Unpaired surrogates within the text range count as one code point each.

offsetByCodePoints(int index, int codePointOffset) goes the other way; the offset you pass is in "code points", and the index you get back is in "code units":



Index values are all measured in code points (more important); output is an array of chars (less important, it ain't a String no more)


Indexes are measured in "code units/number of char values", not "number of code points/characters"


Index measured in "code units".


hashCode(): so the hashcode would be different for the same String depending on representation...




indexOf(int ch): you can pass in a nice neat "code point", good, but your answer is returned in number of "code units".


indexOf(int ch, int fromIndex): both indexes (where to search from, and the resulting index that gets returned) are measured in "code units", tho we can search for a "code point", which is nice.


lastIndexOf(int ch) and lastIndexOf(int ch, int fromIndex): the returned index is measured in "code units".


All indices are specified in char values (Unicode code units).
ch - a character (Unicode code point).


Boy, I am beating this dead moose hard!


Both indexes are measured in "code units".


yeah...


No option to deal with pure, ideal "code points", only char values.  Must think in char/"code unit" to use this.


A CharSequence is a readable sequence of char values. This interface provides uniform, read-only access to many different kinds of char sequences. A char value represents a character in the Basic Multilingual Plane (BMP) or a surrogate. Refer to Unicode Character Representation for details.
Again, better be thinking in char, not character, or else.


yada

yada

yada


This bugger, chars(), looks like it is giving us back "code points", but it is NOT.  They are simply zero-extended char values, including meaningless zero-extended surrogate pair components!!

I will stop beating that dead moose.  For all of these you need to think and count in char/code unit, and to realize that Strings aren't just made of "characters" (which can be multiple code units)...
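To make that concrete, a little sketch (ↇ, U+2187, is one code unit; 💩, U+1F4A9, is two):

String s = "Aↇ💩";
System.out.println(s.length());                        // 4 code units
System.out.println(s.codePointCount(0, s.length()));   // 3 code points
System.out.println(s.indexOf(0x1F4A9));                // 2 -- a code-unit index
System.out.println(s.substring(s.indexOf(0x1F4A9)));   // 💩 -- indexes round-trip safely between calls
s.chars().forEach(c -> System.out.printf("%04X ", c)); // 0041 2187 D83D DCA9 -- zero-extended code units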

Parts of the String API let us forget about all this mess, whether we are working with Strings that contain only characters represented by precisely one code unit or NOT (Yay!):
String methods for which it is fine and good to think of the String as composed of "code points", with however many code units that takes being just an implementation detail:


Actually I'll grant you all the constructors: you don't need to worry about what the String you get out will be made of inside, and there are no indexes into it to trouble with.




I'm not sure about these.  In practice, it is indeed doing a char by char comparison, but I could imagine something else happening and they could still work without an API change...


OK, those could all just work regardless of whether String was made out of chars or characters; that implementation detail doesn't leak out of these calls... actually, I'm not sure even there: are the lengths in format specifiers numbers of char, not numbers of characters?  I DON'T EVEN KNOW, I'D HAVE TO LOOK IT UP 🤷‍♂️





I think...




[actually not sure on that one whether you ever need to think about code point versus code unit]

public String toUpperCase​(Locale locale)
[same deal]


[Wait, this legacy method in wide use, trim(), only works correctly for ASCII; it misses all manner of legal Unicode whitespace, so let's just forget this one!]



All in all, there seem to be many places where if you think of your String of length len as being a String of characters, len characters long, your code will work until it doesn't.
To be safe, you need to always keep in mind that the instant you step outside the BMP, each character (code point) may be one or two Java char (code units) in length, except I think as noted above.
That a String in Java is just some number of legal Unicode characters, in some encoding or another, is an extremely leaky abstraction.
 
Jesse Silverman
Stephan van Hulst wrote:


It's not good to think of a string as an encoded character stream. It represents a sequence of abstract characters. Even if it stores those abstract characters as an encoded character stream internally, that's an implementation detail.

The char primitive is a whole different matter though. You absolutely MUST remember that a char really represents a UTF-16 code unit, and NOT an abstract character.



So much more briefly, I wish I could interpret that as saying "Your String is just a String of abstract Unicode characters.  You don't need to worry about whether that is UTF-8, UTF-16, or something else entirely."

But we can't, because there are dozens of places that the abstraction is leaky, and we need to know if each character is one code unit(Java char), or two (surrogate pairs of Java chars representing those single abstract characters).

That with many String values [pure BMP] we can forget this and work by accident might make it worse, not better: it is easy to think you are on top of things when you/your code is making assumptions that only hold for the values you tested...
 
Stephan van Hulst

Jesse Silverman wrote:I want to think of my String values as a sequence of "code points", which is what abstract characters mean to me in a Unicode world, but at least for the extremely important special case of Unicode, that just doesn't work anymore, I constantly need to remember to think in terms of "code units" or else...


I will use "Unicode" to refer to the character set, not the consortium or the specification. I don't see how Unicode is special. Unicode is just a character set that assigns an index (a "code point") to every abstract character. For the purposes of this discussion, I will use "code point" and "abstract character" interchangeably. Unicode hasn't got anything to do with "code units". A "code unit" is a property of an encoding, not a character set.

The reason you constantly have to think in terms of code units is that a handful of String methods expose implementation details, usually in the form of a char or an index.

Because Java was born in a world where "code unit" was equal to "code point" in size


Java was born in a world where people didn't bother to learn the difference between a code unit and a code point. However, the mere fact that they chose UTF-16 and not UCS-2 to be their "string atoms" indicates that the designers had at least some notion of the difference between code units and code points. The mistake wasn't necessarily using UTF-16, but rather exposing the char primitive as the unit of work, and then also naming the char primitive so poorly and ambiguously.

there is a whole field of API landmines one might step on.


Yes. Java's greatest sin is that they wanted to be like C++. The language and API are full of horrible leaky abstractions, because they didn't dare to stray too far away from their parental language. I will consider the landmines in the next paragraph.

Everyone seems to be able to work fine if every character is below 0x80, most seem to work pretty well with values less than 0x10000, but tons of people get confused when dealing with Strings that contain characters outside the BMP.


This is because char is not an abstract character, but rather a UTF-16 code unit. Let's take a look at which methods of the String class will lead to confusion if the string consists of surrogate pairs. You already discussed many.

  • String​(char[] value)
  • String​(char[] value, int offset, int count)
  • charAt(int index)
  • chars()
  • codePointBefore​(int index)
  • codePointCount​(int beginIndex, int endIndex)
  • copyValueOf(char[] data)
  • copyValueOf(char[] data, int offset, int count)
  • getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)
  • indexOf​(int ch)
  • indexOf​(int ch, int fromIndex)
  • indexOf​(String str)
  • indexOf​(String str, int fromIndex)
  • lastIndexOf​(int ch)
  • lastIndexOf​(int ch, int fromIndex)
  • length()
  • regionMatches​(boolean ignoreCase, int toffset, String other, int ooffset, int len)
  • regionMatches​(int toffset, String other, int ooffset, int len)
  • replace​(char oldChar, char newChar)
  • startsWith​(String prefix, int toffset)
  • subSequence​(int beginIndex, int endIndex)
  • substring​(int beginIndex)
  • substring​(int beginIndex, int endIndex)
  • valueOf​(char c)
  • valueOf​(char[] data)
  • valueOf​(char[] data, int offset, int count)

    This seems like a lot, but if we make the argument that char is the cause of all evil and train our developers to think "Here be dragons" whenever they see or use char, we can remove all methods that take or return a char. We're left with:

  • codePointBefore​(int index)
  • codePointCount​(int beginIndex, int endIndex)
  • indexOf​(String str)
  • indexOf​(String str, int fromIndex)
  • length()
  • regionMatches​(boolean ignoreCase, int toffset, String other, int ooffset, int len)
  • regionMatches​(int toffset, String other, int ooffset, int len)
  • startsWith​(String prefix, int toffset)
  • subSequence​(int beginIndex, int endIndex)
  • substring​(int beginIndex)
  • substring​(int beginIndex, int endIndex)

    (Contradicts your point: if and when we are forced to use any of these, we must think in terms of char/code unit instead of character/code point)


    It does contradict my point, but if we only consider the methods I just listed, I don't think it's enough to throw out the baby with the bathwater.

    Who uses regionMatches and subSequence?

    The most dangerous methods are codePointBefore​, codePointCount​, indexOf, length, startsWith(String prefix, int toffset) and substring. However, as long as we consider an index to be an "abstract index", not an integer, and only use it in conjunction with other string methods, e.g.:
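    [the inline example was elided; a sketch of the idea, for some String s: the index is only ever produced by one method of the string and fed back into another:]

    int at = s.indexOf("💩");      // an opaque token (it happens to be a code-unit index)
    String tail = s.substring(at); // safe: same string, same units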

    We can eliminate all methods that you usually only use in conjunction with other method calls on the same string object.

    To me, that leaves length(), which is sometimes used in debugging or I/O, and which is probably the primary method that causes confusion for English/American programmers. This method by its very nature is confusing, regardless of what "string atom" you've chosen for your language: Is it a number of code points? A number of visual glyphs? Something else? A string method with this name will always be ambiguous.
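    (For instance, a sketch with java.text.BreakIterator, giving three defensible "lengths" for one short string:)

    String s = "e\u0301"; // "é" as LATIN SMALL LETTER E plus COMBINING ACUTE ACCENT
    System.out.println(s.length());                      // 2 -- UTF-16 code units
    System.out.println(s.codePointCount(0, s.length())); // 2 -- code points
    java.text.BreakIterator it = java.text.BreakIterator.getCharacterInstance();
    it.setText(s);
    int glyphs = 0;
    while (it.next() != java.text.BreakIterator.DONE) glyphs++;
    System.out.println(glyphs);                          // 1 -- user-perceived character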

    So the hashcode would be different for the same String depending on representation...


    Knowing about UTF-16 won't help you here. Even if two apparently equal strings use the same encoding internally, they can still have different hash codes because strings may not be normalized. I can form the abstract character sequence "é" by using the code point "Latin Small Letter E with Acute", or as a combination of the code points "Latin Small Letter E" and "Combining Acute Accent". What about the abstract character sequence "Spaß"? This sequence can have a length of 4 or 5, depending on which string normalization algorithm you use, and differing hash codes to go along with it.
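    (A sketch with java.text.Normalizer:)

    String composed = "\u00E9";    // é as a single code point
    String decomposed = "e\u0301"; // e plus a combining acute accent
    System.out.println(composed.equals(decomposed));                  // false
    System.out.println(composed.hashCode() == decomposed.hashCode()); // false
    System.out.println(java.text.Normalizer.normalize(decomposed,
            java.text.Normalizer.Form.NFC).equals(composed));         // true once normalized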

    A CharSequence is a readable sequence of char values. This interface provides uniform, read-only access to many different kinds of char sequences. A char value represents a character in the Basic Multilingual Plane (BMP) or a surrogate. Refer to Unicode Character Representation for details.
    Again, better be thinking in char, not character, or else.


    Only once you start working with indices and chars retrieved from the CharSequence using its methods. In general, you should think of CharSequence as an abstract character sequence, just like String. For instance, a StringBuilder in itself isn't dangerous. It becomes dangerous when you start appending chars individually, without treating surrogate pairs as atoms. Even the reverse() method of StringBuilder does things properly.

    All in all, there seem to be many places where if you think of your String of length len as being a String of characters, len characters long, your code will work until it doesn't.


    Yes, but why are you thinking of your string as having a certain length in the first place? Why is that interesting at all?

    To be safe


    To be safe, stop writing code that needs to know about the length of strings.
     
    Stephan van Hulst

    Jesse Silverman wrote:So much more briefly, I wish I could interpret that as saying "Your String is just a String of abstract Unicode characters.  You don't need to worry about whether that is UTF-8, UTF-16, or something else entirely."

    But we can't, because there are dozens of places that the abstraction is leaky, and we need to know if each character is one code unit(Java char), or two (surrogate pairs of Java chars representing those single abstract characters).


    I think the matter comes down to: How leaky must an abstraction be before you treat it as its implementation, and not as the abstraction?

    Do we not say that the Date class represents an absolute instant in time, even though its implementation as a number of milliseconds relative to the Unix epoch shines through brilliantly? Who really thinks of Date as a Unix timestamp?
     
    Jesse Silverman
    Stephan:

    Thanks for being a "landmine Sherpa"!

    One stupid detail in terms of UTF-16 and Java history courtesy of Wikipedia:

    The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996. It is fully specified in RFC 2781, published in 2000 by the IETF.


    If I am not mistaken, much of the String API was already "cast in concrete" by 1995, so I can't really blame Java for not using it from day one, as it wasn't a published standard yet.

    Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.


    Java SE 5.0 was the first version to drop the "Java 2" nickname, but I presume that quote is true.
    Which means people using Java 2 1.4 in 2003 were 8 years into Java programming and still using UCS-2.  If so, that is kind of a while.
    Of course, in the early years, almost everything was still BMP, so who cared; "The Rise of the Emoji 😎❤✌" is a relatively modern phenomenon, and in many parts of the world, relatively few people use the languages that require straying from the BMP.  But now, it is a big deal, and I can't believe what a high percentage of text I see written by non-programmers is liberally sprinkled with non-BMP emojis everywhere.
     
    Jesse Silverman

    Yes, but why are you thinking of your string as having a certain length in the first place? Why is that interesting at all?

    To be safe


    To be safe, stop writing code that needs to know about the length of strings.


    LOL, the same reason Java has many of its problems!  I came from C/C++.

    Null-terminated C strings, both on the stack and the heap, caused so many bugs that relentlessly fixing them everywhere (usually in code written by others) pretty much bought me my first house.  Those are deadly even in pure 7-bit ANSI/ASCII, and exposing them is like giving a handgun to a 5-year-old...

    In particular, getting confused about whether counts were of chars or bytes caused tons of crashes, particularly with Windows API's, but also elsewhere.
    When using "advanced debugging aids" on my code running on Windows, they would identify places, even in internal Windows APIs, where such mistakes were made; the only reason we didn't see crashes/corruptions 500 times a day was that they had cautiously over-allocated buffers far larger than normally needed.  If you got up to half the space you thought you had, they could start crashing.

    But "Hooray!" at this point I can let down my paranoia somewhat, because we aren't going to get crashes or corruptions due to confusions about lengths in Java, it is just a long-held hard-to-break habit for me in this regard.  [And sometimes I *am* back in C/C++].

    What you wrote does seem to provide a great field guide to all this.  I would point out that almost none of the materials I have consulted do so equally well.  (In particular, I have only briefly and superficially gone thru this part of the Java Tutorials, but I have looked at quite a lot of text and video about how to use Strings in Java, and almost none of them emphasize the safe and healthy way of thinking about things that you have carefully laid out.  A very high percentage of people who "kinda sorta" know Java have no clue about this stuff, including but not limited to the now-important distinction between Java char and "general Unicode character".)

    Even if two apparently equal strings use the same encoding internally, they can still have different hash codes because strings may not be normalized. I can form the abstract character sequence "é" by using the code point "Latin Small Letter E with Acute", or as a combination of the code points "Latin Small Letter E" and "Combining Acute Accent". What about the abstract character sequence "Spaß"? This sequence can have a length of 4 or 5, depending on which string normalization algorithm you use, and differing hash codes to go along with it.


    Agreed.  There are some "general Unicode gotchas", mostly involving multiple representations for what every sane non-programmer just considers the same characters, that everyone doing Unicode work in any language (or just on a Platonic whiteboard) needs to be aware of.
    I was mostly concerned over the weekend with how to live life in Java without tripping over differences between char and code point and falling down the stairs.  I think you did an admirable job of throwing together a guide for that.

    I was also thinking about how an API built around UTF-16 may have some implementation simplifications because every code unit is exactly two bytes, regardless of data, making collections of code units "RandomAccess" essentially, but with complications in use around the fact that some "code points" are two "code units" and others are one.
    UTF-8 on the other hand, has no consistency of "code unit" size, as they can be 1, 2, 3 or 4 bytes each, making code units NOT "RandomAccess", however, the mapping of "code point" to "code unit" is trivial, you never have multiple code units for a code point.
    I borrowed the term "RandomAccess" that is used to describe the property that ArrayLists have that LinkedLists don't.
    It is an implementation detail, but I was thinking that it is probably easier to design an API that hides all sorts of junk using UTF-8 and allowing users to think solely in pure code points, whereas with UTF-16 being exposed as in Java one does need to work with code units.

    I will come back to your guide posted this morning when I find myself either confused or stressing about any of these things.

    I am guessing that lengths in embedded format string specifiers:
    public PrintWriter printf​(String format, Object... args)

    refers to Java char counts and not to "abstract characters" or bytes, right?  Or maybe not.  My confusion is GREATLY diminished, just not to zero.  I think it does matter here.
     
    Jesse Silverman
    I believe that this code snippet:
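    [the snippet was elided; reconstructed to match the output below:]

    String s = "ↇ💩💩💩💩"; // 5 code points, but 9 char code units
    System.out.printf("]%s[%n", s);
    System.out.printf("]%9s[%n", s);
    System.out.printf("]%10s[%n", s);
    System.out.printf("]%20s[%n", s);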



    demonstrates that you do need to think of printf format widths, at least for strings, in terms of Java chars/code units, NOT abstract Unicode characters, or you will step in it:
    ]ↇ💩💩💩💩[
    ]ↇ💩💩💩💩[
    ] ↇ💩💩💩💩[
    ]           ↇ💩💩💩💩[

    printf() is also a hold-over from C/C++ of course.
    I think this is an exception to "Will you stop thinking about lengths all the time??  You are in Java now, you can relax!!"
     
    Tim Holloway
    Just a general note. It has long been a habit for people to use the terms "byte" and "character" interchangeably. That was never a good idea, but languages like C were really sloppy about making the distinction. A character is a single code point value, no matter how many bits are in it. A byte is either the smallest uniquely-addressable element of memory or a fixed-length subdivision thereof. Historically the term "word" referenced singly-addressable units, but IBM, Motorola and others used "word" in reference to register sizes, so most machines today are byte-addressable.

    A string, then, can have two size values, which may or may not be the same. One is the number of characters in the string, which is usually referred to as string length. The other is the number of memory units (e.g., bytes) required to hold the string. This is the string's "size".

    Java is an abstract language, so actual String sizes, character array sizes, and similar beasts are unknown, much less their overhead size for being objects. You can convert, via codepage translation to/from byte arrays, but it's not the same thing.
     
    Jesse Silverman

    Tim Holloway wrote:Just a general note. It has long been a habit for people to use the terms "byte" and "character" interchangeably. That was never a good idea, but languages like C were really sloppy about making the distinction. A character is a single code point value, no matter how many bits are in it. A byte is either the smallest uniquely-addressable element of memory or a fixed-length subdivision thereof. Historically the term "word" referenced singly-addressable units, but IBM, Motorola and others used "word" in reference to register sizes, so most machines today are byte-addressable.


    I grew up in that barrio, and Stephan is helping me adjust to life in the Java Suburbs.  Note, real sticklers (who use the term denary for base-10 ints, no doubt) often insist on octet to mean a unit consisting of 8 bits, because a byte may or may not be 8 bits. Of course, they may annoy regular plain old folks by doing this, and it seems that everyone else uses byte to mean 8 bits.

    Tim Holloway wrote:
    A string, then, can have two size values, which may or may not be the same. One is the number of characters in the string, which is usually referred to as string length. The other is the number of memory units (e.g., bytes) required to hold the string. This is the string's "size".



    But there are three, one of which we almost never care about, and one I am trying hard to care less about when working in Java.
    1. Actual number of bytes --> I am willing to forget about this in Java, because that's someone else's problem, not the programmer's.
    2. The number of Abstract Characters or Code Points --> I am trying to spend less time thinking about this than I had been, but sometimes this does matter.
    3. The number of Java char elements, or Code Units --> I just showed an example with printf formatting for strings where this is the one you are thinking about, not either of the two you mentioned.

    Java took us away from the bad old habit of equating "character" to "byte" but now we need to distinguish, whenever anyone strays outside the Basic Multilingual Plane, between chars (code units) and Abstract Characters (code points)...I am working on worrying about this the exact right amount, neither too much nor too little.

    The number of Abstract Characters in a Unicode string doesn't vary based on encoding, and encoding could be ignored.  But the number of code units depends on its encoding, so if you are counting Java chars (which you should stop doing except places you actually need to for some reason) then it matters that Java String is composed of UTF-16, with each character either one or two Java char units.

    Which is to say, I dealt with the old problem you mentioned for many years, hooray, it is over, but I am dealing with the new one now which is here to stay (tho as Stephan kindly showed, not as bad to live with as I thought).
     
    Tim Holloway
    The word "octet" is extensively used in formal specifications such as ASN.1 and network specs in general. The word "byte" carries too many unfortunate connotations - not only in bit count, but also in byte ordering, thanks mainly to DEC and Intel. Some early IBM equipment used 6-bit bytes, give or take, and in fact the famous IBM "green screen" terminals of the 1960s were actually 6-bit devices even after they were being hooked to 8-bit-byte mainframes (which is one reason why lower-case wasn't too common back then). Octets in data transmission are always 8 bits and never swapped, though the data packed into them may feel differently.

    I'd also be careful with the term "coded". Specifically, the number of bits required to represent a character in a given code set is not the same as the number of bits of RAM actually used to store it, which can vary at the VM's will. For example, a given "A" in the sequence AAABCCB might be 7 ASCII bits or 8 EBCDIC bits, but if the string itself is stored in RLE (Run Length Encoded) form, 2 consecutive A's and 10 consecutive A's might both consume the same amount of RAM.
     
    Jesse Silverman
    Going back to my example of my confusion, which the crew here is doing a great deal to alleviate, here is something I find slightly counter-intuitive, probably because I am bringing my old C/C++ expectations with me as I move to Java.

    I mentioned I kept wondering whether sizes meant code units or code points, and Stephan has been talking me down off the ledge, pointing out how often it just doesn't matter in Java, as long as we use them consistently, and I should really just relax.

    I understand the distinction between data and presentation, and that "column formatting" ruled the world from the 50's thru the 90's, and is sort of old-fashioned anyway, and only means much when dealing with monospaced fonts, etc.

    But I must have coded things to line up by column between two dozen and two hundred times, sometimes for exams and sometimes per spec at work, and it is something I think of when I see a printf() variant in Perl, Python, Java, C# or anywhere else for that matter.  I am not *quite* ready to give it up.

    I had been wondering what the count in %<number>s format specifiers was measured in.  I reasoned that it couldn't logically be in Java char count (i.e. code units) because that would break 70 years of formatting by column count.  Well, Batman, it does:
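    [snippet elided; reconstructed to match the output below:]

    String poo = "ↇ💩💩💩💩"; // 9 code units, 5 code points
    String abc = "ABCDE";     // 5 code units, 5 code points
    for (String s : new String[] { poo, abc }) {
        System.out.printf("]%s[%n", s);
        System.out.printf("]%9s[%n", s);
        System.out.printf("]%10s[%n", s);
        System.out.printf("]%20s[%n", s);
    }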



    Yields, in my Windows 10 Eclipse set-up running Java 14:
    ]ↇ💩💩💩💩[
    ]ↇ💩💩💩💩[
    ] ↇ💩💩💩💩[
    ]           ↇ💩💩💩💩[
    ] ABCDE[
    ]   ABCDE[
    ]     ABCDE[
    ]               ABCDE[


    So, as you can see, I am having trouble getting sh*t to line up with printf().
    I might expect that in UTF-16, but my old column counting ways are doomed in UTF-8 as well, it appears, unless I screwed sh*t up.

    EDIT -- I may also be confused about the proper use of the monospaced tag in this forum, I feel like Vinny Barbarino in "Welcome Back Kotter".
     
    Jesse Silverman

    Stephan van Hulst wrote:

    Jesse wrote:Everyone seems to be able to work fine if every character is below 0x80, most seem to work pretty well with values less than 0x10000, but tons of people get confused when dealing with Strings that contain characters outside the BMP.


    This is because char is not an abstract character, but rather a UTF-16 code unit. Let's take a look at which methods of the String class will lead to confusion if the string consists of surrogate pairs. You already discussed many.

    [summarization of Jesse's list elided]

    This seems like a lot, but if we make the argument that char is the cause of all evil and train our developers to think "Here be dragons" whenever they see or use char, we can remove all methods that take or return a char. We're left with:

  • codePointBefore​(int index)
  • codePointCount​(int beginIndex, int endIndex)
  • indexOf​(String str)
  • indexOf​(String str, int fromIndex)
  • length()
  • regionMatches​(boolean ignoreCase, int toffset, String other, int ooffset, int len)
  • regionMatches​(int toffset, String other, int ooffset, int len)
  • startsWith​(String prefix, int toffset)
  • subSequence​(int beginIndex, int endIndex)
  • substring​(int beginIndex)
  • substring​(int beginIndex, int endIndex)

  • ...
    We can eliminate all methods that you usually only use in conjunction with other method calls on the same string object.

    To me, that leaves length(), which is sometimes used in debugging or I/O, and which is probably the primary method that causes confusion for English/American programmers. This method by its very nature is confusing, regardless of what "string atom" you've chosen for your language: Is it a number of code points? A number of visual glyphs? Something else? A string method with this name will always be ambiguous.

    Only once you start working with indices and chars retrieved from the CharSequence using its methods. In general, you should think of CharSequence as an abstract character sequence, just like String. For instance, a StringBuilder in itself isn't dangerous. It becomes dangerous when you start appending chars individually, without treating surrogate pairs as atoms. Even the reverse() method of StringBuilder does things properly.



    I've been eagerly studying your response in hopes of later writing my memoir "How I learned to stop worrying about code units and love the Java String API".

    You did two passes of eliminating concerns: everything that explicitly passed a char, and then everything that didn't.

    The following should have survived the first cut, and only have been removed in the second cut, tho I would argue that the fact we missed them does show the API is at least a *little* tricky:

    Despite the variable name of ch, indexOf(int ch), indexOf(int ch, int fromIndex), lastIndexOf(int ch) and lastIndexOf(int ch, int fromIndex) all take Unicode Characters / Unicode Code Points for the parameter named ch, but....the indexes are in code units.  Maybe that was what disqualified them?

    I do see why they would get removed in the second pass; we don't care what an index means as long as we are only passing it around from API call to API call...

    I think I will likely spend a few days catching up with all of the enormous amount of functionality in the Character class, but I do notice that there are many, many methods that take either a Java char or a code point (int), as well as some that only take a code point.  It doesn't look like there are any serious gaps in Java's Unicode implementation, as I had seen some partisan sources partial to other languages suggest might be the case, but general programmer understanding of it, and mine in particular, falls a bit short.
     
    Tim Holloway
    Whoops.

    You've confused text rendering with text encoding.

    Text is rendered by the graphics engine that runs your display terminal and printer (there's a lot of shared code there). How things line up will depend on whether you use logical or physical tabs (or tabs at all) and whether you use a monospaced font or a proportional font.

    Back in DOS days using "glass TTY" terminals or equivalents - or typewriter devices - there often wasn't a choice of fonts. The font was hardware-encoded into the terminal itself in some cases. And it was monospaced.

    Monospaced fonts are generally not as pretty or as readable as proportional fonts. In fact, my unofficial name for the Courier New font was "Courier ugly". Most displays these days are set to use proportional fonts, as are most of the fonts we use on the Ranch.

    In a proportional font, a space, an "a", an "A", and an "m" are often different widths. So just spacing stuff out often disappoints. Hardware tabs can get around that (or CSS properties on web pages), but not every rendering medium supports hard tabs and not all of them allow setting tabs to whatever spacing you want. For Unix/DOS, the traditional hardware tab was supposed to be equivalent to 8 spaces, but in programs like Microsoft Word you could set them any way you liked.

    There's a lot more that could be said, but basically, it's up to what font your display device/window is using.
     
    Jesse Silverman

    Tim Holloway wrote:Whoops.

    You've confused text rendering with text encoding.

    Text is rendered by the graphics engine that runs your display terminal and printer (there's a lot of shared code there). How things line up will depend on whether you use logical or physical tabs (or tabs at all) and whether you use a monospaced font or a proportional font.

    Back in DOS days using "glass TTY" terminals or equivalents - or typewriter devices - there often wasn't a choice of fonts. The font was hardware-encoded into the terminal itself in some cases. And it was monospaced.

    Monospaced fonts are generally not as pretty or as readable as proportional fonts. In fact, my unofficial name for the Courier New font was "Courier ugly". Most displays these days are set to use proportional fonts, as are most of the fonts we use on the Ranch.

    In a proportional font, a space, an "a", an "A", and an "m" are often different widths. So just spacing stuff out often disappoints. Hardware tabs can get around that (or CSS properties on web pages), but not every rendering medium supports hard tabs and not all of them allow setting tabs to whatever spacing you want. For Unix/DOS, the traditional hardware tab was supposed to be equivalent to 8 spaces, but in programs like Microsoft Word you could set them any way you liked.

    There's a lot more that could be said, but basically, it's up to what font your display device/window is using.



    What I am getting at is neither the "rendering" nor the "encoding", but the formatting.
    Are you going so far as to say the width specifiers in printf format strings, in Java and perhaps all languages, have been rendered meaningless??  [See what I did there?]

    In my two examples:
    printf itself adds 5 and 15 space bars to "ABCDE" (5 Unicode chars) passed to the formatters with %10s and %20s respectively.
    printf itself adds 1 and 11 space bars to my pile of poo (5 Unicode Characters) passed to the formatters with %10s and %20s respectively.
    printf width specifiers, perhaps unsurprisingly, count Java chars, or code units.

    I am trying to break my addiction to caring whether things are counted in code units or code points, but this is another place where the fact that a Java String is UTF-16 comes shining/leaking thru.

    My wife spends a few hours a day ensuring (among many other things) that various accounting and billing software doesn't screw up the columns on hard-copy.
    They get two chances to disappoint her: formatting and rendering.

    Secondly, I thought that the tt tags gave us a monospaced font that would show things like column alignments, where that mattered more than looking nice.

    I remain confused on that point.
     
    Tim Holloway
    There's a difference between space formatting (characters) and text formatting. Text formatting uses font metrics. Every character in a GUI font - including space characters - has specific metrics (width and height) defined for it. Fonts designed for word processing will actually often define several space characters of differing widths, just like they do for dashes ("em dash", "en dash", and so forth).

    Normally C-style print formatting will insert vanilla space characters (Unicode x0020) when you indicate space-padding. But when used in a proportionally-spaced display, you may not always get what you want. It's far better to use more precise positioning mechanisms in such cases.

    Also note that in HTML, multiple adjacent spaces and single spaces are all treated as only one space. To get multiple-spacing effects on a web page, you either have to use the non-breaking space character (&nbsp;) or, better, use CSS to position things.
     
    Jesse Silverman
    I am generally aware of all the issues you cited, but also obsessed with details, so I checked that I remembered how the software on this forum works:

    Testing the HTML         spaces  issue   in    plain text in a     monospaced font in   plain  text



    I appreciate the reminder to not presume naïve printf column formatting will yield expected results in HTML, but remain unsure about expectations of tt-tags usage on this forum.
     
    Tim Holloway

    When writing in HTML, the <tt> tag was used to designate inline teletype text. It was intended to style text as it would appear on a fixed-width display, using the browser's default monotype font.



    From https://html.com/tags/tt/

    Which also says that the <code> element should be used since HTML 5. Although I always see <code> used <div>-style, whereas <tt> is definitely <span>.
     
    Stephan van Hulst

    Jesse Silverman wrote:I was also thinking about how an API built around UTF-16 may have some implementation simplifications because every code unit is exactly two bytes, regardless of data, making collections of code units "RandomAccess" essentially, but with complications in use around the fact that some "code points" are two "code units" and others are one.


    Tim and I had a very interesting debate about a similar topic a while ago, in which I think we both raised good points.

    I think the "sort-of" conclusion we arrived at was to have String represent an abstract character sequence (although we didn't discuss normalization) without exposing anything about its internals, and if you really really wanted to, you could convert the string to a collection of 32-bit primitives that represent Unicode code points. I'm not certain anymore.

    UTF-8 on the other hand, has no consistency of "code unit" size, as they can be 1, 2, 3 or 4 bytes each


    Be careful. UTF-8 DOES have a fixed code unit size: 1 byte. We say that some Unicode code points are encoded by up to four UTF-8 code units, but each individual code unit is always exactly 1 byte.

    We tend to give the term "code unit" more importance than it deserves because UTF-16 code units carry a lot of meaning. UTF-8 code units are practically meaningless by themselves, unless they happen to encode a Basic Latin character.
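    For example (counting the code units as bytes):

    String poo = "💩"; // one code point, U+1F4A9
    System.out.println(poo.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);    // 4 -- four 1-byte code units
    System.out.println(poo.getBytes(java.nio.charset.StandardCharsets.UTF_16BE).length); // 4 -- two 2-byte code units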

    I am guessing that lengths in embedded format string specifiers [...] refers to Java char counts and not to "abstract characters" or bytes, right?


    Yeah I'm pretty sure it refers to the number of UTF-16 code units in a string, which is extremely unfortunate. But again, in the grand scheme it is doomed to fail regardless of what definition of "length" you use, because any definition that is useful in one scenario will fail in another scenario.

    It would have been nice if they had different ways to express different kinds of lengths. I think when formatting objects as strings you usually want to use "number of glyphs", because you want to display it with a fixed amount of pixels in a monospace font. Even this is problematic though: The ligature "fi" consists of two abstract characters, but in many fonts is displayed as a single glyph.

    There will never be a single solution for all cases.

    Anyway, while writing this up I was reminded of the UTF-8 Everywhere manifesto. It might not be exactly about what we discussed here, but it's a great read if you're interested in strings and encodings.
     
    Stephan van Hulst

    Jesse Silverman wrote:So, as you can see, I am having trouble getting sh*t to line up with printf().


    Many monospace fonts don't have monospace glyphs for emojis.
     
    Tim Holloway

    Stephan van Hulst wrote:The ligature "fi" consists of two abstract characters, but in many fonts is displayed as a single glyph.



    No, ligature "fi" has Unicode value FB01. But in OpenOffice Writer, rendering the 2-character sequence "fi" causes the rendering engine to output the fi glyph.

    You can, however, explicitly insert a ligature character. The separate characters and the ligature character both render as the same graphic, but you can put a cursor between the "f" and the "i" when the document itself contains discrete characters. You cannot split the two when the actual ligature character is used. This differs from things like quotes, where the word processor actually enters a different character than the one you typed in. The actual mechanism for the ligature rendering seems to be Graphite Smart Fonts.

    It's all done with mirrors.
     
    Tim Holloway
    Incidentally, MOST - but not all - proportional fonts make all their digit characters the same width, because typesetting tables of numbers with variable-width digits can be really ugly. So unless the numbers are intended for primarily decorative purposes, they'll align.
     
    Jesse Silverman
    Thank you Tim!

So neither HTML processing of spaces nor differences between tt and code tags are responsible for my anomalies.

It does seem that the otherwise monospaced font failed to implement a monospaced 💩 properly:
    💩💩💩💩
    AAAA




    And that may or may not be relevant to my original point.

In monospaced contexts where actual printf() formatting is used to line columns up and still matters, everything is measured in code units in Java.  You need to know you are working in UTF-16.

    If we aren't going to be using any non-BMP characters in such a manner, then it is a distinction without a difference, because in UCS-2 a code point ⇆ a code unit.
In UTF-8 the sizes of code units vary, but there is still a 1:1 relationship between size in code points and size in code units.

Using characters outside of the BMP in a Java char array or a Java String, each code point takes two code units, so for a string made entirely of such characters the number of code points is half the number of code units.  Java goes by code units, i.e. the number of Java chars.

It is a comment on modern society that I learned the trivia that the world's most popular non-BMP Unicode character, taking 4 bytes in both UTF-8 and UTF-16, was the poop emoji.  That is the only reason I picked it.  I traced the poop back to its source and got similar results:

    and then put in some work and searched for some non-BMP characters likely to be present in a monospace font:


    The difference is about how printf formatting works in Java, rather than rendering issues (which complicated and confused things when I stepped in the 💩).
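Since the original screenshots didn't survive, here is a minimal sketch of the effect (the width of 6 and the test strings are my own stand-ins):

public class PrintfWidth {
    public static void main(String[] args) {
        // %6s pads to a minimum of 6 UTF-16 code units, not 6 visible columns.
        System.out.printf("[%6s]%n", "AAAA");     // [  AAAA] -- 4 code units, so 2 pad spaces
        System.out.printf("[%6s]%n", "💩💩💩💩"); // [💩💩💩💩] -- already 8 code units, no padding

        System.out.println("AAAA".length());     // 4
        System.out.println("💩💩💩💩".length()); // 8, each U+1F4A9 is a surrogate pair
    }
}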
     
    Jesse Silverman
    Bartender

Stephan van Hulst wrote:Be careful. UTF-8 DOES have a fixed code unit size: 1 byte. We say that some Unicode code points are encoded by up to four UTF-8 code units, but each individual code unit is always exactly 1 byte.



    OK.  If nothing else, I learned to stop saying that UTF-8 represents Unicode characters with a single, variable-sized code unit because...what you said.
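A quick sketch of that distinction (counting UTF-8 code units in Java just means counting the encoded bytes):

import java.nio.charset.StandardCharsets;

public class Utf8Units {
    public static void main(String[] args) {
        String poop = "\uD83D\uDCA9"; // U+1F4A9
        // Each UTF-8 code unit is one byte; U+1F4A9 needs four of them...
        System.out.println(poop.getBytes(StandardCharsets.UTF_8).length); // 4
        // ...but only two UTF-16 code units (one surrogate pair).
        System.out.println(poop.length());                                // 2
    }
}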

Also, many or most monospaced fonts do not apply monospacing to emojis.

Oh, and I learned some stuff about tags, both in general and here on the Ranch.

This is all not as simple as I wished, but not as bad as I feared.  I am asymptotically approaching a pretty full understanding.

    Thanks everyone for your time and consideration!
     
    Tim Holloway
    Saloon Keeper
    One thing to note is that while all digits may be the same width in a proportionally-spaced font, that width may or may not be the same as the width of the standard space character.

An even more important thing to remember is that most typesetting on GUI media is done using a "best-fit" system. For example, my default WP font is "Times New Roman", but I don't have that installed, and as a result the actual font being rendered on my screen is something like Vera Serif.

The emojis are a little strange, though, since in theory monospaced fonts don't have individual character metrics like proportional fonts do (why waste space on the obvious?). But I don't think that an alternate font got swapped in for the emojis - usually if a font lacks a certain glyph, you just get a default like a rectangular placeholder box. So    (that's not a glyph, BTW, it's a rendered IMG tag!)

Ah. Found it. The emojis are being rendered via the "Twemoji Mozilla" font. The text is DejaVu Sans Mono. Note that Your Mileage May Vary, because the primary font list for Ranch text in code blocks appears to be "Consolas", "Bitstream Vera Sans Mono", "Courier New", Courier, monospace.

Note: edits were made to the above statement to clarify and correct bad typing.
     
    Tim Holloway
    Saloon Keeper
Oh, and no, Java doesn't do print formatting differently than C or any other purely character-oriented system, whether Unicode, EBCDIC, ASCII, DBCS or whatever. It's the display rendering (GUI) that's giving you the most pain.
     
    Jesse Silverman
    Bartender

Tim Holloway wrote:Oh, and no, Java doesn't do print formatting differently than C or any other purely character-oriented system, whether Unicode, EBCDIC, ASCII, DBCS or whatever. It's the display rendering (GUI) that's giving you the most pain.



    I'll buy the first part, possibly, but EBCDIC and ASCII had one byte = one character, except for the Japanese DBCS versions.
So the rendering is the biggest problem there, because it is the *only* problem...
The display rendering (GUI) is of course always the most complex part, but the difference here is mostly not rendering, e.g. the first three widths look identical because they are padded identically.

I would need to look at printf() formatting in the different modes in the different languages.  The issue can't come up in any of the SBCS implementations, because each character is just a character.  None of the SBCS systems had the equivalent of surrogate pairs, by definition.  Maybe Python does the same thing or worse with its printf().  I believe I read that internally it picks between Latin-1, UCS-2, and UCS-4 (PEP 393), preferring the smallest representation on a string-by-string basis (and preferring UCS-4 over surrogate pairs).  I was actually messing around a lot with Python before I recently came back here... thinking about it, if each string is pure Latin-1, pure UCS-2, or pure UCS-4, then this issue ALSO doesn't come up, I think.

I haven't tried in Perl, C#, C++ etc. yet, but it doesn't happen in Python 3.9 (both strings get 10 space chars prepended):

So I will agree/disagree.  All the languages do their formatting the same way, putting their pants on one char at a time, but in Python the lengths of abc_str and my_str are the same, while in Java (and probably C++ using wchar_t) one is len 18 and the other is len 10.
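(A hypothetical reconstruction of that lost snippet, since abc_str and my_str didn't survive the copy; the exact contents are my guess, and String.repeat assumes Java 11+:)

public class LengthMismatch {
    public static void main(String[] args) {
        String abcStr = "abcdefghij";          // ten BMP characters
        String myStr = "ab" + "💩".repeat(8);  // ten code points, eight outside the BMP

        System.out.println(abcStr.length());   // 10
        System.out.println(myStr.length());    // 18 -- Java counts UTF-16 code units

        // Python's len() counts code points; the closest Java analogue:
        System.out.println(myStr.codePointCount(0, myStr.length())); // 10
    }
}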

    The predominant theme of this thread had become Stephan asking the musical question "Why do you care so much about lengths?  You can mostly ignore it."
My first thought besides "Good, I can't wait" was "But what about printf()??"  One of the ways the UTF-16 encoding of Java's String type leaks through is precisely here.  There are many others, but Stephan is convincing me not to stress about them.
     
    Tim Holloway
    Saloon Keeper
    Don't use the Ranch webpage to prove your point. Those mis-aligned examples are not all using the same font. And in fact, they're not even using monospaced fonts! The code tag was not the final authority used for rendering there.

    "printf" deals with characters and things it thinks are characters. Except in cases where characters get temporarily cast to ints. In C, 7-bit ASCII was treated like it was 8-bit, although some machines used parity on the 8th bit, print formatting didn't supply parity. When they added DBCS as a compiler option, if you had DBCS turned on then it dealt with DBCS characters and in all cases, the actual character values were meaningless to printf except when they were being using as format-control characters and not as literal text. All DBCS, regardless, even as you type in 8-bit ASCII source but Java String literals are not ASCII, they're Unicode.

    If C/C++ has extended itself beyond DBCS to variable-length characters, that falls outside the time I monitored every change in the language. Initially it did not, however, and that's the model that Java is using, since how Java stores a char internally doesn't count.

    Python is a whole different bucket of snakes. It seems to be schizophrenic about whether it wants to use the OS native character set or Unicode for text operations, and I've gotten into fights with it more than once. Perl is based on old-time C. For the sake of sound sleeping, I refuse to look at what JavaScript is doing until I must.
     
    Jesse Silverman
    Bartender

    Tim Holloway wrote:Don't use the Ranch webpage to prove your point. Those mis-aligned examples are not all using the same font. And in fact, they're not even using monospaced fonts! The code tag was not the final authority used for rendering there.


    All warnings about not confusing what goes into the String objects themselves and the rendering by whatever later displays them are received and processed and comprehended.  Everything following only applies to what gets written into the String objects in each case, no rendering involved.

    Tim Holloway wrote:
    "printf" deals with characters and things it thinks are characters. ... All DBCS, regardless, even as you type in 8-bit ASCII source but Java String literals are not ASCII, they're Unicode.

    If C/C++ has extended itself beyond DBCS to variable-length characters, that falls outside the time I monitored every change in the language. Initially it did not, however, and that's the model that Java is using, since how Java stores a char internally doesn't count.



I was confused about Java String literals because of this issue, seen in 3.10.6 of the JLS:
    UnicodeInputCharacter:
       UnicodeEscape
       RawInputCharacter
    UnicodeEscape:
       \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
    RawInputCharacter:
       any Unicode character


    So, a "RawInputCharacter" can apparently be *any* valid Unicode character, but I had already seen that UnicodeEscape must be between \u0000 and \uFFFF, requiring surrogate pair encoding by hand for anything outside the BMP.  So String literals are sort of Unicode, but if you are using escape sequences to represent anything, you must know that it is encoded as UTF-16 internally.  I kept hitting that.   In Python, you can just say:
    >>> poop = '\U0001F4A9'
    >>> poop
    '💩'
In Java I was confused because "\u0001F4A9" is not a syntax error: it gets interpreted as the escape for Unicode/ASCII value 1, followed by the literal characters "F", "4", "A", "9".
    To escape poop in Java literals requires the \uD83D\uDCA9 UTF-16 surrogate pair encoding to be entered.
    That is why I was thinking of Java literals as "UTF-16" and as far as escapes go, still am.
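A small sketch of the escape pitfall (Character.toString(int) assumes Java 11+):

public class PoopEscapes {
    public static void main(String[] args) {
        // \u consumes exactly four hex digits, so this is U+0001 followed by
        // the literal characters 'F', '4', 'A', '9' -- five chars, no poop.
        String notPoop = "\u0001F4A9";
        System.out.println(notPoop.length()); // 5

        // The surrogate-pair spelling actually encodes U+1F4A9.
        String poop = "\uD83D\uDCA9";
        System.out.println(Integer.toHexString(poop.codePointAt(0))); // 1f4a9

        // A code-point-based alternative that avoids hand-built surrogates:
        System.out.println(poop.equals(Character.toString(0x1F4A9))); // true
    }
}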

    At the risk of Beating a Dead Moose:
The display rendering (GUI) is of course always the most complex part, but the difference highlighted here is upstream of any of that: it's not about rendering. The first three String widths are identical because they are padded identically; the fourth is different because Java measures lengths in char units, not glyphs, not code points.

    Tim Holloway wrote:Python is a whole different bucket of snakes. It seems to be schizophrenic about whether it wants to use the OS native character set or Unicode for text operations, and I've gotten into fights with it more than once. Perl is based on old-time C. For the sake of sound sleeping, I refuse to look at what JavaScript is doing until I must.



I had thought Python had "totally converted to Unicode" for version 3; however, the "default character set" on Windows has not changed, because of problems with using the Windows Console API to input text in UTF-8 (UTF-16 works fine, but nobody wants that anymore).
There's lots of stress and fighting there because some people want "UTF-8 by default", but there are Windows-imposed problems with delivering it on that platform.  I believe Java shares these issues, but has some of its own.  As seen, Python lets you input whole code points as escape sequences without any surrogate pairs; in fact, it doesn't use surrogate pairs internally at all in modern versions, and best of all, none of that is seen by the application programmer.  It also always measures length as a number of code points, hiding the representation details away from you.
(Default input and output streams can't default to UTF-8 on Windows because of limitations of the Windows Console, not Python: https://bugs.python.org/issue44275)
     
    Tim Holloway
    Saloon Keeper
    Well, again, the critical distinction is the difference between character (data) output and character rendering.

    On my browser, the second example is rendered using the DejaVu Sans Mono (monospaced) font, but the first is rendered using proportionally-spaced fonts.

In a mono-spaced font, spaces and characters are always the same width. In a proportionally-spaced font, every character - including space - has its own predetermined width. Often the space character is fairly narrow, especially relative to upper-case letters, which are almost always wider than the lower-case ones. Especially M and W.

    For example:
    aaaaaaaaaaX
    mmmmmmmmmmX
    AAAAAAAAAAX
    MMMMMMMMMMX
    WWWWWWWWWWX
    0000000000X
    2222222222X
             X

    The last one, of course, is spaces.


    Finally, just to muck things up further, some rendering systems may engage in some form of text justification, causing the spacing between characters to be affected.

Note that in Real Life, my own "DOS Terminal" windows use a sans-serif monospaced font (by default, anyway) and don't have any fancy typesetting characteristics, so spacing out stuff using space characters is completely straightforward. Web pages, on the other hand, are ultimately at the mercy of the fonts installed on the client's machine and whatever typesetting features their browser of preference implements. Which is why, when you want text aligned properly on a web page, you should use layout tags such as <table> and pixel-precise positioning via CSS.
     
    Jesse Silverman
    Bartender
    I totally appreciate the focus on the rendering side, which is probably more important and more widely applicable in general, not just to work in Java.

Everything I was worrying about in this thread was specific to Java/UTF-16-based String APIs until I went down the "even monospaced fonts usually don't have monospaced emojis" side quest, which was an important thing to learn.  I was generally aware of the HTML-rendering issues, but appreciated the refreshers.

    Back to my education by Stephan:
I now see that you can just think of a String as abstract characters, and not worry about the implementation detail of it actually being composed of UTF-16 chars inside, except in all of the many places where you do need to be aware of it due to leaky abstractions.  In particular, it is very important not to confuse a "character" with a "char".  Even more confusing, the wrapper class Character wraps a "char", not a "character" (which I believe could be wrapped only by an Integer).  That's not confusing at all!  A tiny sketch of that char/character mismatch follows.
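public class CharVsCharacter {
    public static void main(String[] args) {
        Character boxed = 'A';   // Character wraps a single UTF-16 char
        int codePoint = 0x1F4A9; // a full "character" (code point) needs an int

        // The char-based API can only hand back U+1F4A9 as a surrogate pair:
        char[] units = Character.toChars(codePoint);
        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
    }
}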

    😉🤷‍♂️😎
     
    Stephan van Hulst
    Saloon Keeper
    Sorry, I kind of tuned out when the post count suddenly exploded.

Tim, I think the problem that Jesse is trying to demonstrate is that the width he specified when formatting his string full of emojis is expressed as a number of UTF-16 code units, not a number of abstract characters or glyphs. He did this in response to my general advice to stop worrying about the length of strings.

My response to this is that formatting data is inherently more difficult than people think. When you specify a width in a format specifier, what is it really that you want to express? Usually such a width is used to align cells of different rows with each other, so you would want to express the width as a number of glyphs in a monospace font. There are several issues with this, the most important of which Tim has pointed out: the number of glyphs that your string is rendered as is decided by the font/rendering engine. For every output system where a certain format string will work correctly, I can hook up a different output system that will render your string horribly.
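If what you actually want is a width measured in code points, a helper is easy enough to sketch (padToCodePoints is my own invention, and String.repeat assumes Java 11+). Note that it still cannot fix the glyph-width problem described above:

public class CodePointPad {
    // Left-pad to a width measured in code points rather than UTF-16 code units.
    static String padToCodePoints(String s, int width) {
        int cps = s.codePointCount(0, s.length());
        return " ".repeat(Math.max(0, width - cps)) + s;
    }

    public static void main(String[] args) {
        System.out.println("[" + padToCodePoints("AAAA", 6) + "]");     // [  AAAA]
        System.out.println("[" + padToCodePoints("💩💩💩💩", 6) + "]"); // [  💩💩💩💩]
    }
}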

    My advice when it comes to printf() (or printing strings in general) is to only use it for debugging or for data where alignment isn't an issue. I know this flies in the face of everything I've said before, but this method is mostly used to output basic Latin. It's a product of its time.

Typesetting is HARD. It would still be hard if we had a single universal character set that had a unique code point for every possible thing that any person could consider a single glyph. Don't expect an old holdover method from C to hide this complexity for you. If you want to do advanced formatting, you might want to use a software package like LaTeX.
     
    Tim Holloway
    Saloon Keeper
The only reason why ints and chars can be so freely slopped back and forth is that C was originally implemented as a "high-level assembly language", and most computers don't have an explicit character type in hardware. Even the venerable IBM "MVC" (MoVe Characters) instruction actually only moved raw bytes. Equating bytes and ints is another sin entirely. We'll skip that one.

And I believe we discussed earlier that characters in the abstract don't have actual integer values, but rather are indexes into their code pages for collating and conversion purposes. I think that that's a formal definition, in fact. It becomes most noticeable when dealing with character sets whose bit occupancy depends on the value and/or context of a character.

But also, don't forget that the reason the emojis weren't rendering at the same width as the characters was that the Ranch renderer was employing a different font for the emojis than for the text. It's not that monospaced fonts have magical "monospace-free" zones in them!
     
    Tim Holloway
    Saloon Keeper

Stephan van Hulst wrote:Don't expect an old holdover method from C to hide this complexity for you. If you want to do advanced formatting, you might want to use a software package like LaTeX.


    The basic C (and Java) format capabilities format text data, not the display. We have entirely different APIs to render text in a graphical environment.

Again, in the Real World, if you're not targeting GUI displays, the format of the data alone generally suffices. That is, most people presenting text in a terminal display will be doing so via a monospaced font. Most terminal programs only allow one font, since font selection can only be done on terminals via escape sequences, if at all.

    But there's no absolute guarantee. I can go to my terminal window preferences and select a proportional font in any size and style I want if it pleases me to do so. Since the terminal windows are managed by the Windowing system (GUI), the rendering of proportional-font text will then obey the basic proportional text rendering characteristics of my Windowing system. If I drop out of the Window GUI down to true text (fullscreen terminal)  mode, it's a different story.
     
    Jesse Silverman
    Bartender

    Tim Holloway wrote:...

    But also, don't forget that the reason the emoji's weren't rendering at the same width as the characters was that the Ranch renderer was employing a different font for the emoji than for the text. It's not that monospaced fonts have magical "monospace-free" zones in them!



Is this the stuff that "Windows Terminal" does, but the 1978-style "Legacy Console" does not, that was referred to in my Python bug report discussion?

    Windows Terminal provides a tab-based UI that supports font fallback and complex scripts, which is a significant improvement over the builtin terminal that conhost.exe provides.

     
    Tim Holloway
    Saloon Keeper
    Very likely. Although since I no longer run Windows I'm not really sure what its various consoles and shells support these days. Not like when it was a simple "DOS Box" and COMMAND.COM.

    Some terminals and text programs will allow you to copy-and-paste text from other sources while preserving the fonts, styles, point sizes, and so forth even when they have only one native (default) font. The clipboard itself has allowed that extra metadata since the very beginning, and it's simply up to the recipient to determine which of several possible forms to actually paste.
     
    Jesse Silverman
    Bartender
    Even Windows Terminal now has all that jazz:
    https://docs.microsoft.com/en-us/windows/terminal/customize-settings/interaction

    I notice that my profile defaults to a color scheme that should be welcomed around here (well, at least its NAME should be welcomed):
    Campbell

    and the font I have been using for everything, because I am still thinking of my text output in terms of columns until Stephan cures me of it:
Cascadia Mono

Thinking of printf() as deprecated will be difficult for me, first because I have been using it since before I was allowed to vote, and second because it (or mild improvements that stick to its original design minus a few flaws) is available not just in C/C++/Java/C# but also in Python.  I was surprised for a while by how many multi-lingual Java programmers flocked to it, but am now used to it.  Even the advanced replacements in C# and Python still have places for column-width specifiers in "improved" syntaxes.  It will take me some time to think of all that as "legacy" when doing console output on any platform...

    I think I am done with this topic until and unless something bites me in the butt and I need to ask about it.

    I expect that will be after I am done with serialization when I seriously look at Console input options in Java.  I think I have already seen issues with (Windows-specific) UTF-8 input there (that transcend language choice), in addition to hearing about them.

     