Stephan van Hulst wrote:No, I expect you to come up with a simple example that shows that your assertions are more than just wild conjecture on your part.
I tire of this discussion.
Anil Philip wrote:
All this while I have been trying to say that a "word" should not occupy the same space as a "character" in the set, just because it is an ideograph.
Observe that you used 9 characters to name the ideograph: 'therefore'.
That ideograph belongs to graphics languages and not to character sets.
Stephan van Hulst wrote:
Maybe, but you still haven't suggested a primitive that is both easy to work with and easy to implement technically.
Stephan van Hulst wrote:Again, what are these primitives, according to you? For many ideographs, they can't be composed of letters or syllables, because they don't have a single or even a known pronunciation.
For simplicity's sake, let's take a symbol that is not tied to one particular language: ∴
This is the therefore sign. It doesn't have a single fixed pronunciation. It's just a symbol that expresses an abstract concept. How are you going to compose it from smaller primitives?
Mike Simmons wrote: My own limited understanding is that while some ideograms can be broken into constituent parts, many cannot. And even when they could be, the sum can be somewhat different from the parts, and this reductionist approach can miss important subtleties of how two ideograms may differ. Early versions of Unicode attempted to simplify things enough that all the Chinese, Japanese, and Korean ideograms could fit into 16-bit code points like Java's char - but this was ultimately unsuccessful, as native speakers pushed back and found that these representations were insufficient for all their use cases. So modern Unicode has expanded to many more characters than fit into 16 bits. If they don't fit your definition of "character", well, that's unfortunate. But I suspect that most of those language users will continue to view their languages in ways that make sense to them, not you, regardless of how you feel about it.
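The point about modern Unicode outgrowing 16 bits is easy to demonstrate in Java itself. Here is a minimal sketch (class name is illustrative) showing that a code point outside the Basic Multilingual Plane, such as U+1F600, occupies two char units (a surrogate pair) but counts as a single code point:

```java
// Why one 16-bit Java char can no longer hold every Unicode character:
// U+1F600 (grinning face emoji) lies outside the Basic Multilingual
// Plane and is stored in UTF-16 as a surrogate pair of two chars.
public class SupplementaryDemo {
    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00"; // surrogate pair for U+1F600
        System.out.println(smiley.length());                           // 2 char units
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1 code point
    }
}
```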
Paul Clapham wrote:
Anil Philip wrote:it should be up to the native language speakers to tell us what the primitives are in that picture language.
And then there are scripts where you can't just casually go out and ask native speakers... consider Korean for example. Who are you going to ask?
Stephan van Hulst wrote:
Anil Philip wrote:Just because a word is in a glyph as in Ron's example 狗 (dog) doesn't mean it should be a character.
What is a "character", according to you?
And if 狗 is not a character, from which characters do you propose we compose it?
Mike Simmons wrote: And yes, once again, these tricks are not obvious - but they are part of the standard repertoire of transformations available when using streams.
Stephan van Hulst wrote:
Anil Philip wrote:狗 (dog) should not be a character. It is a word.
Who are you to decide that characters may not correspond to ideograms? That's, in my opinion, a very colonialist way of thinking.
Ron McLeod wrote:
Anil Philip wrote:because emojis and gifs should be treated like 'words' and not 'characters'.
But when using Unicode representation, letters, numbers, punctuation, emojis, and other symbols are all characters, including characters which represent what could be considered words in alphabetic languages such as 狗 (dog) or 人 (person).
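Ron's point can be checked directly against the java.lang.Character API, which follows the Unicode character model. A small sketch (class name is illustrative) showing that Unicode classifies 狗 (U+72D7) as both a letter and an ideograph, i.e. a single character in the standard's model:

```java
// Unicode, as exposed through java.lang.Character, treats the CJK
// ideograph U+72D7 (dog) as one character: it is a letter (general
// category Lo) and is flagged as ideographic.
public class IdeographDemo {
    public static void main(String[] args) {
        int dog = "\u72D7".codePointAt(0); // U+72D7
        System.out.println(Character.isLetter(dog));      // true
        System.out.println(Character.isIdeographic(dog)); // true
    }
}
```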
Ron McLeod wrote:
Anil Philip wrote:Emojis are drawings, built out of primitives.
Well, you could say the same for letters of the alphabet - all except maybe the letter I are formed using multiple primitives like straight strokes and arcs.
Paul Clapham wrote:
People routinely write messages which include both "ordinary" characters and emojis. I can't imagine why any designer would propose treating them differently when the people writing the messages don't.
Stephan van Hulst wrote:
Anil Philip wrote:ask them how to produce a stream of Character from a String
But why? What realistic use case would that serve? As I said before, there aren't many legitimate use cases for manipulating individual characters.
Campbell Ritchie wrote:AP: please avoid "editing" old posts; that can make it confusing to know what was known when somebody replied.
Liutauras Vilda wrote: https://horstmann.com/unblog/2023-10-03/index.html
Stop Using char in Java. And Code Points.
TL;DR
Stay away from char and code points
Just use String
Treat indexOf/substring indexes as opaque values
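The "opaque values" advice in that summary can be sketched as follows (class name is illustrative): use the index returned by indexOf only to feed back into substring, never as a count of user-perceived characters, since an index measures UTF-16 char units:

```java
// Sketch of the "opaque index" advice: the int from indexOf is a
// position in UTF-16 char units, not a character count. Feed it back
// into substring, and advance past a match by the match's length().
public class OpaqueIndexDemo {
    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";     // U+1F600, two char units
        String s = "I am " + emoji + " today";
        int i = s.indexOf(emoji);          // opaque position, not a count
        String before = s.substring(0, i);              // "I am "
        String after = s.substring(i + emoji.length()); // " today"
        System.out.println(before + "[emoji]" + after); // I am [emoji] today
    }
}
```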
Stephan van Hulst wrote:
So your standard for good API design is whether people have it memorized?
...
I consider myself a very talented programmer in many different languages, but in none of them do I have the API memorized. Even for classes that I use on a daily basis, I will have to refer to the documentation to find out what the methods do exactly.
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/String.html#chars()
A String represents a string in the UTF-16 format
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Character.html#isLetter(char)
The Character class wraps a value of the primitive type char in an object.
public IntStream codePoints()
public Stream<Character> unicodes()
Stephan van Hulst wrote:And the String class DOES have a method that converts a string instance to an IntStream: String.chars().
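The difference between the two stream methods mentioned above is easy to see on a string containing a character outside the BMP. A short sketch (class name is illustrative) contrasting String.chars(), which yields UTF-16 char units, with String.codePoints(), which yields full Unicode code points:

```java
// String.chars() streams UTF-16 char units (surrogates included),
// while String.codePoints() streams whole Unicode code points.
import java.util.stream.Collectors;

public class StreamDemo {
    public static void main(String[] args) {
        // U+72D7 (dog ideograph) plus U+1F436 (dog-face emoji, a surrogate pair)
        String dogWord = "\u72D7\uD83D\uDC36";
        System.out.println(dogWord.chars().count());      // 3 char units
        System.out.println(dogWord.codePoints().count()); // 2 code points
        System.out.println(dogWord.codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.joining(" ")));       // 72d7 1f436
    }
}
```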