Anil Philip

Ranch Hand
since Sep 06, 2003

Recent posts by Anil Philip

Stephan van Hulst wrote:No, I expect you to come up with a simple example that shows that your assertions are more than just wild conjecture on your part.
I tire of this discussion.


I tire too.
As I have said before, here is how I would do it:

Anil Philip wrote:
All this while I have been trying to say that a "word" should not occupy the same space as a "character" in the set, just because it is an ideograph.
Observe that you used 9 characters to name the ideograph: 'therefore'.
That ideograph belongs to graphics languages and not to character sets.



There should be a Words or Concepts dictionary for the XYZ language. name---->value lookup.
"therefore" ---------> {steps or instructions on how to draw it; how to compose it from the ideograph primitives OR if this is a primitive, then how to draw it using bitmaps or vector graphics}
3 months ago
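
The name-to-value lookup proposed in the post above could be sketched roughly as follows. This is only an illustration of the idea, not a design anyone has put forward: the class `ConceptDictionary`, the record `Glyph`, the primitive named "dot" and the drawing strings are all hypothetical placeholders.

```java
import java.util.List;
import java.util.Map;

public class ConceptDictionary {
    // Hypothetical value type: the primitives a concept is composed from,
    // plus free-form drawing instructions. A primitive has no parts.
    record Glyph(List<String> parts, String drawing) {}

    // name ----> value lookup for the imaginary "XYZ" language.
    static final Map<String, Glyph> XYZ = Map.of(
            "dot", new Glyph(List.of(), "filled circle"),
            "therefore", new Glyph(List.of("dot", "dot", "dot"),
                    "one dot above, two dots below"));

    // Returns the entry for a named concept, or fails if it is unknown.
    static Glyph lookup(String name) {
        Glyph glyph = XYZ.get(name);
        if (glyph == null) {
            throw new IllegalArgumentException("unknown concept: " + name);
        }
        return glyph;
    }

    public static void main(String[] args) {
        System.out.println(lookup("therefore"));
    }
}
```

Note that a lookup like this still needs some agreed key (here the English name) to identify each entry in stored text.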

Stephan van Hulst wrote:
Maybe, but you still haven't suggested a primitive that is both easy to work with and easy to implement technically.


Do you expect me to come up with a design just like that? These are highly paid professionals employed fulltime on it.
Unfortunately, if I could, I would be like them - which I am not. All I can do is point out "The Emperor Has No Clothes".

Stephan van Hulst wrote:Again, what are these primitives, according to you? For many ideographs, they can't be composed of letters or syllables, because they don't have a single or even a known pronunciation.
For simplicity's sake, let's take a symbol that is not tied to one particular language: ∴
This is the therefore sign. It doesn't have a single fixed pronunciation. It's just a symbol that expresses an abstract concept. How are you going to compose it from smaller primitives?



All this while I have been trying to say that a "word" should not occupy the same space as a "character" in the set, just because it is an ideograph.
Observe that you used 9 characters to name the ideograph: 'therefore'.
That ideograph belongs to graphics languages and not to character sets.
(by the way, I have never seen anyone use this ideograph in the USA - not even in Math textbooks. So it is not universal.)
3 months ago

Mike Simmons wrote: My own limited understanding is that while some ideograms can be broken into constituent parts, many cannot.  And even when they could be, the sum can be somewhat different than the parts, and this reductionist approach can miss important subtleties of how two ideograms may be different.  Early versions of Unicode attempted to simplify things enough that all the Chinese, Japanese, and Korean ideograms could be fit into 16-bit code points like Java's char - but this was ultimately unsuccessful, as native speakers pushed back and found that these representations were insufficient for all their use cases.  So modern Unicode has expanded to many more characters than fit into 16 bits.  If they don't fit your definition of "character", well, that's unfortunate.  But I suspect that most of those language users will continue to view their languages in ways that make sense to them, not you, regardless of how you feel about it.


I think you are too easily accepting of the "reasons" - if they hadn't had to move from Unicode to codepoints, then we could say it was well designed and satisfactory.
Perhaps all human skills or endeavors are "words" composed of primitives.
For instance, cooking - compound techniques (like blanching) are made up of simpler techniques (chopping, peeling, sautéing, boiling...);
recipes are combinations of ingredients and basic operations.
Similarly with martial arts, music, or even driving a car.
So I don't accept that the ideographs cannot be broken down into primitives.
These are fellow human beings - not aliens from another solar system - their languages are bound to have primitives composed into words which represent concepts.
In fact, I wonder if the refusal may be political - I remember as a schoolboy growing up in another country,
listening to a lecture in assembly about how there was, once upon a time, an attempt to move to a scientific date and time (base-10 decimals?)
but it was allegedly shot down by the Judeo-Christian lobby in America because the Sabbath would be eliminated.

3 months ago

Paul Clapham wrote:

Anil Philip wrote:it should be up to the native language speakers to tell us what the primitives are in that picture language.


And then there are scripts where you can't just casually go out and ask native speakers... consider Korean for example. Who are you going to ask?



What?! I would just go tomorrow (Sunday) and ask any of the Korean people at the Korean-American church we attend.
I did try to learn Hangul a long time ago (and forgot it all), but I remember that it also has vowel and consonant primitives - and it does have ideograph picture symbols.
3 months ago

Stephan van Hulst wrote:

Anil Philip wrote:Just because a word is in a glyph as in Ron's example 狗 (dog) doesn't mean it should be a character.


What is a "character", according to you?

And if 狗 is not a character, from which characters do you propose we compose it?



In the 3 Asian/Middle-Eastern languages I mentioned, the primitives are vowels and consonants.
Words representing compound concepts are composed from these primitives.
To your second question, it should be up to the native language speakers to tell us what the primitives are in that picture language.
3 months ago

Mike Simmons wrote: And yes, once again, these tricks are not obvious - but they are part of the standard repertoire of transformations available when using streams.


Thank you for your solutions - they are really cool!
I did not know that

converts a codepoint into a character String.
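
The exact call referred to in this post did not survive in the text, so the following is only a sketch of standard JDK ways to turn a code point into a String, not necessarily the one from the thread. The class name `CodePointToString` and its method names are made up for illustration; `Character.toString(int)` needs Java 11 or later.

```java
public class CodePointToString {
    // Java 11+: the int overload takes a code point, not a char.
    static String viaCharacter(int codePoint) {
        return Character.toString(codePoint);
    }

    // Works on older versions too: toChars returns one or two chars.
    static String viaToChars(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // Take the first code point of a one-ideograph string and convert it back.
        int dog = "狗".codePointAt(0);
        System.out.println(viaCharacter(dog));
        System.out.println(viaToChars(dog));

        // In a stream, each code point becomes its own one-symbol String.
        "a狗".codePoints()
                .mapToObj(cp -> Character.toString(cp))
                .forEach(System.out::println);
    }
}
```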
3 months ago

Stephan van Hulst wrote:

Anil Philip wrote:狗 (dog) should not be a character. It is a word.


Who are you to decide that characters may not correspond to ideograms? That's, in my opinion, a very colonialist way of thinking.


"Colonialist"?
Hahaha!
The non-European languages I know are Hindi, Kannada, Hebrew (a little). None of them are colonial.
The former two are Indo-Aryan from Sanskrit roots. All three are ancient.
Each letter in all these languages is a primitive which doesn't stand for any concept, and words are composed out of these.
Just because a word is in a glyph as in Ron's example 狗 (dog) doesn't mean it should be a character.
3 months ago

Ron McLeod wrote:

Anil Philip wrote:because emojis and gifs should be treated like 'words' and not 'characters'.


But when using Unicode representation, letters, numbers, punctuation, emojis, and other symbols are all characters, including characters which represent what could be considered words in alphabetic languages such as 狗 (dog) or 人 (person).



狗 (dog) should not be a character. It is a word. No wonder they ran out of space!

Ron McLeod wrote:

Anil Philip wrote:Emojis are drawings, built out of primitives.


Well, you could say the same for letters of the alphabet - all except maybe the letter I are formed using multiple primitives like straight strokes and arcs.



Emojis are drawings which are 'words' in the sense they represent concepts. They are not primitives. 'a' is a primitive. It doesn't represent anything.
Taylor Swift could come up with an unusual squiggle and post it on X-Twitter, then it becomes all the rage and in everyday use and before you know it, it is now allotted a character for itself - is this how it works?
And then 30 years later, she and her character are long forgotten and the character slot is wasted.
3 months ago

Paul Clapham wrote:
People routinely write messages which include both "ordinary" characters and emojis. I can't imagine why any designer would propose treating them differently when the people writing the messages don't.


Because emojis and gifs should be treated like 'words' and not 'characters'. Words are secondary and composed of characters, which are primary. Emojis are drawings, built out of primitives.
Also new words come into usage and die out; characters are more permanent.
3 months ago
From what I understood, code-points came about because 'they' wanted to use emojis and add all sorts of graphics symbols to standard character sets that are used by world languages.
It makes me wonder why they did not have a separate character set(s) for graphics symbols.
3 months ago

Stephan van Hulst wrote:

Anil Philip wrote:ask them how to produce a stream of Character from a String


But why? What realistic use case would that serve? As I said before, there aren't many legitimate use cases for manipulating individual characters.


What about the original post - the coding question - do you think it's unrealistic?!
3 months ago

Campbell Ritchie wrote:AP: please avoid "editing" old posts; that can make it confusing to know what was known when somebody replied.


The system will not allow you to edit a post after it has a reply. I only edited my most recent post. There were no replies on top of it.
3 months ago

Liutauras Vilda wrote: https://horstmann.com/unblog/2023-10-03/index.html


Thanks for the article by Cay Horstmann.

Stop Using char in Java. And Code Points.
TL;DR
Stay away from char and code points
Just use String
Treat indexOf/substring indexes as opaque values
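
One way to read the last point of that summary is sketched below: take the value returned by `indexOf` and hand it straight back to `substring`, without treating it as a count of visible symbols. The class `OpaqueIndexes` and its helper methods are hypothetical names used only for this example.

```java
public class OpaqueIndexes {
    // Returns everything after the first occurrence of marker, or "" if absent.
    static String afterMarker(String text, String marker) {
        int index = text.indexOf(marker);   // opaque position, only passed on
        if (index < 0) {
            return "";
        }
        return text.substring(index + marker.length());
    }

    // Returns everything before the first occurrence of marker, or the whole text if absent.
    static String beforeMarker(String text, String marker) {
        int index = text.indexOf(marker);
        return index < 0 ? text : text.substring(0, index);
    }

    public static void main(String[] args) {
        String entry = "dog=狗";
        System.out.println(beforeMarker(entry, "="));
        System.out.println(afterMarker(entry, "="));
    }
}
```

Because the indexes only travel from `indexOf` to `substring`, the code never has to know how many chars each symbol occupies.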


3 months ago

Stephan van Hulst wrote:
So your standard for good API design is whether people have it memorized?
...
I consider myself a very talented programmer in many different languages, but in none of them do I have the API memorized. Even for classes that I use on a daily basis, I will have to refer to the documentation to find out what the methods do exactly.


No, you misunderstood.
It is about knowing that each of the members of the IntStream must be cast to a char and then mapped to an Object.
If average developers do not know this then the problem is with the API being abstruse.
And we're talking about a String of chars. It should have had a straightforward conversion API method.

A String represents a string in the UTF-16 format

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/String.html#chars()

The Character class wraps a value of the primitive type char in an object.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Character.html#isLetter(char)

String already has a method

public IntStream codePoints()


Perhaps it should have

public Stream<Character> unicodes()


?
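
For reference, the cast-then-map conversion discussed in this post can be sketched as below, next to a code-point based variant. `String` has no `unicodes()` method; the class `CharacterStreams` and its two helpers are hypothetical names for this illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CharacterStreams {
    // chars() yields an IntStream of UTF-16 code units; each int is cast
    // to char and then boxed to Character by mapToObj.
    static Stream<Character> characters(String s) {
        return s.chars().mapToObj(c -> (char) c);
    }

    // Code-point variant: one String per code point, so a symbol outside
    // the 16-bit range stays in one piece (Character.toString(int) is Java 11+).
    static Stream<String> codePointStrings(String s) {
        return s.codePoints().mapToObj(cp -> Character.toString(cp));
    }

    public static void main(String[] args) {
        List<Character> letters = characters("abc").collect(Collectors.toList());
        System.out.println(letters);

        List<String> symbols = codePointStrings("a狗").collect(Collectors.toList());
        System.out.println(symbols);
    }
}
```

The first helper splits a symbol that needs two chars into its two halves, which is the usual argument for the second form.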
3 months ago

Stephan van Hulst wrote:And the String class DOES have a method that converts a string instance to an IntStream: String.chars().


I don't understand the issues you mention about codepoints. I have never used them.
But I do know this: pick, say, 10 average Java developers at random and ask them how to produce a stream of Character from a String.
Don't allow them to look up the solution on the internet.
I would be surprised if even half of them could answer off the top of their heads.
To me it is good API design vs bad API design.
3 months ago