Paul Clapham wrote:. . . in an alternate Java universe there could be the constructor . . . but in our universe there isn't.
Which is why OP has been getting the compile-time error. Of course, new String(new int[]{1, 2, 3}, 0, 3); would be invisible, since those are non-printing characters. Who knows what would happen if you included 4 or 26, because those are control characters (EOT and SUB respectively).
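For the record, the int[] overload used above does exist; feed it printable code points instead of control characters and you can see it work. A minimal sketch:

    // String(int[] codePoints, int offset, int count) accepts any valid
    // Unicode code points, including supplementary ones.
    String s = new String(new int[] {72, 105, 0x1F600}, 0, 3);
    System.out.println(s); // "Hi" followed by the U+1F600 emoji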
Campbell Ritchie wrote:Would you regard a CodePoint class as a wrapper for an int or similar, Stephan?
Stephan van Hulst wrote:The strongest case against int is that some values may not even represent existing or legal codepoints.
Stephan van Hulst wrote:Not an int but a byte sequence that encodes a single Unicode codepoint in UTF-8. This of course is an implementation detail.
Wikipedia wrote:any of the numerical values that make up the codespace
Tim Holloway wrote:Using codepoints to assemble a string requires you to know specific encoding values for a specific character encoding.
Unlike many languages, Java doesn't just allow you to shovel bytes into a character string. That Java supports codepoints at all is because it understands that sometimes you have to work with interfaces that don't deal directly with characters, so you need something that can translate from raw bits. For internal use, Java's character and string classes can generally deal with codepage translations without resorting to direct codepoint manipulation.
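For instance, a minimal round trip through two standard charsets; nothing here touches code points directly:

    import java.nio.charset.StandardCharsets;

    byte[] latin1 = { (byte) 0xE9 };  // 'é' in ISO-8859-1
    String s = new String(latin1, StandardCharsets.ISO_8859_1); // decode
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);           // re-encode
    // utf8 is now { (byte) 0xC3, (byte) 0xA9 }: same character, different bytes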
Tim Holloway wrote:Assuming that it's comprised of bytes, or indeed of any concrete quantity defeats the purpose.
It not only constrains the size of the codepoint space, it assumes the ordering of the bits.
And the means by which they are ordered (endian-ness). And even today, there are still machines that enumerate bytes in continuous order rather than bytewise-discontinuous order. To say nothing of when you serialize them into bitstreams.
It is reasonable to accept that a codepoint is a numerical value in the domain 0..infinity, although practical needs lead us to subset that into the Java integer space.
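In Java terms that subsetting is already pinned down by the Character class, which bounds the codespace at 0x10FFFF:

    // Java bounds Unicode code points to the range 0x0..0x10FFFF
    System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
    System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
    System.out.println(Character.isValidCodePoint(0x110000)); // false: past the end
    System.out.println(Character.isValidCodePoint(-1));       // false: no negatives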
Tim Holloway wrote:Unicode codepoints are codepoints into the Unicode character set. Java may internally work with Unicode, but it does not otherwise prefer one character set over any other.
And again, the ASCII family is not the only codepage out there. EBCDIC is still very much alive and well, for example. That's not even allowing for legacy support where "bytes" could be 12 bits or similar weirdness.
The Unicode nbsp character designation isn't a codepoint. It's an Entity Name for the Unicode non-break space character. Entity names can resolve to different codepoint values depending on the code page they target.
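Java will happily decode EBCDIC too, provided the JRE ships the charset; IBM500 below is present in typical JDKs but not guaranteed by the standard, hence the guard:

    import java.nio.charset.Charset;

    // "Hello" in EBCDIC code page 500: H=0xC8, e=0x85, l=0x93, o=0x96
    byte[] ebcdic = { (byte) 0xC8, (byte) 0x85, (byte) 0x93, (byte) 0x93, (byte) 0x96 };
    if (Charset.isSupported("IBM500")) {
        System.out.println(new String(ebcdic, Charset.forName("IBM500"))); // Hello
    }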
Tim Holloway wrote:They are usually going to be a transient intermediate stage between some sort of external media and/or foreign character set and Java's own Unicode objects.
I'm very much into type safety, but attempting to make codepoints distinct object types strikes me as way more trouble than it's worth. You could argue about making codepoints be a primitive - and in fact, you could actually do so in Ada - but Java hasn't gone that route and I don't expect it to.
Tim Holloway wrote:
I'll split the difference on codepoint values. Putting a range limit on them means potentially limiting the size of codespaces. Unicode doesn't even take the full potential 16-bit range itself, as I recall.
Wikipedia wrote:The Unicode Technical Committee rejected the Klingon proposal in May 2001 on the grounds that research showed almost no use of the script for communication, and the vast majority of the people who did use Klingon employed the Latin alphabet by preference.
Mike Simmons wrote:. . . previously revised its own standard . . . I'm not sure if they have a backup plan . . . . I wouldn't mind excluding negative numbers though.
They seemed to have a backup plan for extending Unicode 1, but that took 0x0800 codes out of use (0xd800...0xdfff inclusive). Since there are unused codes in many of the tables, that would reduce the available total even more. That Klingon proposal, for example, would have used 0x30 (48 decimal) codes to store 0x25 (37 decimal) characters.
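Those 0x0800 reserved codes are visible straight from the Character constants:

    // The surrogate block 0xD800..0xDFFF is carved out of the BMP so UTF-16
    // can address the other sixteen planes with two-code-unit pairs.
    System.out.println(Integer.toHexString(Character.MIN_SURROGATE)); // d800
    System.out.println(Integer.toHexString(Character.MAX_SURROGATE)); // dfff
    System.out.println(Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1); // 2048 == 0x800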
. . . we still have methods like the current charAt() . . .
Backwards compatibility. Why do we have datatypes like float, char, and short in the first place? Do they have any use nowadays? Are they only there for completeness' sake, or because they reflect the gamut of datatypes in C? And why didn't they ever implement the unsigned keyword? Is there anything you can do with chars that you can't do just as well with ints, if you confine yourself to the codePointAt(int) method when you use Strings? Backwards compatibility to the days when 16-bit computers were still in use and a 1GB hard drive cost 50× what a 1TB hard drive costs nowadays means they have to maintain charAt() and similar. But, that's where we are.
Campbell Ritchie wrote:Backwards compatibility. Why do we have datatypes like float, char, and short in the first place? Do they have any use nowadays?
Mike Simmons wrote:Ideally (for me), a proper modern String class would have charAt() return a code point as an unsigned int, and there would be no legacy method that returned a char. There would be no char type.
Tim Holloway wrote:
Mike Simmons wrote:Ideally (for me), a proper modern String class would have charAt() return a code point as an unsigned int, and there would be no legacy method that returned a char. There would be no char type.
Why? The whole point of characters is to avoid all the messiness that I recounted - from personal experience - from FORTRAN. What you're advocating isn't charAt(), but codePointAt(), which is trivially done in Java by casting the charAt() return value to an int.
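That cast does agree with codePointAt() for everything in the Basic Multilingual Plane; it's only supplementary characters where the two part ways, since charAt() sees a single UTF-16 code unit:

    String bmp = "A";
    System.out.println((int) bmp.charAt(0)); // 65
    System.out.println(bmp.codePointAt(0));  // 65: identical for BMP characters

    String clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
    System.out.println(clef.length());        // 2: stored as a surrogate pair
    System.out.println((int) clef.charAt(0)); // 55348 (0xD834): only the high surrogate
    System.out.println(clef.codePointAt(0));  // 119070 (0x1D11E): the whole code point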
Campbell Ritchie wrote:Maybe if I read MS' code, it will have the hex values in…
Campbell Ritchie wrote:Do you get 19 chars for 11 code points in UTF‑8 or UTF‑16? What are the hex values of those two numbers? I can't read decimal
Mike Simmons wrote:It's 19 chars as Java chars. As bytes, that apparently converts to 40 bytes under UTF-16 (I was expecting 38; not sure why 2 more) or 41 bytes under UTF-8.
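The two extra bytes under UTF-16 are almost certainly the byte-order mark: the "UTF-16" charset prepends a BOM when encoding, while UTF-16BE does not. We don't know Mike's exact string, but this hypothetical one reproduces all the same counts:

    import java.nio.charset.StandardCharsets;

    // Hypothetical: 8 supplementary-plane emoji (2 chars each) + 3 CJK characters
    String s = "\uD83D\uDE00\uD83D\uDE01\uD83D\uDE02\uD83D\uDE03"
             + "\uD83D\uDE04\uD83D\uDE05\uD83D\uDE06\uD83D\uDE07"
             + "\u4E2D\u6587\u5B57";
    System.out.println(s.length());                                   // 19 chars
    System.out.println(s.codePointCount(0, s.length()));              // 11 code points
    System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 41
    System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 40 (38 + 2-byte BOM)
    System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 38, no BOM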
Tim Holloway wrote:Consider: the bulk of Western text strings can be represented by the bottom 256 codepoints of Unicode. I could pack them into bytes, rather than halfwords, typically saving a lot of space and only a small penalty for access time.
I could extend this mechanism to have a "run header", based on the concept that in many cases you might have primarily one segment, with rare outliers. Here I can continue the original space optimisation, but trade some access time for the ability to directly access the outliers, even if they required 16 or even 24 bits. Because I wouldn't have to scan from the beginning to randomly find a character without being thrown off by surrogates.
Tim Holloway wrote:. . . the bulk of Western text strings can be represented by the bottom 256 codepoints of Unicode. I could pack them into bytes . . . you could invent a MetaString class . . .
I think they did something like that in Java 9 (Compact Strings, JEP 254). I also notice Paul C has already told us that more accurately.
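What JEP 254 does inside String is roughly this, sketched as a toy; the hypothetical PackedString below is illustrative only (the real String keeps a byte[] plus a one-byte "coder" field):

    import java.nio.charset.StandardCharsets;

    // Toy version of the Compact Strings idea: one byte per character when the
    // text fits in Latin-1, two bytes (UTF-16BE code units) otherwise.
    final class PackedString {
        private final byte[] bytes;
        private final boolean latin1;

        PackedString(String s) {
            this.latin1 = s.chars().allMatch(c -> c < 256);
            this.bytes = s.getBytes(latin1 ? StandardCharsets.ISO_8859_1
                                           : StandardCharsets.UTF_16BE);
        }

        char charAt(int index) {
            return latin1
                ? (char) (bytes[index] & 0xFF)
                : (char) (((bytes[2 * index] & 0xFF) << 8) | (bytes[2 * index + 1] & 0xFF));
        }
    }

A Latin-1-only string halves its footprint; anything else pays the same two bytes per char it always did.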
Campbell Ritchie wrote:But English isn't necessarily the most widely used language in the World. Chinese is the most widely used first language. Most Asian writing can't be coded with ASCII or similar.