
Confused about UTF and octal encoding

 
Puspender Tanwar
Ranch Hand
Posts: 499
2
Java
Hi Ranchers,
I was playing with some code and ran into some confusion.





When I read up on it, I learned that Java uses UTF-16 for Java source code encoding. But I am unable to relate this to my issue.
If anyone can also point me to a good resource on this topic, that would be great.

Thanks
 
Dave Tolls
Ranch Foreman
Posts: 3058
37
The first one is because the compiler interprets any integer literal as an int.

The second two are because that's how an octal escape is defined: an escape character followed by an integer up to (I think) 255.
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Puspender Tanwar wrote:When I read up on it, I learned that Java uses UTF-16 for Java source code encoding. But I am unable to relate this to my issue.
If anyone can also point me to a good resource on this topic, that would be great.

Well, the first one I can think of is the Character class itself.

Characters - and especially character encodings - are NOT simple; and there's a lot of history behind them.
Originally, computers only dealt with the English alphabet (52 characters), control codes, and a few other common symbols like '/', '-' and '*', because they fit in a very small space (7 bits), which in turn fits nicely into a byte (8 bits). And for ages, there were two basic standards for encoding: ASCII (the 'A' standing for 'American') - used by a lot of early Unixes and (I think) DEC - and EBCDIC, which was used by IBM and ICL.

However, over time, especially with the advent of desktop systems, people wanted to see their own languages - French, German, Spanish, etc - represented, and these have a lot of diacritics or "accents" that English doesn't - 'é' is not the same thing as 'e' in French. Then there are the Greek and Cyrillic alphabets; and when you get to pictogram alphabets like Chinese, there are about 3,000 commonly used symbols.

Obviously, 8 bits can't cope with those sorts of numbers so, by the time Java became a reality, there was already a standard in place called Unicode, which used 16 bits (≈65,000 values) to cover most of the world's alphabets, and this was the one that Java opted for - which is why Java characters are normally TWO bytes (16 bits) long.

Problem is, with the advent of browsers, and HTML, even 16 bits isn't enough, so Unicode was extended to allow even bigger values, which is what UTF-8 and UTF-16 are all about.
It's possibly also worth mentioning that the 'TF' stands for "transformation format", since it's a format - or encoding - for turning characters into streams of bytes (UTF-8) or 16-bit units (UTF-16) that you might receive from a file, or over a network or socket.
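To make the difference between a character and its encoded form concrete, here is a minimal sketch (the class and method names are made up for illustration) that counts how many bytes a few sample characters need in UTF-8 and in UTF-16. It assumes nothing beyond the standard `java.nio.charset.StandardCharsets` constants.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "e";                                        // plain ASCII letter
        String accented = "\u00E9";                                // e with an acute accent
        String cjk = "\u4E2D";                                     // a Chinese ideograph
        String beyond16 = new String(Character.toChars(0x1F600)); // code point above 0xFFFF

        for (String s : new String[] {ascii, accented, cjk, beyond16}) {
            System.out.printf("U+%04X: %d char(s), %d UTF-8 byte(s), %d UTF-16 byte(s)%n",
                    s.codePointAt(0), s.length(), utf8Length(s), utf16Length(s));
        }
    }
}
```

The last sample is the interesting one: a code point above 0xFFFF does not fit in a single Java `char`, so the string has length 2 even though it holds one character.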

None of which gets you any closer, but hopefully provides a bit of background.
My advice: Google some things like "Unicode", "character encoding", "UTF-8" and "UTF-16", and concentrate on the Wikipedia articles, because they're generally very good.

HIH

Winston
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16060
88
Android IntelliJ IDE Java Scala Spring
If your question is why the character code is in octal rather than decimal, then that is most likely because this way of escaping characters is a feature that Java inherited from C, which has had this feature for decades. Someone in the 1970s thought that octal character escape codes were useful, for some reason that nobody remembers anymore.

edit - the relevant section in the Java Language Specification indeed says that this comes from C:
JLS wrote:
Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.
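As an illustration of that JLS remark, the following sketch (the class and constant names are hypothetical, and this is not the code from the original question) writes the same character as an octal escape, a Unicode escape and a plain literal, and shows where the octal range ends.

```java
class OctalEscapes {
    static final char OCTAL_A = '\101';      // octal 101 is decimal 65
    static final char UNICODE_A = '\u0041';  // hexadecimal 41 is decimal 65
    static final char MAX_OCTAL = '\377';    // octal 377 is decimal 255, the largest octal escape

    // At most three digits are consumed, and three only when the first is 0 to 3,
    // so this literal is the escape \40 (a space) followed by the digit 0.
    static final String TWO_CHARS = "\400";

    public static void main(String[] args) {
        System.out.println(OCTAL_A == 'A' && UNICODE_A == 'A'); // true
        System.out.println((int) MAX_OCTAL);                    // 255
        System.out.println(TWO_CHARS.length());                 // 2
    }
}
```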

 
Campbell Ritchie
Marshal
Posts: 56541
172
Dave Tolls wrote:. . . escape character followed by an integer up to (I think) 255.
377, surely? It should be in the Java® Language Specification (=JLS).

Why are you using octal arithmetic in the first place? It has hardly been used for ages. Even that JLS link says to prefer \u1234 escapes. As Jesper says, the JLS says octal escapes are there for compatibility with older C‑like languages. I presume octal escapes were useful for characters like ß (=0x00df) or » (=0x00bb) which are included in the 0...255 range, but weren't available on many keyboards.
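The 255 versus 377 point is only a matter of number base, which a short sketch can confirm. The class and helper names below are hypothetical; the conversions use the standard `Integer.toOctalString` and `Integer.parseInt` methods, and the two sample characters are the ones mentioned above.

```java
class OctalRange {
    // Decimal value rendered as octal digits.
    static String toOctal(int value) {
        return Integer.toOctalString(value);
    }

    // Octal digits parsed back to a decimal value.
    static int fromOctal(String digits) {
        return Integer.parseInt(digits, 8);
    }

    public static void main(String[] args) {
        System.out.println(toOctal(255));       // 377
        System.out.println(fromOctal("377"));   // 255

        // Sharp s (0x00df) and the right guillemet (0x00bb) written as octal escapes.
        System.out.println('\337' == '\u00DF'); // true
        System.out.println('\273' == '\u00BB'); // true
    }
}
```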
 
Dave Tolls
Ranch Foreman
Posts: 3058
37
Campbell Ritchie wrote:
Dave Tolls wrote:. . . escape character followed by an integer up to (I think) 255.
377, surely? It should be in the Java® Language Specification (=JLS).


I was working in decimal...

;)
 
Puspender Tanwar
Ranch Hand
Posts: 499
2
Java
Thank you all.
Now I understand the point and will read through the wiki pages for deeper insight. But I noticed that I am only able to print up to \u00FF. Beyond that, for every Unicode value, the output is '?'. Why am I not able to print beyond \u00FF?
 
Puspender Tanwar
Ranch Hand
Posts: 499
2
Java
Problem solved: in Eclipse, go to Window -> Preferences -> General -> Workspace and, under Text file encoding, select UTF-16 or UTF-8.
 
Campbell Ritchie
Marshal
Posts: 56541
172
Puspender Tanwar wrote:. . . . But I noticed that I am only able to print up to \u00FF. Beyond that, for every Unicode value, the output is '?' . . . .
Where are you printing? The Windows® command line is notorious for being unable to render characters > 0x00ff, and even some “extended ASCII” characters come out oddly. For example £ is 0x00a3 but renders as ú on the command line. It has to do with the encoding used, which isn't Unicode but something beginning cp. If you haven't found anything really good to read about encodings, try Joel Spolsky.
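The '?' output can be reproduced without any console at all. The sketch below is an approximation under stated assumptions: it uses ISO-8859-1 and US-ASCII as stand-ins for a limited output encoding (a real Windows console would use one of its cp code pages instead), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    // Encode the text in the given charset and decode it again, roughly what
    // happens when output passes through a stream limited to that charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // pound sign, inside the 0 to 255 range
        String zhe = "\u0416";   // a Cyrillic letter, outside that range

        // ISO-8859-1 covers 0 to 255, so the pound sign survives but the Cyrillic letter does not.
        System.out.println(roundTrip(pound, StandardCharsets.ISO_8859_1).equals(pound)); // true
        System.out.println(roundTrip(zhe, StandardCharsets.ISO_8859_1));                 // ?

        // US-ASCII covers only 0 to 127, so even the pound sign is replaced.
        System.out.println(roundTrip(pound, StandardCharsets.US_ASCII));                 // ?

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(zhe, StandardCharsets.UTF_8).equals(zhe));          // true

        // The charset the JVM picked up from its environment.
        System.out.println(Charset.defaultCharset());
    }
}
```

Characters that the target charset cannot represent are replaced with '?' during encoding, which matches the symptom described above.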
 
Puspender Tanwar
Ranch Hand
Posts: 499
2
Java
Thanks Campbell. That's a really helpful link. Excellent explanation for a beginner.
But I have some doubts about this passage:
That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode


How can 00 48 be the same as 48 00, as stated above? As far as I know, these are two different hex numbers.
Next, the blog says that in this Unicode encoding 2 bytes are used to store a code point. Please correct me if I am wrong: 0048 is stored in 2 bytes, right? That means 00 occupies one byte and 48 occupies the other, right?
 
Paul Clapham
Sheriff
Posts: 22828
43
Eclipse IDE Firefox Browser MySQL Database
Puspender Tanwar wrote:How can 00 48 be the same as 48 00, as stated above? As far as I know, these are two different hex numbers.


But if one of them is in a big-endian representation and the other is in a little-endian representation, then they represent the same value. (You might want to google those terms to find out the distinction they are describing.)
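A small sketch can show the two byte orders side by side. It encodes the letter H (code point 0x0048) with the standard `UTF_16BE` and `UTF_16LE` charsets; the class and helper names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Endianness {
    // Render bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Bytes of the text with the most significant byte of each unit first.
    static String bigEndianHex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    // Bytes of the text with the least significant byte of each unit first.
    static String littleEndianHex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16LE));
    }

    public static void main(String[] args) {
        System.out.println(bigEndianHex("H"));    // 00 48
        System.out.println(littleEndianHex("H")); // 48 00

        // Decoding each byte sequence with its own byte order gives back the same text.
        String fromBig = new String(new byte[] {0x00, 0x48}, StandardCharsets.UTF_16BE);
        String fromLittle = new String(new byte[] {0x48, 0x00}, StandardCharsets.UTF_16LE);
        System.out.println(fromBig.equals(fromLittle)); // true
    }
}
```

So yes, 0048 occupies two bytes, one holding 00 and one holding 48; the byte order only decides which of the two is written first.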
 