
unicode query


The question is:

Which of the following are not valid character constants? [8]
Select any two.
(a) char c = '\u00001';
(b) char c = '\101';
(c) char c = 65;
(d) char c = '\1001';

and the answers are (a) and (d). The thing is that I am not clear on the Unicode notation and this type of character constant. Please suggest a link where I can read and understand the explanation of these answers, and also "Unicode".
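One quick way to convince yourself is to compile the options. Here is a small sketch (the class name is made up) with the two valid forms; the invalid ones are described in comments because they won't compile:

```java
public class CharConstantCheck {
    public static void main(String[] args) {
        char b = '\101'; // (b) legal: octal escape, 101 octal = 65 decimal = 'A'
        char c = 65;     // (c) legal: an int constant in the range 0..65535 fits in a char
        System.out.println(b + " " + c); // prints: A A
        // (a) and (d) do not compile: a Unicode escape is exactly four hex
        // digits and an octal escape is at most three digits, so each of
        // those literals ends up holding two characters instead of one.
    }
}
```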
The word "Unicode" literally means "one code".

I believe the history of character encoding most relevant to modern computers begins with the teletype.

A standard code was agreed upon called ASCII:

American Standard Code for Information Interchange (ASCII)

ASCII was only a seven-bit code. When computers came along, manufacturers standardized on ASCII for the first seven bits of a byte, but since digital electronics is based on base-2 math, a byte is eight bits (a power of 2), and companies didn't want to waste the bit left over. Different computer companies used that eighth bit in different ways.

This difference led to having many different "code pages".

Unicode was proposed as a new standard to replace ASCII and *all* of the many code pages that existed for all of the symbols in all of the languages of the world. Originally, Unicode was a 16-bit standard, yielding 65,536 (64K) possible characters, which was thought to be big enough to encode everything.


Java supported the early Unicode Standard from the beginning so, in Java, a char is 16-bits.

In Java, a char mixes freely with byte, short, and int in arithmetic (everything widens to int), but a char is a 16-bit unsigned integer while a short is a 16-bit signed integer, so assigning between char and short requires an explicit cast.
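A short runnable sketch of that signed/unsigned difference (class name is just for illustration):

```java
public class SignednessDemo {
    public static void main(String[] args) {
        char c = '\uFFFF';        // maximum char value: 65535 (unsigned 16 bits)
        short s = (short) 0xFFFF; // the same bit pattern interpreted as signed: -1
        System.out.println((int) c); // prints: 65535
        System.out.println(s);       // prints: -1
        // Casting between them reinterprets the same 16 bits:
        char d = (char) s;
        System.out.println((int) d); // prints: 65535 again
    }
}
```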

char c1 = 65; // Legal. Assigns a positive integer constant in range to a char: 'A'.
char c2 = '\u0066'; // Legal. The Unicode escape \u0066 is hex 66 = decimal 102, a lowercase 'f'.
char c3 = 'C'; // Legal. Assigns c3 to be a capital 'C', using a plain char literal.
char c4 = '\u00067'; // *NOT* legal. A Unicode escape is exactly four hex digits, so this parses as '\u0006' followed by '7': two characters in one char literal.

char c5 = '\101'; // Legal. An octal escape: 101 octal = 65 decimal = 'A'.
char c6 = '\u0101'; // Legal, but a different character from c5: U+0101 is 'ā' (a with macron).

char c7 = '\1001'; // *NOT* legal. An octal escape is at most three digits, so this parses as '\100' followed by '1': two characters in one char literal.
char c8 = '\u1001'; // Legal. A proper four-hex-digit Unicode escape (U+1001).

Unicode turned out not to be as simple as having a single encoding for all characters. For example, developers didn't want to use 16 bits per character to transmit text over the Internet when 8-bit encodings were half the size. So variations of Unicode exist, notably UTF-8, in which most common characters take a single byte but others take two, three, or four bytes.
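You can see the variable-length encoding directly by asking Java for the UTF-8 bytes of a few strings. This sketch uses java.nio.charset.StandardCharsets, which is available from Java 7 on:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // ASCII characters take 1 byte each in UTF-8 ...
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // prints: 1
        // ... accented Latin letters take 2 ...
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // prints: 2
        // ... and many Asian characters take 3.
        System.out.println("中".getBytes(StandardCharsets.UTF_8).length); // prints: 3
        // By contrast, UTF-16 spends 2 bytes on every one of these:
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length); // prints: 2
    }
}
```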



In my opinion, Unicode is good because it does simplify things over the older standards, where you could only use one code page at a time and were therefore limited to 256 characters. Unicode 1.0 through 3.0 are much better, allowing up to 64K characters to be encoded in a 16-bit char.

Later versions of Unicode (3.1 and beyond) broke the 16-bit limit, but that is only an issue for relatively rare characters, such as some Asian-language characters, that didn't fit into the original 16-bit range.

You can use the char type for any character in that original 16-bit range (the Basic Multilingual Plane).

If you want to go beyond the 16-bit limit of char, you can use an int to hold a code point; an int is 32 bits, and Unicode code points need at most 21 of them (the highest code point is U+10FFFF).
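Java's Character class has had int-based code point methods since Java 5. A sketch with a character outside the 16-bit range (the musical G clef symbol, U+1D11E):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF: does not fit in a char
        // One code point, but it needs two chars (a surrogate pair) in a String:
        String s = new String(Character.toChars(clef));
        System.out.println(s.length());                      // prints: 2  (chars)
        System.out.println(s.codePointCount(0, s.length())); // prints: 1  (code point)
        System.out.println(Character.charCount(clef));       // prints: 2
        System.out.println(Character.MAX_CODE_POINT);        // prints: 1114111 (0x10FFFF)
    }
}
```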

So Unicode literally means "one code", but it is not quite one code. It is far better than having innumerable code pages that you could only work with one at a time, but Unicode itself can be encoded in a few different ways (such as UTF-8, UTF-16, and UTF-32).
[ February 28, 2008: Message edited by: Kaydell Leavitt ]
If you're new to the world of Unicode you might start by reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) as an introduction.

And if you're not confused enough yet, have a read of this post and this post, both of which explain what happens when one character isn't the same as one char. That's advanced stuff, but it's good to keep in the back of your head that this can happen.
[ February 28, 2008: Message edited by: Ulf Dittmer ]