
What is a Unicode code unit and a Unicode code point?

Varuna Seneviratna
Ranch Hand
Posts: 170
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding

The above is from the API specification describing the Character class. In this description, is "Unicode code point" used to indicate characters like "A", "B", "C"?

"Unicode code unit" is used to indicate 16-bit char values; does that also mean characters like "A", "B", "C"? A char value also denotes a character, doesn't it?

I was led to the Character class API documentation by the description of length() in the String class. I want to understand what a Unicode code unit is. The length returned by length() is equal to the number of code units, but what is a code unit? Is it a character?

public int length()
Returns the length of this string. The length is equal to the number of Unicode code units in the string.

Specified by:
length in interface CharSequence

Ulf Dittmer
Posts: 42972
You might want to start by reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for some background information.

As far as Java was concerned, a "char" was a Unicode character (or code point) up to Java 1.4. But Java 5 adopted a newer version of Unicode (4.0, which assigns characters outside the 16-bit range), and now one Java char is no longer necessarily the same as a Unicode code point. Luckily, the code points that take up more than one char are rarely used, but you still need to be aware of them. In particular, String.length() may not return the actual number of characters in a string. John O'Conner and Tom White blogged about this.
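A small sketch of what that means in practice (U+1D11E, the musical G clef, is just one illustrative character outside the 16-bit range):

```java
public class LengthDemo {
    public static void main(String[] args) {
        // U+1D11E is one Unicode character, but in Java it needs
        // two chars (the surrogate pair \uD834\uDD1E)
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length()); // prints 2, not 1
    }
}
```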
Gamini Sirisena
Ranch Hand
Posts: 378
Allow me to add a few more inputs...

A Unicode code unit is the basic unit of storage used by a particular Unicode encoding.
For example, UTF-8 has a code unit size of 8 bits, UTF-16 has 16,
and UTF-32 has 32.
To represent a character (i.e. a code point, which is a unique integer assigned
to each character), one or more code units may be
required, depending on the encoding.
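You can see this from Java itself (a sketch assuming Java 6+, where String.getBytes accepts a Charset and the JRE ships a UTF-32 charset; the euro sign is just an example code point):

```java
import java.nio.charset.Charset;

public class CodeUnitSizes {
    public static void main(String[] args) {
        String euro = "\u20AC"; // EURO SIGN: a single code point, U+20AC

        // The same code point takes a different number of code units per encoding:
        System.out.println(euro.getBytes(Charset.forName("UTF-8")).length);        // 3 eight-bit code units
        System.out.println(euro.getBytes(Charset.forName("UTF-16BE")).length / 2); // 1 sixteen-bit code unit
        System.out.println(euro.getBytes(Charset.forName("UTF-32BE")).length / 4); // 1 thirty-two-bit code unit
    }
}
```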

Java uses UTF-16, which means the code unit size is 16 bits.
Unicode has over 1 million code points (10FFFF+1 in hex),
but 16 bits can represent only FFFF+1 code points.
(This range is called the BMP, the Basic Multilingual Plane;
it contains all the commonly used characters in the world and some more.)

So to represent code points outside the BMP, the UTF-16 encoding specifies
surrogate pairs. For this, two special ranges are reserved within the BMP.
In UTF-16, any character outside the BMP is represented by two 16-bit code units
taken from these ranges.
(In fact, surrogate code units are defined only for UTF-16.)
Now it should be clear that certain characters may require two code units in UTF-16.
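The Character class exposes this directly; a small sketch (U+10400, DESERET CAPITAL LETTER LONG I, is just one example of a supplementary character):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10400; // a code point outside the BMP

        // Character.toChars encodes one code point into its UTF-16 code units
        char[] units = Character.toChars(codePoint);
        System.out.println(units.length);                        // 2 code units
        System.out.println(Character.isHighSurrogate(units[0])); // true: in U+D800..U+DBFF
        System.out.println(Character.isLowSurrogate(units[1]));  // true: in U+DC00..U+DFFF
    }
}
```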

So counting 16-bit code units will not always yield the correct number of characters.
String.length() returns the number of code units in the String.

Since 1.5 you can use codePointCount(int beginIndex, int endIndex) to get
the number of characters.
It will count a surrogate pair as one character.
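Putting the two together (a sketch assuming Java 5+; the string mixes an ASCII letter with the supplementary character U+10400, written here as its surrogate pair):

```java
public class CodePointCountDemo {
    public static void main(String[] args) {
        // "a" plus U+10400 (one character, encoded as two code units)
        String s = "a\uD801\uDC00";

        System.out.println(s.length());                      // 3 -- code units
        System.out.println(s.codePointCount(0, s.length())); // 2 -- code points (characters)
    }
}
```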
[ November 27, 2008: Message edited by: Gamini Sirisena ]