I just have a small question that's bugging me regarding the API documentation on Sun's web site for the indexOf method of the String class that takes a char argument...why does the documentation have int listed as the arg data type:
public int indexOf(int ch)
Returns the index within this string of the first occurrence of the specified character. If a character with value ch occurs in the character sequence represented by this String object, then the index (in Unicode code units) of the first such occurrence is returned. For values of ch in the range from 0 to 0xFFFF (inclusive),...
ch - a character (Unicode code point).
Thanks very much in advance for your help...Catherine
In Java a char is one byte. However, some Unicode characters are encoded in two bytes. Allowing an int as input allows these characters to be passed to indexOf.
I believe I found the documentation in the Character Class API...
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value...
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters.
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points...
The methods that accept an int value support all Unicode characters, including supplementary characters.
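To make the quoted documentation concrete, here is a minimal sketch of indexOf(int) finding a supplementary character. The class name IndexOfCodePointDemo and the helper find are made up for illustration; U+1D11E (MUSICAL SYMBOL G CLEF) is just one convenient code point above U+FFFF.

```java
class IndexOfCodePointDemo {
    // Hypothetical helper: forwards an int code point to String.indexOf(int).
    static int find(String text, int codePoint) {
        return text.indexOf(codePoint);
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the BMP, so it is stored as two chars (a surrogate pair).
        int gClef = 0x1D11E;
        String text = "ab" + new String(Character.toChars(gClef)) + "c";

        System.out.println(text.length());      // 5: length counts UTF-16 code units
        System.out.println(find(text, gClef));  // 2: index of the first char of the pair
        System.out.println(find(text, 'c'));    // 4: a char argument is widened to int
    }
}
```

A value like 0x1D11E does not fit in a char at all, so it could not be passed to a method declared as indexOf(char).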
Thanks again for your help...Catherine
[ November 16, 2006: Message edited by: catherine powell ]
Ummm... I can't really see how this is true. In Java, a char has a range from 0 to 65535, which requires two bytes at minimum. It's true that, if you encode a group of chars as bytes using most common encoding schemes (ASCII, ISO-8859-1, Cp-1252, UTF-8), the most common (English-language) characters can be encoded in 1 byte per character. But that's not an absolute rule, and I think it's dangerously misleading to say that a char is a byte.
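A quick way to check the sizes involved, as a rough sketch: the class name CharSizeDemo and the helper utf8Length are invented for this example, and Character.BYTES assumes Java 8 or later.

```java
import java.nio.charset.StandardCharsets;

class CharSizeDemo {
    // Hypothetical helper: number of bytes a string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(Character.BYTES);            // 2: a char is a 16-bit value
        System.out.println((int) Character.MAX_VALUE);  // 65535
        System.out.println(utf8Length("A"));            // 1: ASCII letters take one byte in UTF-8
        System.out.println(utf8Length("\u20AC"));       // 3: the euro sign takes three bytes in UTF-8
    }
}
```

The byte count belongs to the encoding, not to the char itself, which is always 16 bits in memory.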
So, why do they use int rather than char as the parameter type here? (For String.indexOf() as well as various other methods scattered in the standard API.) I think the reason is convenience, given that int is the "default" type for most expressions, unless something in the expression forces the expression to be float or double instead. Anytime you perform simple arithmetic, or even just write a plain literal like 1 or 42, Java assumes you mean an int. And if a method is expecting a char rather than an int, Java gets pissy and balks until you fix it. I think that the decision to define indexOf(int) rather than indexOf(char) is motivated by nothing more than the desire to save users from the mild annoyance of having to cast the result of an expression from int to char. Which is not a particularly compelling reason, in my opinion, but it's the best one I can think of.
So: chars in Java require two bytes. But Java often tends to assume that computations will need four bytes, and to accommodate this, indexOf() and other methods accept parameters of int type.
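The convenience argument can be sketched like this. Note that indexOfChar is a hypothetical char-taking variant written only for comparison; it is not part of the String API, and the class and method names are made up.

```java
class IntParameterDemo {
    // Hypothetical alternative signature that takes a char, for comparison only.
    static int indexOfChar(String text, char ch) {
        return text.indexOf(ch);
    }

    // base + offset is promoted to int, so it can be handed to indexOf(int) as-is.
    static int findShifted(String text, char base, int offset) {
        return text.indexOf(base + offset);
    }

    // With a char parameter, the same int expression needs an explicit narrowing cast.
    static int findShiftedWithCast(String text, char base, int offset) {
        return indexOfChar(text, (char) (base + offset));
    }

    public static void main(String[] args) {
        System.out.println(findShifted("abc", 'a', 1));          // 1: 'a' + 1 is 'b'
        System.out.println(findShiftedWithCast("abc", 'a', 1));  // 1: same result, with a cast
    }
}
```

Dropping the cast in findShiftedWithCast would be a compile error, because an int expression is not implicitly narrowed to char in a method call.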
[ November 17, 2006: Message edited by: Jim Yingst ]
True. A char is still in the range 0-65535, but characters (code points) can have higher values. And in fact indexOf() does now make use of this, allowing you to pass in code point values larger than 65535. The interesting thing though is that indexOf(int) was the signature for this method from way back at JDK 1.0 (and probably earlier). I'm pretty sure that back then they weren't thinking about the possibility that Unicode might need to be expanded to wider ranges - if they were, then there are several other methods that should have been defined differently. Which is why String has now added methods like codePointAt() to supplement charAt().
So, it looks to me like the original reason to use indexOf(int) rather than indexOf(char) had nothing to do with the possibility of code points larger than 65535. But now, it turns out that they got lucky, because in fact it is possible to have higher values for characters - though not for chars.
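For the charAt() versus codePointAt() difference mentioned above, here is a small sketch. The class name CodePointAtDemo and the helper fromCodePoint are invented for illustration, and U+1D11E is again just an example of a character above 65535.

```java
class CodePointAtDemo {
    // Hypothetical helper: builds a String holding a single code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1D11E) + "x";

        // charAt sees only one 16-bit code unit: the high surrogate.
        System.out.println(Integer.toHexString(s.charAt(0)));       // d834
        // codePointAt combines the surrogate pair into the full code point.
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d11e

        System.out.println(s.length());                       // 3 chars
        System.out.println(s.codePointCount(0, s.length()));  // 2 characters (code points)
    }
}
```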