
int from char mapping

 
Abigail Decan
Ranch Hand
Posts: 65
When casting a char which is read from a file to an int, can I assume that the mapping used will be ASCII?
I've learned that Unicode uses the ASCII mappings for the characters that overlap.

Are there any other possibilities for the int value of a character?

I still have trouble understanding character encodings.
 
Paul Clapham
Sheriff
Posts: 22844
You mean like this?



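A minimal sketch of the cast being described, assuming variables named input and other as in the explanation below:

```java
public class CharToInt {
    public static void main(String[] args) {
        char input = 'A';  // imagine this char was read from a file
        int other = input; // widening conversion: char to int
        System.out.println(other); // 65, the Unicode code point of 'A'
    }
}
```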
You can assume that the "other" variable now contains the Unicode code point of the "input" variable. If the char happened to be in the ASCII subset of Unicode (which is the first 128 of Unicode's 65536 code points) then yes, you get the ASCII value of that char. Otherwise there is no ASCII value of it. And so in answer to your other question, yes, there are lots of other possibilities beyond ASCII. Don't make the mistake of assuming that ASCII is almost everything and Unicode is only for weird stuff used by people you never heard of. It's the other way around, almost; Unicode is everything and ASCII is a small subset of it.

And character encodings have nothing to do with casting chars to ints (or almost nothing). The character encodings would have come into play when you read the char from the file; since a file is a sequence of bytes, not chars, the encoding would be used to convert that sequence of bytes into a sequence of chars.

(I said "almost nothing" there because it's actually more complicated than that. Unicode really has more than 65536 code points, so there's a mapping used to convert the code points beyond 65535 into int values. But those code points really are weird stuff used by people you never heard of, so don't pay any attention to that until you have the basics sorted out.)
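For the curious, a sketch of that mapping: Java stores a code point beyond 65535 as a "surrogate pair" of two chars, and String.codePointAt recovers the real code point.

```java
public class Supplementary {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00";            // U+1F600, a code point beyond 65535
        System.out.println(s.length());       // 2: stored as two char code units
        System.out.println(s.codePointAt(0)); // 128512 (0x1F600): the real code point
    }
}
```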

 
Abigail Decan
Ranch Hand
Posts: 65
So there are actually two numerical values associated with a character: its Unicode code point and the encoded value in the file?
There's only one code point per character, but depending on the file, various encoding methods can be used to turn that code point into bytes.

If I got this right, I think I understand better now.

Thanks
 
Paul Clapham
Sheriff
Posts: 22844
Yes. A Java character is a Unicode character, so it has a unique code point. But when you write that character to a file, which is made up of bytes, then one of the many character encodings is used to do the mapping. So this mapping is not necessarily unique.
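A sketch of that non-uniqueness, running the same character through two common encodings:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "é"; // U+00E9, one Unicode code point
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);   // 2: UTF-8 uses two bytes for it
        System.out.println(latin1.length); // 1: ISO-8859-1 uses one byte
    }
}
```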
 
Campbell Ritchie
Marshal
Posts: 56600
Do you mean, can you cast like this and get the arithmetically correct answer? Will you get 42 there?

No.
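A sketch of why the answer is no, assuming the example multiplied the digit characters '6' and '7': arithmetic on chars uses their UTF-16 values, not the digits they depict.

```java
public class CharArithmetic {
    public static void main(String[] args) {
        int answer = '6' * '7';     // uses the char values 54 and 55
        System.out.println(answer); // 2970, not 42
        // Subtracting '0' converts a digit char to its numeric value.
        int digits = ('6' - '0') * ('7' - '0');
        System.out.println(digits); // 42
    }
}
```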
 
Winston Gutkowski
Bartender
Posts: 10575
Abigail Decan wrote:when casting a char which is read from a file to an int, can i assume that the mapping used will be ASCII?

No. That's the responsibility of the Reader (see java.io.Reader or, possibly better, InputStreamReader for full details).

A char is just that: a Java character. It has no knowledge of how it was created, which in this case is the important part.

Winston
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16060
Converting between bytes and characters is done via a character encoding. A character encoding is what defines what sequence of numbers (bytes) represents what character. ASCII, for example, is a character encoding, which specifies that 65 means A, etc.

The ASCII character encoding is very limited, because it defines only 128 characters - far too few to be able to encode all the different characters that are used around the world. So, in the course of time, people came up with a whole list of other character encodings. With some of these encodings, a character is represented by two bytes, or even by a variable number of bytes.

Unicode is a standard way of dealing with text. It defines a family of character encodings, such as UTF-8 and UTF-16, to encode characters. UTF-8 is a variable-length encoding in which a character takes up between one and four bytes.

Internally, Java represents text using UTF-16: a char is a single two-byte UTF-16 code unit (characters beyond U+FFFF take two chars). If you directly cast a char to an int, you get the UTF-16 value of that code unit.
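For example (the particular characters here are just illustrations):

```java
public class Utf16Cast {
    public static void main(String[] args) {
        char a = 'A';                   // U+0041, inside the ASCII subset
        System.out.println((int) a);    // 65: here the UTF-16 value equals the ASCII value
        char euro = '\u20AC';           // '€', outside ASCII
        System.out.println((int) euro); // 8364: there is no ASCII value for it
    }
}
```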

Java has two kinds of classes for doing I/O:

Streams (InputStream and OutputStream) are for reading and writing bytes.

Readers and Writers (for example, FileReader, PrintWriter) are for reading and writing characters (text). They decode and encode bytes to and from characters, using a character encoding.

Some Readers and Writers allow you to specify the character encoding that should be used. For example, InputStreamReader has constructors that allow you to specify the encoding.

If you don't specify the encoding, then Java will use whatever is the default character encoding for your system.
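A sketch of specifying the encoding explicitly rather than relying on the platform default; a ByteArrayInputStream stands in for a file here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        byte[] bytes = "é".getBytes(StandardCharsets.UTF_8); // two bytes "on disk"
        InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        int ch = reader.read();  // the Reader decodes the two bytes into one char
        System.out.println(ch);  // 233: the code point of 'é'
        reader.close();
    }
}
```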
 