string UTF8

abalfazl hossein
Ranch Hand
Posts: 635

Output
-39
-123

-39
-120
-39
-124
-40

mim is http://www.fileformat.info/info/unicode/char/645/index.htm

11011001:10000101

Could someone explain how -39 becomes 11011001?

Ralph Cook
Ranch Hand
Posts: 479
Integers on modern binary computers represent negative numbers in "2's complement"; you create the 2's complement of a number by inverting all the bits (0 becomes 1 and 1 becomes 0) and adding one.

So 39 (decimal) is 27 (hex) is 0010 0111 binary.
Invert all the bits to get 1101 1000
and add one to get 1101 1001

So 11011001 represents -39 using standard 2's complement binary representation.
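
The steps above can be checked with a small Java sketch. The class and method names (TwosComplementDemo, toBits8, negate) are made up for illustration; only Integer.toBinaryString and String.format come from the standard library.

```java
class TwosComplementDemo {
    // Returns the lowest 8 bits of value as a zero-padded binary string.
    static String toBits8(int value) {
        String bits = Integer.toBinaryString(value & 0xFF); // drop everything above bit 7
        return String.format("%8s", bits).replace(' ', '0'); // left-pad to 8 digits
    }

    // Negates a number the 2's complement way: invert all bits, then add one.
    static int negate(int value) {
        return ~value + 1;
    }

    public static void main(String[] args) {
        System.out.println(toBits8(39));          // 00100111
        System.out.println(toBits8(~39));         // 11011000 (all bits inverted)
        System.out.println(toBits8(negate(39)));  // 11011001 (plus one)
        System.out.println((byte) 0b11011001);    // -39 when read back as a byte
    }
}
```

The last line goes the other way: the bit pattern 11011001 stored in a byte prints as -39.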

rc

abalfazl hossein
Ranch Hand
Posts: 635

11111111111111111111111111011001
11111111111111111111111110000101
11111111111111111111111111011001
11111111111111111111111110001000
11111111111111111111111111011001
11111111111111111111111110000100
11111111111111111111111111011000
11111111111111111111111110100111
11111111111111111111111111011001
11111111111111111111111110000110
11111111111111111111111111011000
11111111111111111111111110100111

Does it mean that \u0645 => 11111111111111111111111111011001?

Does this mean every Unicode character occupies four bytes in memory? Or, depending on the character, can it be one, two, or more bytes?

UTF8 uses two bytes for this character:

http://www.fileformat.info/info/unicode/char/645/index.htm

UTF-8 (binary) 11011001:10000101

Rob Spoor
Sheriff
Posts: 20661
abalfazl hossein wrote: Does it mean that \u0645 => 11111111111111111111111111011001?

That is the first UTF-8 byte (-39) after it has been widened to an int, yes.

Does this mean every Unicode character occupies four bytes in memory? Or, depending on the character, can it be one, two, or more bytes?

Nope, just two. That's how char is defined. When encoded it may take up only one, but the char data type is always two bytes.

You're seeing four because you're not printing chars. You're trying to print bytes, but because you pass these to Integer.toBinaryString they get widened to ints.
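
To make the difference between the char itself and its encoded bytes concrete, here is a minimal sketch. The class name CharSizeDemo and the helper charBits are hypothetical; Character.BYTES assumes Java 8 or later.

```java
class CharSizeDemo {
    // Returns the 16 bits of a char as a zero-padded binary string.
    static String charBits(char c) {
        String bits = Integer.toBinaryString(c); // char widens to a non-negative int
        return String.format("%16s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        char mim = '\u0645';
        System.out.println(Character.BYTES);            // 2: a char is always two bytes
        System.out.println(Integer.toHexString(mim));   // 645
        System.out.println(charBits(mim));              // 0000011001000101
    }
}
```

Because char is unsigned, widening it to int never produces the leading run of 1 bits that the widened bytes show.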

abalfazl hossein
Ranch Hand
Posts: 635
Can't UTF-8 use 4 bytes?

Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15480
The UTF-8 encoding is a variable-length encoding; characters take up between one and four bytes.
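
A quick way to see the variable length is to encode a few strings and count the bytes. This is only a sketch; the class name Utf8LengthDemo and the helper utf8Length are invented for the example, and the sample characters are my own picks.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));             // 1 (ASCII)
        System.out.println(utf8Length("\u0645"));        // 2 (Arabic letter mim)
        System.out.println(utf8Length("\u20AC"));        // 3 (euro sign)
        System.out.println(utf8Length("\uD83D\uDE00"));  // 4 (U+1F600, outside the basic plane)
    }
}
```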

abalfazl hossein wrote: Does it mean that \u0645 => 11111111111111111111111111011001?

Does this mean every Unicode character occupies four bytes in memory? Or, depending on the character, can it be one, two, or more bytes?

No.

Note that \u0645 is not a UTF-8 code; it's a Unicode code point (U+0645), which Java stores as one two-byte char. In UTF-8, characters may be encoded with completely different byte values than their Unicode code points.

Apparently \u0645 is encoded in UTF-8 with two bytes: -39, -123, which have bit patterns: 11011001, 10000101. Note that these are not the same as the two Unicode code point bytes (0x06 and 0x45) because UTF-8 is a different encoding than two-byte Unicode code points.
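
Where -39 and -123 come from can be worked out by hand. Code points from U+0080 to U+07FF are encoded in UTF-8 as 110xxxxx 10xxxxxx, where the x bits are the 11 bits of the code point. The sketch below does that for U+0645 and compares the result with what getBytes produces; the class name Utf8TwoByteDemo and the method encodeTwoByte are hypothetical helpers, not library API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TwoByteDemo {
    // Hand-encodes a code point in the range U+0080..U+07FF as 110xxxxx 10xxxxxx.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a two-byte UTF-8 code point");
        }
        byte lead = (byte) (0xC0 | (codePoint >> 6));     // 110 + top 5 bits
        byte trail = (byte) (0x80 | (codePoint & 0x3F));  // 10 + low 6 bits
        return new byte[] { lead, trail };
    }

    public static void main(String[] args) {
        byte[] manual = encodeTwoByte(0x0645);
        byte[] library = "\u0645".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(manual));          // [-39, -123]
        System.out.println(Arrays.equals(manual, library));   // true
    }
}
```

0x0645 is 11001 000101 in binary, so the lead byte is 110 11001 (0xD9, or -39 as a signed byte) and the trail byte is 10 000101 (0x85, or -123).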

When you convert an 8-bit byte containing -39 (11011001) to a 32-bit int, you'll get 11111111111111111111111111011001 which is also -39, but in 32 bits instead of 8 bits.

So the 11111111111111111111111111011001 is just the first byte of \u0645 in UTF-8 encoding, converted to a 32-bit int.

Ulf Dittmer
Rancher
Posts: 42968
The original UTF-8 design allowed up to 6 bytes per codepoint, but since RFC 3629 it is restricted to at most 4 bytes, which is enough for every Unicode codepoint. In memory Java uses UTF-16, which uses 2 bytes per code unit (and thus maps nicely to the char type) ... until you consider the subject of Unicode codepoints beyond the basic plane - which do not fit into 16 bits and need two chars. The JavaIoFaq links to a couple of articles on that subject, and you should read http://www.joelonsoftware.com/articles/Unicode.html.
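
The point about codepoints beyond the basic plane can be seen with a short sketch. The class name SupplementaryDemo and its helper methods are made up here, and U+1F600 is just an arbitrary example of a supplementary code point.

```java
class SupplementaryDemo {
    // Builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Number of Unicode code points in the string (not the number of chars).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String mim = fromCodePoint(0x0645);     // inside the basic plane
        String face = fromCodePoint(0x1F600);   // beyond the basic plane

        System.out.println(mim.length() + " char(s), " + codePoints(mim) + " code point(s)");
        System.out.println(face.length() + " char(s), " + codePoints(face) + " code point(s)");
        System.out.println(Character.isHighSurrogate(face.charAt(0))); // true
    }
}
```

The second string has length 2 even though it holds a single code point, because UTF-16 stores it as a surrogate pair.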

Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15480
Ulf, the Wikipedia page mentions 1 to 4 bytes in the intro, but later on describes sequences of up to 6 bytes. The intro is not an error, though: the 6-byte forms belong to the original UTF-8 design, and the current standard (RFC 3629) limits UTF-8 to 4 bytes.

abalfazl hossein
Ranch Hand
Posts: 635

In this line myBytes[i] is type cast to int. The last byte of the integer is used to store the char. Right?

Ulf Dittmer
Rancher
Posts: 42968
73
abalfazl hossein wrote: The last byte of the integer is used to store the char. Right?

Campbell Ritchie
Sheriff
Posts: 50175
I think this is too difficult for "beginning", so I shall move this thread.

abalfazl hossein
Ranch Hand
Posts: 635

myBytes[i] is converted to int because toBinaryString accepts an int argument.

Is it a type cast?

Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15480
The byte myBytes[i] is implicitly converted to an int (so, from 8 bits to 32 bits, with a widening primitive conversion) because toBinaryString() takes an int and not a byte. This is done by sign extension, which means that the extra bits on the left are filled with the leftmost bit (the sign bit) of the original byte.

For example: 11011001 -> leftmost bit is a 1, so when this is converted to a 32-bit int you get 11111111 11111111 11111111 11011001
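
A small sketch of that widening, plus the usual way to get just the 8 bits of the byte by masking with 0xFF before the call. The class name SignExtensionDemo and the methods widened and masked are hypothetical names for this example.

```java
class SignExtensionDemo {
    // The byte is implicitly widened to int with sign extension before the call.
    static String widened(byte b) {
        return Integer.toBinaryString(b);
    }

    // Masking with 0xFF clears the 24 sign-extended bits, leaving only the byte.
    static String masked(byte b) {
        String bits = Integer.toBinaryString(b & 0xFF);
        return String.format("%8s", bits).replace(' ', '0'); // left-pad to 8 digits
    }

    public static void main(String[] args) {
        byte first = (byte) 0xD9; // -39, the first UTF-8 byte of U+0645
        System.out.println(first);           // -39
        System.out.println(widened(first));  // 11111111111111111111111111011001
        System.out.println(masked(first));   // 11011001
    }
}
```

Both strings describe the same value, -39; the masked form simply leaves out the copies of the sign bit.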

But what do you mean with:
abalfazl hossein wrote: The last byte of the integer is used to store the char.

This line of code doesn't do anything with a char.

Campbell Ritchie
Sheriff
Posts: 50175
abalfazl hossein wrote: . . . Is it a type cast?
No.