
# string UTF8

Ranch Hand
Posts: 635

Output

-39
-123

-39
-120
-39
-124
-40

The character is mim (U+0645): http://www.fileformat.info/info/unicode/char/645/index.htm

Its UTF-8 encoding in binary is 11011001:10000101

Could someone explain how -39 corresponds to 11011001?

Ranch Hand
Posts: 479
Integers on modern binary computers represent negative numbers in two's complement. You form the two's complement of a number by inverting all the bits and adding one.

So 39 (decimal) is 27 (hex), which is 0010 0111 in binary.
Invert all the bits to get 1101 1000,
then add one to get 1101 1001.

So 11011001 represents -39 in the standard 8-bit two's complement representation.
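
The invert-and-add-one steps above can be checked with a few lines of Java. This is only a minimal sketch; the class name and the helpers `toBits8` and `negate8` are made up for illustration and are not from the original code.

```java
// Minimal sketch: negate 39 by hand in 8 bits (invert, then add one).
class TwosComplementDemo {
    // Hypothetical helper: render the low 8 bits of a value as an 8-digit binary string.
    static String toBits8(int value) {
        String bits = Integer.toBinaryString(value & 0xFF);
        // Left-pad with zeros to a fixed width of 8.
        return String.format("%8s", bits).replace(' ', '0');
    }

    // Hypothetical helper: invert all bits and add one, keeping only the low 8 bits.
    static int negate8(int value) {
        return (~value + 1) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(toBits8(39));          // 00100111
        System.out.println(toBits8(~39));         // 11011000
        System.out.println(toBits8(negate8(39))); // 11011001
        System.out.println((byte) negate8(39));   // -39
    }
}
```

Casting the 8-bit pattern back to `byte` gives -39, which is the value the original output showed.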

rc

abalfazl hossein
Ranch Hand
Posts: 635

11111111111111111111111111011001
11111111111111111111111110000101
11111111111111111111111111011001
11111111111111111111111110001000
11111111111111111111111111011001
11111111111111111111111110000100
11111111111111111111111111011000
11111111111111111111111110100111
11111111111111111111111111011001
11111111111111111111111110000110
11111111111111111111111111011000
11111111111111111111111110100111

Does it mean that u0645=>11111111111111111111111111011001

Does this mean every Unicode character occupies four bytes in memory? Or does it depend on the character, so that it may take one, two, or more bytes?

UTF-8 uses two bytes for this character:

http://www.fileformat.info/info/unicode/char/645/index.htm

UTF-8 (binary) 11011001:10000101

Sheriff
Posts: 22701
129

abalfazl hossein wrote:Does it mean that u0645=>11111111111111111111111111011001

When cast to an int, yes.

Does this mean every Unicode character occupies four bytes in memory? Or does it depend on the character, so that it may take one, two, or more bytes?

Nope, just two. That's how char is defined. When encoded it may take up only one, but the char data type is always two bytes.

You're seeing four because you're not printing chars. You're trying to print bytes, but because you pass these to Integer.toBinaryString they get widened to ints.
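
A common way to see only the eight bits of each byte is to mask with `0xFF` before calling `Integer.toBinaryString`. The sketch below assumes the bytes came from something like `"\u0645".getBytes(StandardCharsets.UTF_8)`; the original code is not shown in the thread, so the class and helper names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch: compare the sign-extended output with the masked 8-bit output.
class ByteBitsDemo {
    // The byte is widened to a (possibly negative) int, so up to 32 digits are printed.
    static String widened(byte b) {
        return Integer.toBinaryString(b);
    }

    // Masking with 0xFF keeps only the low 8 bits, giving a value in 0..255.
    static String masked(byte b) {
        return Integer.toBinaryString(b & 0xFF);
    }

    public static void main(String[] args) {
        byte[] myBytes = "\u0645".getBytes(StandardCharsets.UTF_8);
        for (byte b : myBytes) {
            System.out.println(b + " widened: " + widened(b) + " masked: " + masked(b));
        }
    }
}
```

Note that `toBinaryString` drops leading zeros, so bytes below 128 would print fewer than eight digits unless padded.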

abalfazl hossein
Ranch Hand
Posts: 635
Can't UTF-8 use 4 bytes?

Java Cowboy
Posts: 16084
88
The UTF-8 encoding is a variable-length encoding; characters take up between one and four bytes.

abalfazl hossein wrote:Does it mean that u0645=>11111111111111111111111111011001

Does this mean every Unicode character occupies four bytes in memory? Or does it depend on the character, so that it may take one, two, or more bytes?

No.

Note that \u0645 is not a UTF-8 code; it's a Unicode code point, written here as a 16-bit (two-byte) value. In UTF-8, characters may be encoded with completely different bytes than the two bytes of that 16-bit value.

Apparently \u0645 is encoded in UTF-8 with two bytes: -39 and -123, which have the bit patterns 11011001 and 10000101. Note that these are not the same as the two bytes of the code point value (0x06 and 0x45), because UTF-8 is a different encoding.

When you convert an 8-bit byte containing -39 (11011001) to a 32-bit int, you'll get 11111111111111111111111111011001 which is also -39, but in 32 bits instead of 8 bits.

So the 11111111111111111111111111011001 is just the first byte of \u0645 in UTF-8 encoding, converted to a 32-bit int.
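
To make the relation between the code point and its two UTF-8 bytes concrete, here is a rough sketch that builds the two-byte form by hand (lead byte `110xxxxx`, continuation byte `10xxxxxx`) and compares it with what `String.getBytes` produces. The helper `encodeTwoByte` is hypothetical and only covers the range U+0080 to U+07FF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: hand-encode a code point in U+0080..U+07FF as two UTF-8 bytes.
class Utf8TwoByteDemo {
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a two-byte UTF-8 code point");
        }
        // Lead byte: 110 followed by the top 5 of the 11 payload bits.
        byte lead = (byte) (0xC0 | (codePoint >> 6));
        // Continuation byte: 10 followed by the low 6 payload bits.
        byte cont = (byte) (0x80 | (codePoint & 0x3F));
        return new byte[] { lead, cont };
    }

    public static void main(String[] args) {
        byte[] byHand = encodeTwoByte(0x0645);
        byte[] byLibrary = "\u0645".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(byHand));           // [-39, -123]
        System.out.println(Arrays.equals(byHand, byLibrary));  // true
    }
}
```

For U+0645 the 11 payload bits are 11001 000101, which gives 11011001 and 10000101, the same bit patterns quoted above.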

Rancher
Posts: 43028
76
As originally designed, UTF-8 could use up to 6 bytes per code point; the current standard (RFC 3629) limits it to 4. In memory, Java uses UTF-16, which uses 2 bytes per code unit (and thus maps nicely to the char type) ... until you consider Unicode code points beyond the Basic Multilingual Plane, which do not fit into 16 bits. The JavaIoFaq links to a couple of articles on that subject, and you should read http://www.joelonsoftware.com/articles/Unicode.html.
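
As a small illustration of the "beyond the basic plane" case, the sketch below compares mim (U+0645) with a supplementary code point. U+1F600 is just an example picked for this sketch, not something from the thread, and the helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch: char count versus code point count versus UTF-8 byte count.
class SupplementaryDemo {
    // Build a String from a single code point (one or two chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String mim = fromCodePoint(0x0645);
        String beyondBmp = fromCodePoint(0x1F600);

        System.out.println(mim.length());        // 1 char
        System.out.println(utf8Length(mim));     // 2 bytes in UTF-8

        System.out.println(beyondBmp.length());  // 2 chars (a surrogate pair)
        System.out.println(beyondBmp.codePointCount(0, beyondBmp.length())); // 1 code point
        System.out.println(utf8Length(beyondBmp)); // 4 bytes in UTF-8
    }
}
```

So a single `char` is always two bytes, but one code point may need two of them.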

Jesper de Jong
Java Cowboy
Posts: 16084
88
You're right, Ulf; the intro of the Wikipedia page mentions 1 to 4 bytes, but later on it says it can be up to 6 bytes. The two figures most likely reflect the current standard versus the original design rather than an error on the page.

abalfazl hossein
Ranch Hand
Posts: 635

In this line, myBytes[i] is cast to int. The last byte of the integer is used to store the char. Right?

Ulf Dittmer
Rancher
Posts: 43028
76

abalfazl hossein wrote:The last byte of the integer is used to store the char. Right?

No. Please read the article I linked to.

Marshal
Posts: 76879
366
I think this is too difficult for "beginning", so I shall move this thread.

abalfazl hossein
Ranch Hand
Posts: 635

myBytes[i] is converted to int, because toBinaryString accepts an int argument.

Is that a type cast?

Jesper de Jong
Java Cowboy
Posts: 16084
88
The byte myBytes[i] is implicitly converted to an int (so, from 8 bits to 32 bits, with a widening primitive conversion) because toBinaryString() takes an int and not a byte. This is done by sign extension, which means that the extra bits on the left are filled with the leftmost bit (the sign bit) of the original byte.

For example: 11011001 -> leftmost bit is a 1, so when this is converted to a 32-bit int you get 11111111 11111111 11111111 11011001
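
Here is a short sketch of that widening conversion. It assumes nothing about the original code beyond a `byte` being passed where an `int` is expected; the class and the `widen` helper are hypothetical.

```java
// Sketch: implicit widening of a byte to an int preserves the value via sign extension.
class SignExtensionDemo {
    // No cast is written here; the compiler inserts the widening conversion.
    static int widen(byte b) {
        return b;
    }

    public static void main(String[] args) {
        byte negative = (byte) 0xD9; // bit pattern 11011001, value -39
        byte positive = (byte) 0x45; // bit pattern 01000101, value 69

        // Sign bit is 1, so the 24 new bits on the left are all 1s.
        System.out.println(widen(negative));                         // -39
        System.out.println(Integer.toBinaryString(widen(negative)));

        // Sign bit is 0, so the new bits are 0s (toBinaryString omits leading zeros).
        System.out.println(widen(positive));                         // 69
        System.out.println(Integer.toBinaryString(widen(positive))); // 1000101
    }
}
```

The numeric value is unchanged in both cases; only the number of bits used to hold it grows.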

But what do you mean by:

abalfazl hossein wrote:The last byte of the integer is used to store the char.

This line of code doesn't do anything with a char.

Campbell Ritchie
Marshal
Posts: 76879
366

abalfazl hossein wrote: . . . Is that a type cast?

No.
