I'm trying to save data from an HTMLEditor, and then load it again later. Normally it works fine, except when the content contains a non-ASCII Unicode character.
I can paste such characters into it, but when I debug the saving, the HTML doesn't contain an &#nnn; escape; it shows the actual character, which implies that it's just putting the raw Unicode character directly there! Even if I load a file that has the character in the &#nnn; format, when I save it still just translates it into the raw character! Then the next time I open the same file the character only appears as a question mark, as if it can't even read its own output!
So I started reading the actual numeric values of the bytes to see what characters they are, thinking that I'd just translate them manually into the &#nnn; format after loading them, and I get strange results. For example, if I read a lowercase Greek alpha character I get a single byte whose first two bits are 10, which according to UTF-8 means that byte should be a continuation byte and should never be the first, let alone the only, byte of a character! And I looked up what the code point for a lowercase alpha is supposed to be, and it's 945, which can't possibly fit in a single byte.
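For comparison, here's what a real UTF-8 encoding of a lowercase alpha looks like (just a quick sketch I'm adding for clarity, not my actual code); it takes two bytes, 0xCE followed by 0xB1:

import java.nio.charset.StandardCharsets;

public class AlphaBytes {
    public static void main(String[] args) {
        // Lowercase Greek alpha is code point 945 (U+03B1), which cannot fit in one byte.
        byte[] utf8 = "\u03B1".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);   // prints: CE B1
        }
        System.out.println();
    }
}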
So is it using something other than UTF-8, and if so, what? Or is there some way to prevent it from converting the &#nnn; reference into the actual encoded character?
FURTHER EDIT: Actually, I think I might have figured out what's wrong (but it's partially from memory since I don't have the code with me), so please tell me if I'm wrong: I think I was converting the char to a byte, and since a char can be 2 bytes, it was just cutting off the high byte and keeping only the low byte. Is that what happens when you convert a multi-byte char to a byte, or not? Because when I calculate what the bits would be if that happened, the result seems to match what I was getting when I debugged.
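Something like this little sketch (reconstructed from memory, not the real code) seems to show the effect I mean:

public class CharToByte {
    public static void main(String[] args) {
        char alpha = '\u03B1';            // lowercase Greek alpha, code point 945 (0x03B1)
        byte truncated = (byte) alpha;    // narrowing cast keeps only the low 8 bits: 0xB1

        System.out.printf("char value:  %04X%n", (int) alpha);       // 03B1
        System.out.printf("after cast:  %02X%n", truncated & 0xFF);  // B1 = 1011 0001, top bits 10
    }
}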
Terrance Samson wrote:I can paste such characters into it, but when I debug the saving, the HTML doesn't contain an &#nnn; escape; it shows the actual character, which implies that it's just putting the raw Unicode character directly there!
I don't understand that. It's perfectly normal to write HTML using the UTF-8 encoding, and to just put UTF-8-encoded characters into it. There's no need to use HTML escapes.
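If the round trip is turning the characters into question marks, the usual cause is that the file is being written or read with the platform default charset rather than UTF-8. Here is a minimal sketch of a consistent round trip, assuming Java 11+ and that you get the HTML back from the editor as a String:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlRoundTrip {
    // Write the editor's HTML out explicitly as UTF-8.
    static void save(Path file, String html) throws IOException {
        Files.writeString(file, html, StandardCharsets.UTF_8);
    }

    // Read it back with the same charset so non-ASCII characters survive.
    static String load(Path file) throws IOException {
        return Files.readString(file, StandardCharsets.UTF_8);
    }
}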
Yes, I realize that I shouldn't be converting it to a byte at all, but I was temporarily confused. I'm so used to thinking of a character as a byte that I assumed I was getting an array of bytes, each one either a whole character or a piece of one, when in fact it was sometimes a piece of one with the rest missing.
Anyway, I'm trying to do some other stuff with it and I need the character codes written out explicitly, but as it turns out, I was doing it the hard way. All I had to do was convert each char to an int: if it was larger than 255 I print its number with &# before and ; after, and if it was <= 255 I just print the char. Now it works fine. You should see the monstrosity I was trying to build to manually convert the bytes of a UTF-8 character into an int representing its code point!
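In case it helps anyone else, the approach boils down to something like this (a simplified sketch of what I described, not my exact code):

public class HtmlEscaper {
    // Replace every char above 255 with a numeric character reference (&#nnn;);
    // copy everything else through unchanged.
    static String escape(String html) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c > 255) {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<p>\u03B1</p>"));   // prints: <p>&#945;</p>
    }
}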