Understanding Byte Data and Character Encoding

 
Jaz Chana
Ranch Hand
Posts: 34
Hi Guys,

I am a relative newbie in the programming world and I'm trying to get my head around byte and character data in Java. Basically I don't understand what the difference is between byte data and character data.

From the beginning, all data has to eventually be transformed into 1s and 0s so that the computer can understand the data. This is where character encoding comes in. Now a bit is either a 1 or a 0. A byte is an octet/8 bit register, so a byte could look like 00001111. However this is useless for a human reader so it has to be converted into character data using some sort of character set and encoding. The most basic as I understand is ASCII which is a representation of 1 byte to 1 character (correct?).

Now in Java there is a byte type and a character type. According to this source:

http://java.sun.com/docs/books/tutorial/java/nutsandbolts/datatypes.html

A byte is an '8-bit signed two's complement integer' and a char is a '16-bit Unicode character'. To me that means that a char is two bytes, with unicode character encoding. Or a byte is half a char without any encoding.

Is this true? It doesn't sound correct.

If that is true then why is it, for example, that the following application outputs the numbers: "104 101 108 108 111 32 119 111 114 108 100" (hex data I'm assuming), when I would have expected to see 0s and 1s?

by the way, the text in the file is "Hello World"



also why is it that a character stream in the next program pulling data from the same file outputs the same set of numbers?



I would have expected the string to output as it was to be represented, or at least be different from the byte data. It is after all using a different encoding.

Can someone please explain why?

Thank You
 
Martijn Verburg
author
Bartender
Posts: 3275
5
Eclipse IDE Java Mac OS X
Hi Jaz,

First of all thanks for posting such a great question! I can see you've already put some good thought into this.

Originally posted by Jaz Chana:
Hi Guys,

From the beginning, all data has to eventually be transformed into 1s and 0s so that the computer can understand the data. This is where character encoding comes in. Now a bit is either a 1 or a 0. A byte is an octet/8 bit register, so a byte could look like 00001111. However this is useless for a human reader so it has to be converted into character data using some sort of character set and encoding. The most basic as I understand is ASCII which is a representation of 1 byte to 1 character (correct?).


That is correct, yes: the ASCII code set is 1 character per byte (strictly it only needs 7 of the 8 bits).



A byte is an '8-bit signed two's complement integer' and a char is a '16-bit Unicode character'. To me that means that a char is two bytes, with unicode character encoding. Or a byte is half a char without any encoding.

Is this true? It doesn't sound correct.


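The first part is correct: a Java char is 16 bits, i.e. two bytes. The Unicode character set also includes characters that can be represented with a single byte (think about how many unique characters can be represented by a single byte...), so a byte on its own can still be a whole character in some encodings.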

Now the use of a FileInputStream to read characters is _not_ recommended for this very reason. If you are reading in certain Unicode characters (ones that take 2 or more bytes to represent) then reading off just one byte is only going to give you part of a character. It's best to use FileReader instead, which decodes the bytes into characters using the default character encoding.
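To make the byte stream versus character stream difference concrete, here is a minimal sketch. It is not the code from the original post (that was lost); the class and method names are made up for illustration, and it reads from an in-memory byte array decoded as UTF-8 rather than from a file, so it runs without any setup.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ByteVersusCharRead {

    // Read the data one byte at a time, the way a FileInputStream would.
    static List<Integer> readAsBytes(byte[] data) {
        List<Integer> values = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int value;
            while ((value = in.read()) != -1) {
                values.add(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    // Read the same data one character at a time, decoding it as UTF-8
    // (an assumption made for this sketch).
    static List<Integer> readAsChars(byte[] data) {
        List<Integer> values = new ArrayList<>();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int value;
            while ((value = in.read()) != -1) {
                values.add(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] accented = "h\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAsBytes(ascii));    // [104, 101, 108, 108, 111]
        System.out.println(readAsChars(ascii));    // [104, 101, 108, 108, 111]
        System.out.println(readAsBytes(accented)); // [104, 195, 169]
        System.out.println(readAsChars(accented)); // [104, 233]
    }
}
```

For plain ASCII text the two approaches print identical numbers, which is probably why both of your programs gave the same output. They only diverge once a character needs more than one byte.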



If that is true then why is it, for example, that the following application outputs the numbers: "104 101 108 108 111 32 119 111 114 108 100" (hex data I'm assuming), when I would have expected to see 0s and 1s?

by the way, the text in the file is "Hello World"




The read() method is pulling back bytes from the stream as an int, so you're getting a decimal (not hex) representation of the byte you are reading in. If you compare the numbers you are getting to an ASCII decimal chart you'll see how they match up to actual characters.
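As a rough illustration of that lookup, the sketch below turns the numbers from your output back into text by casting each one to a char. The class and method names are hypothetical, and it assumes every value is a plain ASCII code.

```java
class DecodeAsciiValues {

    // Turn a space-separated list of decimal byte values into text,
    // assuming every value is a plain ASCII code (0-127).
    static String decode(String decimalValues) {
        StringBuilder text = new StringBuilder();
        for (String token : decimalValues.trim().split("\\s+")) {
            int value = Integer.parseInt(token);
            text.append((char) value);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The numbers quoted earlier in this thread.
        System.out.println(decode("104 101 108 108 111 32 119 111 114 108 100"));
        // prints "hello world"
    }
}
```

Note that these particular values come out as lowercase "hello world"; a capital "H" would have shown up as 72 and a capital "W" as 87.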



also why is it that a character stream in the next program pulling data from the same file outputs the same set of numbers?


I would have expected the string to output as it was to be represented, or at least be different from the byte data. It is after all using a different encoding.

Can someone please explain why?

Thank You

Basically all you are getting is a string representation of the number that is the int representation of the byte. Your code isn't actually doing the character set conversion, 104 to "h" for example. I would take a look at the FileReader class to get the desired behaviour.

Hope that all helps!



[ September 08, 2008: Message edited by: Martijn Verburg ]
 
Jaz Chana
Ranch Hand
Posts: 34
Thank you very much for your reply. It has cleared things up a little. I have many more questions, but firstly I should clear something up. The code for the character data read was wrong. The code I was really referring to was this:



The other code actually does print the text output. I'll come back to the above in a sec, but firstly I have some other questions.

Okay, now I understand that the hex data relates to alphabet chars converted by the ascii table. But where does it state to use hex encoding? I thought that the encoding is determined by the stream and not the data type used to store it. I want it to use ASCII or unicode encoding.

Why does java use an int to read the data in? Surely it would be better off using a byte or a char value. The justification in the documentation (http://java.sun.com/docs/books/tutorial/essential/io/bytestreams.html) states that 'Using a int as a return type allows read() to use -1 to indicate that it has reached the end of the stream.' But what is wrong with using a byte or char and having null indicate the end of file, or even a String? In fact, how is it that char/byte data can at all be represented by an int?

It just doesn't make sense to me. :~

Leaving that aside for a moment, I am going to take this discussion up a level and talk about Blob (Binary Large Object) and Clob (Character Large Object) data. This whole problem came about as a result of my attempt to store some large XML and string data in a MySQL database. I couldn't decide whether to use Clob or Blob in a MySQL database (Clob is the same as longtext). If I understand correctly, Blob would use ASCII encoding and Clob would use Unicode. Since Unicode is a superset of ASCII, you can store all ASCII characters and more in a Clob.

Given what I have learned, it makes more sense to store them as Clob. However, since there are more bytes per Clob (since Unicode can use up to 4 bytes per char) it uses more memory. Hence one advantage of a Blob over a Clob is use of memory.

Is this the same for char and byte data in Java? Is byte data represented as ASCII? I've heard of byte data being referred to as binary. In reality that's not true, is it? Byte data is as close to binary as char data. Both byte and char data are encoded, the only difference is that they use different encodings. Is this correct? Is this the same for all data types? The only difference between them is the encoding? Is this the reason why an int can represent a byte?



Okay, one last area is how the information is stored. So far we've been talking about taking data and representing it in different ways, but we haven't really delved into the initial state of the data. At the beginning we stated that everything is represented on a computer by 1s and 0s. If this is true then to a computer there is no difference between character and byte data. The difference only becomes apparent when the information has to be displayed.

When people talk about storing data I sometimes hear that they want to store that information as binary/byte data or as an array of bytes. For example, taking a string converting it to an array of bytes and storing it seems common place on the net. But why would you want to do that? Surely you would lose data if you did that? And considering that a byte could potentially split a char into 4 (assuming that the original data was unicode), is it true to say that the data would be corrupted and not possible to convert back?

In the above code, FileReader is a character stream that is really using a byte stream to read the data. That means that the data is stored as bytes but converted to characters when they are loaded. This makes no sense. How can anything be stored as a byte, assuming a byte is also encoded? Even if this was the case, why does anyone use a character stream to read byte data? Taking the assumption that Unicode (which is character type encoding) is a superset of byte (which is ASCII encoding), you're not going to read anything that needs to be represented in Unicode, since it couldn't have been stored as such? :~

As you can see I am extremely confused on the subject. I feel like I am on the verge of understanding, but there are some fundamental concepts that elude me. The biggest of which are around storage, retrieval and display.

I hope you are able to address these issues. If not, could you point me in the direction of an article that can?

Thanks
Jaz
[ September 08, 2008: Message edited by: Jaz Chana ]
 
Martijn Verburg
author
Bartender
Posts: 3275
5
Eclipse IDE Java Mac OS X
Hi Jaz,

I'll try to answer some of this, it's been a while since I've looked at the nuts and bolts of this so bear with me.

Originally posted by Jaz Chana:
Thank you very much for your reply. It has cleared things up a little. I have many more questions, but firstly I should clear something up. The code for the character data read was wrong. The code I was really referring to was this:



The other code actually does print the text output. I'll come back to the above in a sec, but firstly I have some other questions.


OK, you're using a method there that's not quite going to give you what you want. If you follow the Javadoc for the FileReader API you'll see that the read() method you are using is still returning the characters as an int. You want to use an alternative read() method in which you pass in a character array that automatically gets filled.
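A minimal sketch of that alternative read() call is below. The class name, method name and the tiny 4-char buffer are all made up for illustration, and it reads from a StringReader so that it runs without a file; a FileReader can be passed in the same way because both are Readers.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadIntoCharArray {

    // Drain a Reader using read(char[]), which fills the buffer with
    // characters and returns how many it stored (or -1 at the end).
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4]; // deliberately small to force several reads
        try (Reader in = reader) {
            int count;
            while ((count = in.read(buffer)) != -1) {
                // Only the first 'count' chars of the buffer are valid.
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(readAll(new StringReader("Hello World")));
        // prints "Hello World"
    }
}
```

If you would rather keep the single-character read(), casting its result to char before printing, as in (char) value, should give you the letter instead of the number.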


Okay, now I understand that the hex data relates to alphabet chars converted by the ascii table. But where does it state to use hex encoding?
I thought that the encoding is determined by the stream and not the data type used to store it. I want it to use ASCII or unicode encoding.


Hex and decimal aren't encodings; they're simply two ways of writing down the number held in each byte (104 in decimal is 68 in hex). Specifying ASCII/Unicode encoding simply tells it how many bytes to use for a character and what the mapping to the actual character should be.
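To show that these are only different spellings of one value, here is a small sketch that prints the same byte in binary, decimal and hex, and then as the character it maps to in ASCII. The class and helper names are hypothetical.

```java
class NumberNotations {

    // The byte value written as eight binary digits.
    static String binary(int byteValue) {
        String bits = Integer.toBinaryString(byteValue & 0xFF);
        // Left-pad with zeros so it looks like a full 8-bit byte.
        return String.format("%8s", bits).replace(' ', '0');
    }

    // The byte value written in hexadecimal.
    static String hex(int byteValue) {
        return Integer.toHexString(byteValue & 0xFF);
    }

    public static void main(String[] args) {
        int value = 104;
        System.out.println(binary(value));   // 01101000
        System.out.println(value);           // 104
        System.out.println(hex(value));      // 68
        System.out.println((char) value);    // h
    }
}
```

Only the last line involves a character encoding; the first three are the same number printed in different bases.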


Why does java use an int to read the data in? Surely it would be better off using a byte or a char value. The justification in the documentation (http://java.sun.com/docs/books/tutorial/essential/io/bytestreams.html) states that 'Using a int as a return type allows read() to use -1 to indicate that it has reached the end of the stream.' But what is wrong with using a byte or char and having null indicate the end of file, or even a String? In fact, how is it that char/byte data can at all be represented by an int?

It just doesn't make sense to me. :~


Didn't make sense to a lot of us when the first I/O APIs came out.

Basically the Stream classes are designed to work at the 'lowest' level just above bits, which allows programmers or higher level API calls maximum flexibility. Your case of wanting to read the characters in a human readable form is only one of many use cases for those Classes/methods (I can give examples of other cases if that helps).


Leaving that aside for a moment, I am going to take this discussion up a level and talk about Blob (Binary Large Object) and Clob (Character Large Object) data. This whole problem came about as a result of my attempt to store some large XML and string data in a MySQL database. I couldn't decide whether to use Clob or Blob in a MySQL database (Clob is the same as longtext). If I understand correctly, Blob would use ASCII encoding and Clob would use Unicode. Since Unicode is a superset of ASCII, you can store all ASCII characters and more in a Clob.


Hmm, I don't know much about mysql but you're correct about unicode being a superset of ASCII


Given what I have learned, it makes more sense to store them as Clob. However, since there are more bytes per Clob (since Unicode can use up to 4 bytes per char) it uses more memory. Hence one advantage of a Blob over a Clob is use of memory.


To go back to the original problem, how large is your XML? You may find that BLOB or CLOB storage is not required.


Is this the same for char and byte data in java? Is byte data represented as ASCII? I've heard of byte data being referred to as binary. In reality that's not true is it? byte data is as close to binary as char data. Both byte and char data are encoded, the only difference is that they use different encodings. Is this correct? Is this the same for all data types? The only difference between them is the encoding? Is this the reason why an int can represent a byte?


I'll answer it this way. Bytes are simply the lowest level building blocks for data, they can represent integers, characters etc. char data is at a slightly higher level as several bytes can make up a char. You'll find that byte and char Classes/methods seem almost the same, which adds to the confusion, but they _are_ different.

The usage of an int to represent a byte is just a common low-level way to represent the 8-bit register: read() returns each byte as a value from 0 to 255, which leaves -1 free to signal the end of the stream. It has a small memory footprint and is easy to manipulate (for example if you're going to blindly copy the contents of a file you wouldn't want to convert the bytes into actual characters and then copy, you'd just want to do a raw copy).
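Here is a small sketch of why the return type needs to be wider than a byte. A Java byte is signed, so the bit pattern 11111111 is -1 when held in a byte; if read() returned a byte, that perfectly valid value could not be told apart from the end-of-stream marker. The class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;

class WhyReadReturnsInt {

    // First value read from a one-byte stream that holds 0xFF.
    static int firstRead() {
        ByteArrayInputStream in =
                new ByteArrayInputStream(new byte[] { (byte) 0xFF });
        return in.read();
    }

    // Value returned by read() once there is nothing left to read.
    static int readAtEnd() {
        ByteArrayInputStream in = new ByteArrayInputStream(new byte[0]);
        return in.read();
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xFF;
        System.out.println(raw);          // -1  (signed byte)
        System.out.println(raw & 0xFF);   // 255 (same bits, as an int)
        System.out.println(firstRead());  // 255 (a real data byte)
        System.out.println(readAtEnd());  // -1  (end of stream)
    }
}
```

With an int, data bytes occupy 0 to 255 and -1 is left over as an unambiguous "nothing more to read" signal. A primitive byte or char can't be null, which is why null isn't an option here.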


Okay, one last area is how the information is stored. So far we've been talking about taking data and representing it in different ways, but we haven't really delved into the initial state of the data. At the beginning we stated that everything is represented on a computer by 1s and 0s. If this is true then to a computer there is no difference between character and byte data. The difference only becomes apparent when the information has to be displayed.


At the lowest level that is correct yes, computers only understand 1's and 0's, it's the programming data constructs and languages on top of that that convert the bytes into meaningful things.


When people talk about storing data I sometimes hear that they want to store that information as binary/byte data or as an array of bytes. For example, taking a string converting it to an array of bytes and storing it seems common place on the net. But why would you want to do that? Surely you would lose data if you did that? And considering that a byte could potentially split a char into 4 (assuming that the original data was unicode), is it true to say that the data would be corrupted and not possible to convert back?


Now you're getting to the heart of the matter! Putting Strings back into a byte array is common, as you know bytes are the lowest level so again it's a matter of efficiency etc to deal with that low level when you are performing low level operations (like copy). More often than not this low level behaviour is hidden by a higher level API call. You definitely don't lose data although you can 'corrupt' the data by reading it back in using the wrong encoding.

For example you encode (Unicode) a complex character from ancient Egypt and that gets stored as 4 bytes [100, 230, 5, 4]. You then read it back in as ASCII and you get 4 separate characters 100, 230, 5 and 4 (because ASCII encoding says 1 byte == 1 character), however if you used the right reader/encoding (Unicode) it knows to retrieve the full 4 bytes [100, 230, 5, 4] as a character.
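The round trip can be sketched roughly as follows. The assumptions: the character is U+13000 (an Egyptian hieroglyph), it is stored as UTF-8, and ISO-8859-1 stands in for the "1 byte == 1 character" reading, because Java's US-ASCII decoder would replace bytes above 127 instead of showing them. The byte values in the paragraph above are illustrative; the real UTF-8 bytes differ. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {

    // Store text as bytes with one charset, then read it back with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] stored = text.getBytes(writeWith);
        return new String(stored, readWith);
    }

    public static void main(String[] args) {
        // One Unicode character: U+13000, EGYPTIAN HIEROGLYPH A001.
        String glyph = new String(Character.toChars(0x13000));

        byte[] stored = glyph.getBytes(StandardCharsets.UTF_8);
        System.out.println(stored.length); // 4 bytes on disk

        // Same encoding both ways: nothing is lost.
        String good = roundTrip(glyph, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(good.equals(glyph)); // true

        // Wrong encoding on the way back: four unrelated characters.
        String bad = roundTrip(glyph, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(bad.length()); // 4
    }
}
```

The bytes themselves are never damaged; only the interpretation is wrong, so re-reading the same bytes with the right encoding recovers the original text.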

I'd also recommend looking at the Sun tutorial and the Javaranch FAQ on this.


As you can see I am extremely confused on the subject. I feel like I am on the verge of understanding, but there are some fundamental concepts that elude me. The biggest of which are around storage, retrieval and display.


Actually I think you're doing extremely well, you're looking at this at a much lower level than most people would! For example, most people just use a high level API such as hibernate to save XML to a database, you'd literally go and it's done.

If you haven't thought about it already I'd recommend taking a few basic Computer Science papers, they deal with subjects like this and I suspect you'd be very good at it!


[ September 08, 2008: Message edited by: Martijn Verburg ]
 
Martijn Verburg
author
Bartender
Posts: 3275
5
Eclipse IDE Java Mac OS X
I've heavily edited the post above, adding in some more stuff and trying to get the quote tags sorted out (didn't really work), but do read the newly edited text for some useful links on this subject.
 
Jaz Chana
Ranch Hand
Posts: 34
Thank you very much, you've been instrumental in helping me understand. This will definitely be my first port of call the next time I have an issue.

I think I need to go over it a few times before I have a solid understanding. I am now realizing that actually I am not asking one question on one area, but several at once. Therefore I need to take time in digesting everything and asking those questions again.

However, I now have enough information to solve the issue I have. I think I should store the data as Clob. The XML and the String data are extremely large. We are talking about possibly several megabytes per string or XML document. I don't know if the data will contain more than just ASCII data, so to be on the safe side I think I should store it as Clob.

What do you think?

I will read those articles you sent. The one on the Sun website I have read before, but I think I need to go over it again. The JavaRanch article is new to me and I will be going through it with a fine-tooth comb. As far as the computer science papers, are there any free online resources you can recommend?

I was thinking about purchasing this book:

http://www.amazon.co.uk/Java-I-O-Elliotte-Harold/dp/0596527500/ref=sr_1_2?ie=UTF8&s=books&qid=1220878392&sr=8-2

Do you have any opinions on it?
[ September 08, 2008: Message edited by: Jaz Chana ]
 
Martijn Verburg
author
Bartender
Posts: 3275
5
Eclipse IDE Java Mac OS X
Hi Jaz,

Originally posted by Jaz Chana:
Thank you very much, you've been instrumental in helping me understand. This will definitely be my first port of call the next time I have an issue.


I've only just started getting involved in this community recently, but I've found it to be an excellent place to ask questions no matter what your level of experience. I've already learned some new things and been humbled on more than one occasion.


I think I need to go over it a few times before I have a solid understanding. I am now realizing that actually I am not asking one question on one area, but several at once. Therefore I need to take time in digesting everything and asking those questions again.

However, I now have enough information to solve the issue I have. I think I should store the data as Clob. The XML and the String data are extremely large. We are talking about possibly several megabytes per string or XML document. I don't know if the data will contain more than just ASCII data, so to be on the safe side I think I should store it as Clob.

What do you think?



It really depends on what you are storing those docs for. Do you want to be able to search on the contents of those docs?

* BLOB and CLOB are not easily human readable forms (and therefore are not searchable either), are you really gaining anything over storing the XML on a file system? Or do you have other meta data that you are storing in that same table which will assist you with searches etc?

* There are a number of XML extensions that some database vendors offer (mysql might be one of them) where you can actually store your XML in an 'XML datastore' and you can use XPath etc to directly search on that data.

* PS I'm assuming you're in academic research.

* You might find this article handy as well


I will read those articles you sent. The one on the sun forums I have read before, but I think i need to go over it again. The java ranch article is new to me and I will be going through it with a fine comb. As far as the computer science papers, are there any free online resources you can recommend?


You can find a site of links and resources Here. I'd Google for others as well and see what suits your personal preference.


I was thinking about purchasing this book:

http://www.amazon.co.uk/Java-I-O-Elliotte-Harold/dp/0596527500/ref=sr_1_2?ie=UTF8&s=books&qid=1220878392&sr=8-2

do you have any opinions on it?

Yes actually, it's very good. The author is pretty well known in Java and XML circles; if you Google him you'll find his Java and XML sites, which also contain many resources. In general you can almost always trust an O'Reilly title (when it comes to Java anyhow).


Hope that helps!
[ September 08, 2008: Message edited by: Martijn Verburg ]
 