I know that the entropy represents the minimum number of bits needed to represent a symbol of a source of information.

P_[mi] define the probability of the i-th symbol of the file, where as a symbol assume an m bit string.

These probabilities can be estimated numerically evaluating the corresponding statistical frequencies.

And starting from the estimated probabilities I can calculate the entropy of order m.

So I have to write a program that is able to estimate the entropy of order m, for values of m <= 16 bit regarding a file.

Well, I don't know where to begin.

I don't want the task done, only suggestions to help me figure out how to start and what direction to take.

My ideas are:

1. create a program (in Java) that reads in a text input file

2. take for example m = 2. Now, knowing that the numbers are 10 (0...9) and the letters are 26 (A...Z), I should find out what is the frequency that each pair can be made among the 26 characters appear in the text.

3. since the frequency is comparable to the probability, I can then calculate the entropy.

In this case "easy" in which I considered only alphanumeric symbols and m = 2.

Then I can play by adding symbols and increasing and decreasing m.

It's correct?

I am very confused, I need your help.

Thank you very much

First, your general steps are correct, insofar as step 1 is create a program that reads text and steps 2-3 are apply an algorithm to that text to calculate the entropy. The big question is, what algorithm should you use? Here's where I start to have some questions:

1. You are expressing m in terms of bits, but your step 2 is talking about letters and numbers, which are expressed in bytes.

2. Have you been given any sort of direction on this in class (I am assuming this is a class assignment)? Are you supposed to develop the algorithm yourself, or just write a Java program to implement an algorithm? I could point you to an algorithm to use (the one I am thinking of works for m=8, but we can help you figure out how to adjust it for other values of m), but if that's not the assignment, I don't want to start down that path.

Perhaps if you were to post the actual text of the assignment, we might be able to provide a little clearer direction.

Piscis Babelis est parvus, flavus, et hiridicus, et est probabiliter insolitissima raritas in toto mundo.

Today I thought best at the problem and I was able to write some code in Java.

But I don't know if it is correct or not.

So in this way it should work properly.

The problem is that I don't know if the

`readFileInputStream`method takes really the bytes.

For example, if I consider as input file a text file like the one below, I get that

`mapO`and

`mapP`are:

__Text__:

*Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?*

__mapO__:

[

32 = 5

65 = 0

97 = 0

98 = 0

99 = 0

100 = 0

101 = 4

103 = 2

105 = 3

108 = 0

110 = 2

111 = 0

114 = 1

115 = 0

116 = 2

118 = 0

119 = 0

121 = 0

]

__mapP__(I removed symbols with probability=0 because I don't want to consider them):

[

32 = 0.2777777777777778

114 = 0.05555555555555555

116 = 0.1111111111111111

101 = 0.2222222222222222

103 = 0.1111111111111111

105 = 0.16666666666666666

110 = 0.1111111111111111

]

Why are 2 or 3-bit numbers (32 65 97 98 99 100 101...)? It should not be sequences of 8-bit type 00000000, 00000001, 00000010, etc?

I was wrong to use the method?

I don't know how to explain, I should calculate the entropy for different values of

`m`so I need to read the input file as a sequence of bits (

`m = 2`), sequence of bytes (

`m = 8`, which is what I tried to do), as a

`4-bit`sequence, etc.

How can I do this?

I must definitely use another method. I read that in Java you can read the files only to the maximum until the bytes, for example not as a bit sequence...

Why are 2 or 3-bit numbers (32 65 97 98 99 100 101...)? It should not be sequences of 8-bit type 00000000, 00000001, 00000010, etc?

I was wrong to use the method?

This is just the decimal representation of the byte (they are *not* 2 or 3-bit numbers, by the way!). By default, it will print that instead of the bit-string. So, 32 is 00100000, 114 is 01110010, etc. This means that 32 is a 6-bit number (it takes a minimum of 6 bits to represent it in base 2), and 114 is by the same reasoning a 7-bit number. Since bytes are 8-bit, these numbers can be represented by a byte. 597, however, cannot (it takes more than 8 bits to represent that value).

I must definitely use another method. I read that in Java you can read the files only to the maximum until the bytes, for example not as a bit sequence...

You are right, but I wouldn't change how you read the file. Instead, see point 2 below.

1). I would not use Byte as the keys in my maps, because that will not handle your larger m-values (Bytes handle a maximum of m = 8, and even then you might see some unexpected results if you print out the Map, since bytes in Java are signed!)

2). You will need another method to transform your byte[] input to an array (or List) of something else. For example, an input of byte[] {114} and m = 2 should give you back an array of two-bit numbers ({1, 3, 0, 2}). I leave it as an exercise to you to figure out why 114 would become those four numbers, but I gave you the answer already in this post. What should an input of byte[]{114} and m = 4 return? If you can figure that out, you are most of the way to writing the transformation function.

3). Do you have to handle cases like m = 3? That can be interesting, because there is no guarantee that a file will have an even multiple of 3 bits. What should be done in those cases? (What would byte[]{114} and m=3 return?(

But before you start those changes, can you get it working front-to-back for m = 8? What entropy value does it give you for that text input?

Piscis Babelis est parvus, flavus, et hiridicus, et est probabiliter insolitissima raritas in toto mundo.

Joel McNary wrote:1). I would not use Byte as the keys in my maps, because that will not handle your larger m-values (Bytes handle a maximum of m = 8, and even then you might see some unexpected results if you print out the Map, since bytes in Java are signed!)

Ok, so you're saying that it's better not to use a byte as a key because it would create problems when then I have to change the value of m (m = 1, m = 2, etc.)?

Joel McNary wrote:2). You will need another method to transform your byte[] input to an array (or List) of something else. For example, an input of byte[] {114} and m = 2 should give you back an array of two-bit numbers ({1, 3, 0, 2}). I leave it as an exercise to you to figure out why 114 would become those four numbers, but I gave you the answer already in this post. What should an input of byte[]{114} and m = 4 return? If you can figure that out, you are most of the way to writing the transformation function.

With m= 2 --> (114)_10 = (01 11 00 10)_2 where 01 = 1, 11 = 3, 00 = 0 and 10 = 2.

With m= 4 --> (114)_10 = (0111 0010)_2 where 0111 = 7 and 0010 = 2.

So I would need a Java function that, taken one byte (which I see as a decimal), decompose the bytes in portions long m with m variable.

Right? And how can I do that?

Joel McNary wrote:3). Do you have to handle cases like m = 3? That can be interesting, because there is no guarantee that a file will have an even multiple of 3 bits. What should be done in those cases? (What would byte[]{114} and m=3 return?

With m= 3 --> (114)_10 = (01 110 010)_2 where 110 = 6 and 010 = 2.

I don't consider the last two bits because they are not a correct pattern.

Question: is it right to throw the two most significant bit or the least significant?

Is correct to not consider the "excess" bits?

Joel McNary wrote:But before you start those changes, can you get it working front-to-back for m = 8? What entropy value does it give you for that text input?

I admit that I have understood very well what result I should expect ...

First of all, I'm sorry but I'mt not speaking very well English (rather so bad), so I could be wrong to interpret and/or answer.

You are doing quite well. At the very least, we are misunderstanding each other in the same way

Ok, so you're saying that it's better not to use a byte as a key because it would create problems when then I have to change the value of m (m = 1, m = 2, etc.)?

Yes. Especially for m > 8.

So I would need a Java function that, taken one byte (which I see as a decimal), decompose the bytes in portions long m with m variable.

Right? And how can I do that?

Yes, you are understanding what the function needs to do. In order to do that, look at bitwise comparators (the single-character '|' and '&' operators). You might also find the bit shift operators useful ( '<<', '>>', and '>>>'). This took me a little bit of thought to come up with an algorithm, so don't get too discouraged it it takes you a while as well.

Question: is it right to throw the two most significant bit or the least significant?

Is correct to not consider the "excess" bits?

That I do not know; file entropy is usually expressed with m = 8, so I don't know what would be correct for m = 3. If I, personally, were to discard anything, however, it would be the least significant bits.

Piscis Babelis est parvus, flavus, et hiridicus, et est probabiliter insolitissima raritas in toto mundo.

My idea:

- Transform the array

**buffer**in a list

**list**of elements long

*m*. But a list of what kind?

*Int*no,

*Double*no, of course

*bytes*no, uhm...

- For now I consider only the simplest case, that is, file multiple of

*m*or I don't consider the unnecessary bits

- I switch

**list**to

**calculateShannonEntropy**method that does the same things he does now, but maps have the same type of

**list**instead

*byte*.

Thank you so much for help you are giving me

Joel McNary wrote:I would suggest using Integer as the type for your list. Integers in Java are 32 bits, and you need to handle m <= 16, so Integer will be more than enough. If you want, you could use Short (which has a length of 16 bits), but if you start debugging you could see some unexpected display results and get confused. Short is a type really only used in memory-limited applications, though, and so isn't commonly used in classroom assignments. Other than that, your next steps look correct to me.

Ok, then tomorrow I go ahead following your advice. Thank you

The code would be something like:

I found that:

but it return only the byte corresponding to bit in position

**position**.

Not a sequence of bit.. I would take 2 bit (if m=2), 3 bit (if m=3), ...

And if I read bytes, then how can I consider sequences of m>8?

I change my InputStream from Byte to Integer?

I'm confused..

I consider it an integer (32 bits) as recommended:

Where:

**n**is an item in the

**buffer**, so the

*buffer[i]*with

*i=0...buffer.length-1*

**m**is the length of the sequence I must build

This seems to work for me for

**m = 1**,

**m = 2**,

**m = 4**,

**m = 8**and

**m = 16**.

In other cases not because they are not multiples of 32, so I should delete the remaining part.

The problem is create a strem of short. I decide to use short because I think it's easier to manipulate single bit.

This is the cycle I think to use once I had a list (

*buffer*) of short:

In this case my problem is create a stream of short, so create an array of short.

This is what I've done:

But I lose a byte and I don't think this is the best way to proced.

If I follow the advice to use integers instead of short I do not know how to manipulate the bits and how to create the array int[] buffer from file.

I don't think that using a loop like this may be fine.

So not sure how continue..I know it is easy but I'm in trouble.

Thanks

Don't get me started about those stupid light bulbs. |