programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
Sheriffs:
Saloon Keepers:
Bartenders:

# Find the entropy of order m of a text file

Ranch Hand
Posts: 66
Hi, I have to write a program that is able to estimate the probability of each symbol of a file and then evaluate the entropy.

I know that the entropy represents the minimum number of bits needed to represent a symbol of a source of information.
P_[mi] define the probability of the i-th symbol of the file, where as a symbol assume an m bit string.
These probabilities can be estimated numerically evaluating the corresponding statistical frequencies.
And starting from the estimated probabilities I can calculate the entropy of order m.

So I have to write a program that is able to estimate the entropy of order m, for values of m <= 16 bit regarding a file.
Well, I don't know where to begin.
I don't want the task done, only suggestions to help me figure out how to start and what direction to take.

My ideas are:
1. create a program (in Java) that reads in a text input file
2. take for example m = 2. Now, knowing that the numbers are 10 (0...9) and the letters are 26 (A...Z), I should find out what is the frequency that each pair can be made among the 26 characters appear in the text.
3. since the frequency is comparable to the probability, I can then calculate the entropy.

In this case "easy" in which I considered only alphanumeric symbols and m = 2.
Then I can play by adding symbols and increasing and decreasing m.
It's correct?

I am very confused, I need your help.
Thank you very much

Bartender
Posts: 1840
Your question is not entirely clear, but I think I might begin to understand what you are asking -- certainly, when you are confused about an assignment, it can be difficult to ask clear questions!

First, your general steps are correct, insofar as step 1 is create a program that reads text and steps 2-3 are apply an algorithm to that text to calculate the entropy. The big question is, what algorithm should you use? Here's where I start to have some questions:

1. You are expressing m in terms of bits, but your step 2 is talking about letters and numbers, which are expressed in bytes.

2. Have you been given any sort of direction on this in class (I am assuming this is a class assignment)? Are you supposed to develop the algorithm yourself, or just write a Java program to implement an algorithm? I could point you to an algorithm to use (the one I am thinking of works for m=8, but we can help you figure out how to adjust it for other values of m), but if that's not the assignment, I don't want to start down that path.

Perhaps if you were to post the actual text of the assignment, we might be able to provide a little clearer direction.

Alicia Perry
Ranch Hand
Posts: 66
Hi Joel McNary, thanks for the reply.
Today I thought best at the problem and I was able to write some code in Java.
But I don't know if it is correct or not.

So in this way it should work properly.
The problem is that I don't know if the readFileInputStream method takes really the bytes.

For example, if I consider as input file a text file like the one below, I get that mapO and mapP are:

Text:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?

mapO:
[
32 = 5
65 = 0
97 = 0
98 = 0
99 = 0
100 = 0
101 = 4
103 = 2
105 = 3
108 = 0
110 = 2
111 = 0
114 = 1
115 = 0
116 = 2
118 = 0
119 = 0
121 = 0
]

mapP (I removed symbols with probability=0 because I don't want to consider them):
[
32 = 0.2777777777777778
114 = 0.05555555555555555
116 = 0.1111111111111111
101 = 0.2222222222222222
103 = 0.1111111111111111
105 = 0.16666666666666666
110 = 0.1111111111111111
]

Why are 2 or 3-bit numbers (32 65 97 98 99 100 101...)? It should not be sequences of 8-bit type 00000000, 00000001, 00000010, etc?
I was wrong to use the method?

I don't know how to explain, I should calculate the entropy for different values of m so I need to read the input file as a sequence of bits (m = 2), sequence of bytes (m = 8, which is what I tried to do), as a 4-bit sequence, etc.

How can I do this?
I must definitely use another method. I read that in Java you can read the files only to the maximum until the bytes, for example not as a bit sequence...

Joel McNary
Bartender
Posts: 1840
You are headed along the correct path.

Why are 2 or 3-bit numbers (32 65 97 98 99 100 101...)? It should not be sequences of 8-bit type 00000000, 00000001, 00000010, etc?
I was wrong to use the method?

This is just the decimal representation of the byte (they are *not* 2 or 3-bit numbers, by the way!). By default, it will print that instead of the bit-string. So, 32 is 00100000, 114 is 01110010, etc. This means that 32 is a 6-bit number (it takes a minimum of 6 bits to represent it in base 2), and 114 is by the same reasoning a 7-bit number. Since bytes are 8-bit, these numbers can be represented by a byte. 597, however, cannot (it takes more than 8 bits to represent that value).

I must definitely use another method. I read that in Java you can read the files only to the maximum until the bytes, for example not as a bit sequence...

You are right, but I wouldn't change how you read the file. Instead, see point 2 below.

1). I would not use Byte as the keys in my maps, because that will not handle your larger m-values (Bytes handle a maximum of m = 8, and even then you might see some unexpected results if you print out the Map, since bytes in Java are signed!)
2). You will need another method to transform your byte[] input to an array (or List) of something else. For example, an input of byte[] {114} and m = 2 should give you back an array of two-bit numbers ({1, 3, 0, 2}). I leave it as an exercise to you to figure out why 114 would become those four numbers, but I gave you the answer already in this post. What should an input of byte[]{114} and m = 4 return? If you can figure that out, you are most of the way to writing the transformation function.
3). Do you have to handle cases like m = 3? That can be interesting, because there is no guarantee that a file will have an even multiple of 3 bits. What should be done in those cases? (What would byte[]{114} and m=3 return?(

But before you start those changes, can you get it working front-to-back for m = 8? What entropy value does it give you for that text input?

Alicia Perry
Ranch Hand
Posts: 66
First of all, I'm sorry but I'mt not speaking very well English (rather so bad), so I could be wrong to interpret and/or answer.

Joel McNary wrote:1). I would not use Byte as the keys in my maps, because that will not handle your larger m-values (Bytes handle a maximum of m = 8, and even then you might see some unexpected results if you print out the Map, since bytes in Java are signed!)

Ok, so you're saying that it's better not to use a byte as a key because it would create problems when then I have to change the value of m (m = 1, m = 2, etc.)?

Joel McNary wrote:2). You will need another method to transform your byte[] input to an array (or List) of something else. For example, an input of byte[] {114} and m = 2 should give you back an array of two-bit numbers ({1, 3, 0, 2}). I leave it as an exercise to you to figure out why 114 would become those four numbers, but I gave you the answer already in this post. What should an input of byte[]{114} and m = 4 return? If you can figure that out, you are most of the way to writing the transformation function.

With m= 2 --> (114)_10 = (01 11 00 10)_2 where 01 = 1, 11 = 3, 00 = 0 and 10 = 2.
With m= 4 --> (114)_10 = (0111 0010)_2 where 0111 = 7 and 0010 = 2.

So I would need a Java function that, taken one byte (which I see as a decimal), decompose the bytes in portions long m with m variable.
Right? And how can I do that?

Joel McNary wrote:3). Do you have to handle cases like m = 3? That can be interesting, because there is no guarantee that a file will have an even multiple of 3 bits. What should be done in those cases? (What would byte[]{114} and m=3 return?

With m= 3 --> (114)_10 = (01 110 010)_2 where 110 = 6 and 010 = 2.
I don't consider the last two bits because they are not a correct pattern.
Question: is it right to throw the two most significant bit or the least significant?
Is correct to not consider the "excess" bits?

Joel McNary wrote:But before you start those changes, can you get it working front-to-back for m = 8? What entropy value does it give you for that text input?

I admit that I have understood very well what result I should expect ...

Joel McNary
Bartender
Posts: 1840

First of all, I'm sorry but I'mt not speaking very well English (rather so bad), so I could be wrong to interpret and/or answer.

You are doing quite well. At the very least, we are misunderstanding each other in the same way

Ok, so you're saying that it's better not to use a byte as a key because it would create problems when then I have to change the value of m (m = 1, m = 2, etc.)?

Yes. Especially for m > 8.

So I would need a Java function that, taken one byte (which I see as a decimal), decompose the bytes in portions long m with m variable.
Right? And how can I do that?

Yes, you are understanding what the function needs to do. In order to do that, look at bitwise comparators (the single-character '|' and '&' operators). You might also find the bit shift operators useful ( '<<', '>>', and '>>>'). This took me a little bit of thought to come up with an algorithm, so don't get too discouraged it it takes you a while as well.

Question: is it right to throw the two most significant bit or the least significant?
Is correct to not consider the "excess" bits?

That I do not know; file entropy is usually expressed with m = 8, so I don't know what would be correct for m = 3. If I, personally, were to discard anything, however, it would be the least significant bits.

Alicia Perry
Ranch Hand
Posts: 66
Ok all clear. So now I have to change the procedure.

My idea:
- Transform the array buffer in a list list of elements long m. But a list of what kind? Int no, Double no, of course bytes no, uhm...
- For now I consider only the simplest case, that is, file multiple of m or I don't consider the unnecessary bits
- I switch list to calculateShannonEntropy method that does the same things he does now, but maps have the same type of list instead byte.

Joel McNary
Bartender
Posts: 1840
I would suggest using Integer as the type for your list. Integers in Java are 32 bits, and you need to handle m <= 16, so Integer will be more than enough. If you want, you could use Short (which has a length of 16 bits), but if you start debugging you could see some unexpected display results and get confused. Short is a type really only used in memory-limited applications, though, and so isn't commonly used in classroom assignments. Other than that, your next steps look correct to me.

Alicia Perry
Ranch Hand
Posts: 66

Joel McNary wrote:I would suggest using Integer as the type for your list. Integers in Java are 32 bits, and you need to handle m <= 16, so Integer will be more than enough. If you want, you could use Short (which has a length of 16 bits), but if you start debugging you could see some unexpected display results and get confused. Short is a type really only used in memory-limited applications, though, and so isn't commonly used in classroom assignments. Other than that, your next steps look correct to me.

Alicia Perry
Ranch Hand
Posts: 66
Mmm ok, I searched about bitwise operators but I don't know how to write the code.

The code would be something like:
• take 1 byte
• split the byte in sequences of length m
• put all thsese sequences in a list

• I found that:

but it return only the byte corresponding to bit in position position.
Not a sequence of bit.. I would take 2 bit (if m=2), 3 bit (if m=3), ...

And if I read bytes, then how can I consider sequences of m>8?
I change my InputStream from Byte to Integer?
I'm confused..

Alicia Perry
Ranch Hand
Posts: 66
I wrote this in pseudocode.. I don't know if it is correct, could you help me?
I consider it an integer (32 bits) as recommended:

Where:
• n is an item in the buffer, so the buffer[i] with i=0...buffer.length-1
• m is the length of the sequence I must build

• This seems to work for me for m = 1, m = 2, m = 4, m = 8 and m = 16.
In other cases not because they are not multiples of 32, so I should delete the remaining part.

Alicia Perry
Ranch Hand
Posts: 66
Help..
The problem is create a strem of short. I decide to use short because I think it's easier to manipulate single bit.

This is the cycle I think to use once I had a list (buffer) of short:

In this case my problem is create a stream of short, so create an array of short.
This is what I've done:

But I lose a byte and I don't think this is the best way to proced.

If I follow the advice to use integers instead of short I do not know how to manipulate the bits and how to create the array int[] buffer from file.
I don't think that using a loop like this may be fine.

So not sure how continue..I know it is easy but I'm in trouble.

Thanks

 Always! Wait. Never. Shut up. Look at this tiny ad. Rocket Oven Kickstarter - from the trailboss https://coderanch.com/t/695773/Rocket-Oven-Kickstarter-trailboss