
# Find entropy of dictionary

Alicia Perry
Ranch Hand
Posts: 66
I have a trie data structure that stores a set of English words. For example, given these (nonsense) words, the dictionary is this:

aa abc aids aimed ami amo b browne brownfield brownie browser brut
butcher casa cash cicca ciccio cicelies cicero cigar ciste conv cony
crumply diarca diarchial dort eserine excursus foul gawkishly he
insomniac mehuman occluding poverty pseud rumina skip slagging
socklessness unensured waspily yes yoga z zoo

The nodes in blue are those at which a word ends.

In each node I saved:
• the character that it represents
• the level at which the node is located
• a counter that indicates how many words "pass" through that node
• the entropy of the next character of the node.

I would like to find the entropy for each level of the tree and the total entropy of the dictionary.

This is a piece of the TrieNode class, which represents a single node:

And this is a piece of class Trie with some methods to manipulate the trie data structure:
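In outline, the two classes look something like this (a simplified sketch showing only the essential fields; the real classes carry a few extra helpers):

```java
import java.util.LinkedList;

// A single node of the trie (simplified sketch).
class TrieNode {
    char content;                 // the character this node represents
    int level;                    // depth in the trie (root = 0)
    boolean isEnd;                // true if a word ends at this node
    double count;                 // how many words pass through this node
    double entropyNextChar;       // entropy of the next character
    LinkedList<TrieNode> childList = new LinkedList<>();

    TrieNode(char content, int level) {
        this.content = content;
        this.level = level;
    }

    // Returns the child that holds c, or null if there is none.
    TrieNode subNode(char c) {
        for (TrieNode child : childList)
            if (child.content == c) return child;
        return null;
    }
}

// The trie itself (simplified sketch).
class Trie {
    TrieNode root = new TrieNode('*', 0);   // '*' marks the root

    // Inserts a word, updating the per-node word counters.
    void insert(String word) {
        TrieNode current = root;
        current.count++;
        for (int i = 0; i < word.length(); i++) {
            TrieNode child = current.subNode(word.charAt(i));
            if (child == null) {
                child = new TrieNode(word.charAt(i), i + 1);
                current.childList.add(child);
            }
            child.count++;
            current = child;
        }
        current.isEnd = true;     // a word finishes here
    }
}
```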

Now I would like to find the entropy of each level of the tree.

To do this I created the following method, which builds two data structures (level and levelArray).
levelArray is an array of doubles containing the result; that is, index i holds the entropy of level i.
level is an ArrayList of ArrayLists; each inner list contains the weighted entropy of each node at that level.
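The per-node quantity is just the Shannon entropy of the node's child distribution; weighted by how many words pass through the node, these are the values that get summed per level. In isolation, the per-node helper looks roughly like this (sketch):

```java
// Shannon entropy (base 2) of a distribution given as raw counts:
// counts[i] = number of words passing through child i of a node.
class EntropyHelper {
    static double entropy(long[] counts) {
        long total = 0;
        for (long c : counts) total += c;
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);   // subtract p * log2(p)
        }
        return h;
    }
}
```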

If I run this method on the sample dictionary I get:

[lev 1] 10.355154029112995
[lev 2] 0.6557748405420764
[lev 3] 0.2127659574468085
[lev 4] 0.23925771271992619
[lev 5] 0.17744361708265158
[lev 6] 0.0
[lev 7] 0.0
[lev 8] 0.0
[lev 9] 0.0
[lev 10] 0.0
[lev 11] 0.0
[lev 12] 0.0

The problem is that I can't tell whether the result is plausible or completely wrong.

Another problem: I would like to calculate the entropy of the entire dictionary. To do so I thought I'd add up the values present in levelArray. Is that the right procedure? If I do that, I get 11.64 as the entropy of the entire dictionary.

I need some advice. I'm not asking for code; I just want to know whether the solutions I proposed for the two problems are correct.

The example I proposed is very simple. In reality these methods must work on a real English dictionary of about 200800 words. And if I apply these methods to that dictionary I get numbers like these (which in my opinion are excessive).

Entropy for each level:

[lev 1] 65.30073504641602
[lev 2] 44.49825655981045
[lev 3] 37.812193162250765
[lev 4] 18.24599038562219
[lev 5] 7.943507700803994
[lev 6] 4.076715421729149
[lev 7] 1.5934893456776191
[lev 8] 0.7510203704630074
[lev 9] 0.33204345165280974
[lev 10] 0.18290941591943546
[lev 11] 0.10260282173581108
[lev 12] 0.056284946780556455
[lev 13] 0.030038717136269627
[lev 14] 0.014766733727532396
[lev 15] 0.007198162552512713
[lev 16] 0.003420610593927708
[lev 17] 0.0013019239303215001
[lev 18] 5.352246905990619E-4
[lev 19] 2.1483959981088307E-4
[lev 20] 8.270156797847352E-5
[lev 21] 7.327868866691726E-5
[lev 22] 2.848394217759738E-6
[lev 23] 6.6648152186416716E-6
[lev 24] 0.0
[lev 25] 8.545182653279214E-6
[lev 26] 0.0
[lev 27] 0.0
[lev 28] 0.0
[lev 29] 0.0
[lev 30] 0.0
[lev 31] 0.0

I think they are wrong. And I can't even calculate the entropy of the entire dictionary: I think the summation takes so long that I'm not able to see the result.

So I would like to know whether the methods I have written are right.

I don't understand what is wrong.
Thanks a lot in advance

Piet Souris
Master Rancher
Posts: 2410
80
hi Alicia,

long time no see! How are you?

First time I see a problem like this, so please bear with me.

Have you tested your classes and methods on a very small dictionary, say with {'a', 'ab', 'bb', 'c', 'ca' }, where you can manually calculate some outcomes?

Second, an entropy of 0 at level 6 doesn't surprise me. If I read the picture correctly, then every node at level 6 has at most one child node, and so the entropy is:

- sum {pi * log_2(pi)} = - 1 * log_2(1) = 0

Well, it'll be tomorrow evening before I have time to have another look, so hopefully some information theorist could step in. If not, I'll be back.

Greetings,
Piet

Alicia Perry
Ranch Hand
Posts: 66
Hi Piet, all right, thanks, you? :-)

I created the dictionary you suggested and I think I've found a possible error.
I had already created simple dictionaries, but I had never noticed this error before.

Here is the figure:

The nodes on the paths {*, A} and {*, C} should have entropy 0 and not 0.5. Right? In that case there is no uncertainty, so the entropy should be 0.

Ok, I don't know how to fix it yet, but I'm "happy" to have found it (if this is indeed the error).

Piet Souris
Master Rancher
Posts: 2410
80
Hi Alicia,

I downloaded a long Wikipedia article this morning, so that’ll give me some reading for tonight.

Meanwhile, if I look at our tiny dictionary, I wonder:

Given an ‘a’ at level 1 (so in fact I have a word that starts with an ‘a’), and I ask what possible outcomes I have one level down, I see the following possibilities:

1) The ‘empty’ character, meaning end-of-word
2) ‘b’

Both with probability 0.5.

If I define my own notation, I’d say that the ‘entropy’, given an ‘a’ at level 1, is:

- { 0.5 * log_2(0.5) + 0.5 * log_2(0.5) } = - log_2(0.5) = - (-1) = 1

So H(a, 1, 1) (is entropy, given an ‘a’ at level 1 and going 1 level deep) = 1.

Now, intuitively, I’d say that the entropy of the whole dictionary, over all levels, is simply equal to

- Sum(1/N * log_2(1/N)) = - log_2(1/N), with N being the number of words, AND assuming all words are equally likely.
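As a quick sanity check: for the five-word toy dictionary {'a', 'ab', 'bb', 'c', 'ca'} above, that closed form gives -log_2(1/5) = log_2(5) ≈ 2.32 bits. In code:

```java
// Entropy of a uniform choice among n equally likely words: log2(n).
class UniformEntropy {
    static double log2(int n) {
        return Math.log(n) / Math.log(2);
    }
}
```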

Well, so far is what I had time for. Can you tell me if what I wrote makes any sense?

Alicia Perry
Ranch Hand
Posts: 66

Piet Souris wrote:Given an ‘a’ at level 1 (so in fact I have a word that starts with an ‘a’), and I ask what possible outcomes I have one level down, I see the following possibilities:

1) The ‘empty’ character, meaning end-of-word
2) ‘b’

Both with probability 0.5.

Why do both have probability 0.5? According to me, the blank " " character has probability 0 while the "b" character has probability 1. In fact, once you are at the "*"->"a" node, you can only choose the character "b"; you have no other alternatives.
In the dictionary there is no " " character, only the letters a–z, so it makes no sense to assign a probability to a character that does not exist.

Piet Souris wrote:
If I define my own notation, I’d say that the ‘entropy’, given an ‘a’ at level 1, is:

- { 0.5 * log_2(0.5) + 0.5 * log_2(0.5) } = - log_2(0.5) = - (-1) = 1

So H(a, 1, 1) (is entropy, given an ‘a’ at level 1 and going 1 level deep) = 1.

I would say that the entropy, given an "a" at level 1, is 0 because there is not uncertainty as to which character to choose, since I can only choose "b".

If the "*"->"a" node had two children equally likely, then entropy would be 1.

Winston Gutkowski
Bartender
Posts: 10575
66

Alicia Perry wrote:I have a trie data structure that stores a sequence of English words.
In each node I saved:

• the character that it represents
• the level at which the node is located
• a counter that indicates how many words "pass" for that node
• the entropy of the next character of the node.

Actually, you've stored a bit more than that, and I wonder if some of the things that you store aren't redundant.

Also: DON'T make instance variables public. EVER.

Now, bear in mind that I'm not an expert on tries, so I'm going somewhat by 'intuition' here; but let me try to explain myself:

You have a Trie class that contains a set of TrieNodes; each of which represents the placement of one letter in a word, based on words that the Trie has been supplied with so far. So, to me, the Trie itself IS the "dictionary". It just isn't stored as an array of words.

If I'm right, then empirically, a TrieNode needs only three things:
1. The letter it represents.
2. A list of letters (ie nodes) that come after it in any word that has been processed.
3. Something that indicates whether it represents the "end" of a word in the Trie, whether or not there are any longer words with the same prefix.
This could be done either with a flag or, as Piet appears to suggest, with some special "null" node included in the "child" list. I have a slight preference for the first (ie, the way you've done it), but there may be reasons for doing it as Piet suggests that I don't know.

There may also be good reasons for storing other values (eg, efficiency), but I doubt they're actually needed to represent the structure.

Now let me take the values you have one-by-one:
• char content; - Absolutely required.
• boolean isEnd; - Looks fine to me.
• double count; - This strikes me as a "Trie" value, not one that needs to be stored in a TrieNode (more on that below).
• LinkedList<TrieNode> childList; - Absolutely required - although I think I might make it an ArrayList to save space.
• public String path = ""; - This looks a derived value; probably part of a search; so I wouldn't include it.
• double entropyNextChar; - Like 'count' above, this looks like a "Trie" value to me.
• int level; - Possibly needed, but I suspect not. A 'maxDepth' value in your Trie class might be a very good idea however.

So, assuming you've removed 'count' and 'entropy' and the rest from your TrieNode, how do you store them?

Well, Java has a very useful structure called an IdentityHashMap, that allows you to quickly retrieve values for specific objects. So, you could either set up one for each value, or you could create a NodeInfo class that contains all "non-essential" information about a TrieNode, and store that in an IdentityHashMap<TrieNode, NodeInfo>.

Then, whenever you encounter a TrieNode and you want to know its "count", you simply look it up in the Map.
Some of these values (eg, 'count') could also probably be calculated while the Trie is being built; although I have to admit that it "smells" like a value that might be better calculated via a method as required - but TBH I don't know.
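A minimal sketch of that idea (the NodeInfo fields here are only examples):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Holder for "non-essential" per-node data (example fields).
class NodeInfo {
    double count;
    double entropyNextChar;
}

class NodeInfoTable {
    // IdentityHashMap compares keys with ==, not equals(), so each distinct
    // node object gets its own entry even if two nodes "look" the same.
    static final Map<Object, NodeInfo> INFO = new IdentityHashMap<>();

    static NodeInfo infoFor(Object node) {
        return INFO.computeIfAbsent(node, k -> new NodeInfo());
    }
}
```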

Hope it helps ... and also that if I'm wrong about anything, someone flames me.

Winston

Winston Gutkowski
Bartender
Posts: 10575
66

Winston Gutkowski wrote:If I'm right, then empirically, a TrieNode needs only three things:

Actually, if I'm being really picky, it probably only needs two: - content, and childList - since 'isEnd' is related more to the Trie containing the node.

The only thing it's really needed for is if you want to convert the structure back to an array or List of "words", or check that a supplied search string is, in fact, an actual word in the dictionary (it could just be a prefix). It doesn't really have any meaning to an individual node, though.

Winston

Piet Souris
Master Rancher
Posts: 2410
80
@Alicia

Not sure about your last reply. The way I look at it is what I already wrote. Given that I have an 'a' at level 1, then according to me there are two possibilities: either we're dealing with the word 'a' or with the word 'ab'. Two possibilities, therefore, with an entropy of 1. Now, in order to get that entropy, we must calculate the p for every child, taking that child's 'count' into consideration.
And that's why I talked about a special kind of null character, so that the formula would only have to look at the children, and not also inspect the 'isEnd' variable.
(@Winston: that was my sole intention for such an element.)

But we do not necessarily have to invent some special character; we could also have one (and only one) special TrieNode, indicating an 'end-of-word'.

Well, it is just a thought.

Now, before looking at what data structure would be handy, let us first look at what kind of calculations we need to perform.

And by the look of it, the Trie class looks suitable to me.

So, given a character at some level, the chances would be: for each child j: j.childlist.size() / the sum of all the childlist sizes (with possibly a correction depending on whether 'isEnd' is used or not). From this, the entropy can be deduced easily.

It therefore makes sense to have the variable 'count' as a member of a node, to avoid this summing at all the nodes.

But this is a calculation one level deep. What is meant by 'the entropy of the whole dictionary'? And is this question also possible: given an 'a', what is its entropy (so, irrespective of the level of the 'a')?

@Alicia: can you inform us about what kind of questions are to be expected? And how do you define the entropy of the whole dictionary? That Wikipedia article was not clear on that aspect.

Winston Gutkowski
Bartender
Posts: 10575
66

Winston Gutkowski wrote:Actually, if I'm being really picky, it probably only needs two: - content, and childList...

I had a quick look at the Wiki pages on tries yesterday, and they seem to bear me out on this.

In fact, they include an even "simpler" structure where, instead of a List, each Node points to its "first" child, and that child may have "siblings".
Which suggests a structure like this, using a few words from your dictionary:
I have to admit, I like its elegance, but it does mean having to write all the "maintenance" logic yourself.
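A sketch of what a node in that variant might look like (the names are just illustrative):

```java
// "First child / next sibling" trie node: two links replace the child list.
class LcrsNode {
    char content;
    boolean isEnd;
    LcrsNode firstChild;    // first letter that can follow this one
    LcrsNode nextSibling;   // next alternative letter at the same position

    LcrsNode(char content) { this.content = content; }

    // Finds the child holding c, appending a new one if it is absent.
    LcrsNode childFor(char c) {
        LcrsNode prev = null;
        for (LcrsNode n = firstChild; n != null; n = n.nextSibling) {
            if (n.content == c) return n;
            prev = n;
        }
        LcrsNode created = new LcrsNode(c);
        if (prev == null) firstChild = created;
        else prev.nextSibling = created;
        return created;
    }
}
```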

The advantage of using a List is that you can use its methods to, for example, find the child you want, or add or delete one; so my advice would be to stick to the way you're doing it right now.
I just thought you might be interested in an alternative way of doing things.

HIH

Winston

Piet Souris
Master Rancher
Posts: 2410
80
Alicia IS using a List, if I am correct.

Anyway, I've written my own version, according to what I understand of it.
I think it is about the same as Alicia's version, but I added my own TrieEndOfWord class, to see if it works.

Alicia, I think your methods are basically correct. Can you send your complete code (well, the relevant parts) so that we can test it?

Here is my version, with, given the input, the following output:

Finished! Entropies are:
Trienode: char = *, count = 6, level = 0, children = 2, entropy = 0.918296
Trienode: char = a, count = 4, level = 1, children = 3, entropy = 1.500000
Trienode: char = , count = 1, level = 2, children = 0, entropy = -0.000000
Trienode: char = b, count = 2, level = 2, children = 2, entropy = 1.000000
Trienode: char = , count = 1, level = 3, children = 0, entropy = -0.000000
Trienode: char = x, count = 1, level = 3, children = 1, entropy = -0.000000
Trienode: char = , count = 1, level = 4, children = 0, entropy = -0.000000
Trienode: char = c, count = 1, level = 2, children = 1, entropy = -0.000000
Trienode: char = d, count = 1, level = 3, children = 1, entropy = -0.000000
Trienode: char = , count = 1, level = 4, children = 0, entropy = -0.000000
Trienode: char = b, count = 2, level = 1, children = 2, entropy = 1.000000
Trienode: char = p, count = 1, level = 2, children = 1, entropy = -0.000000
Trienode: char = , count = 1, level = 3, children = 0, entropy = -0.000000
Trienode: char = q, count = 1, level = 2, children = 1, entropy = -0.000000
Trienode: char = , count = 1, level = 3, children = 0, entropy = -0.000000
BUILD SUCCESSFUL (total time: 0 seconds)

Code:

Alicia Perry
Ranch Hand
Posts: 66
Thank you both for your help and I apologize for the delay in responding.

Winston Gutkowski wrote:Actually, you've stored a bit more than that, and I wonder if some of the things that you store aren't redundant.

True, but honestly in this case I don't care: it's not really a programming exercise; the important thing is to find the correct results. Let's say it is more of a "theoretical" exercise.

Winston Gutkowski wrote:Also: DON'T make instance variables public. EVER.

You're absolutely right, but for speed I preferred to do it that way.

Winston Gutkowski wrote:Well, Java has a very useful structure called an IdentityHashMap, that allows you to quickly retrieve values for specific objects. So, you could either set up one for each value, or you could create a NodeInfo class that contains all "non-essential" information about a TrieNode, and store that in an IdentityHashMap<TrieNode, NodeInfo>.

Then, whenever you encounter a TrieNode and you want to know its "count", you simply look it up in the Map.
Some of these values (eg, 'count') could also probably be calculated while the Trie is being built; although I have to admit that it "smells" like a value that might be better calculated via a method as required - but TBH I don't know.

Probably I didn't use the most efficient data structure but, as I said before, the important thing is to find the correct results, and the trie data structure still allows me to work on the dictionary.

Piet Souris wrote:
Not sure about your last reply. The way I look at it is what I already wrote. Given that I have an 'a' at level 1, then according to me there are two possibilities: either we're dealing with the word 'a' or we are dealing with the word 'ab'. Two possibilities therefore, with an entropy of 1. Now, in order to get that entropy, we must calculate the p for every child, taking that childs 'count' into consideration.
And that's why I talked about a special kind of null-character, so that the formula would only have to look at the children, and not also to inspect the 'isEnd' variable.

I think you're right, you've convinced me.

Winston Gutkowski wrote:
In fact, they include an even "simpler" structure:

where, instead of a List, each Node points to its "first" child; but that child may have "siblings".

Thanks for the alternative, but I prefer to use my data structure. For me it's easier, and I don't have much time to rewrite all the code.

Piet Souris wrote:Anyway, I've written my own version, according to what I understand of it.
I think it is about the same as Alicia's version, but I added my own TrieEndOfWord class, to see if it works

Thanks. I'll now look at your code carefully.

By the entropy of the dictionary I mean calculating it via the chain rule:
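In standard notation, the chain rule reads:

```latex
H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})
```

(with X_i the i-th character of a word, and the sum running over character positions).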

Piet Souris
Master Rancher
Posts: 2410
80
hi Alicia,

I know my chain rules, but that is not my main problem. My problem is this:

What I read in some Wikipedia articles is that if we have some stochastic variable X, with possible outcomes {Ui} and probabilities {Pi}, then the 'entropy' H(X) is simply equal to -Sum(Pi * log(Pi)).
Okay, I understand that.

But underlying any random variable there is a so-called 'probability space', where we are given the basic outcomes ("Sigma"), a set of events, and some function P. Et cetera; well, the classical definitions of probability theory.

My question is: given this dictionary, what is my "Sigma", what "events" do I have, and what random variable are we talking about?

What I have done, in my example, was just code all the words into some "handy" data structure, and what I defined was: "given some Node, what possible Nodes can I go to from there". And my 'X' followed suit.
For my 'entropy' of the whole dictionary, it simply boiled down to "select a word, what are the possible beginning letters, with weighted probabilities", and calculating the 'H'.

However, I can't imagine that this is any useful information. Therefore my question: what 'random variable' are we dealing with? If that is too much to explain in short, can you point me to some articles about this?

And, finally, can you send me your work so far? I can then test what you have; that will greatly help me understand your code much better.

Greetings,
Piet

Alicia Perry
Ranch Hand
Posts: 66
Good news, now it all works! I corrected it using Piet's advice.
That is, the tree has the "empty" nodes to indicate the end of a word, and then the entropy of each node changes.

Thank you so much :-)

Piet Souris
Master Rancher
Posts: 2410
80
hi Alicia,

well, you seem to be more enthusiastic about the usefulness of our help than I am, but glad we could help you!

Meanwhile, I've got some more thinking to do about this 'entropy' thingy...

Greetings,
Piet