# Find entropy of dictionary

posted 10 months ago

I have a trie data structure that stores a sequence of English words. For example, given these (partly nonsense) words, the dictionary looks like this:

The nodes in blue are those in which a word is finished.

In each node I saved:

- the character that it represents
- the level at which the node is located
- a counter that indicates how many words "pass" through that node
- the entropy of the next character of the node

I would like to find the entropy of each level of the tree and the total entropy of the dictionary.

This is a piece of the class *TrieNode*, which represents a single node:

And this is a piece of the class *Trie*, with some methods to manipulate the trie data structure:
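The actual class code did not survive in this copy of the thread. Based on the fields discussed later in the thread (`content`, `isEnd`, `count`, `childList`, `level`, `entropyNextChar`), the *TrieNode* presumably looked roughly like this; a sketch, not the original code:

```java
import java.util.LinkedList;

// Sketch of the TrieNode discussed in this thread. Field names are taken
// from the later posts; the subNode helper is illustrative only.
class TrieNode {
    char content;                   // the character this node represents
    boolean isEnd;                  // true if a word ends at this node
    double count;                   // how many words "pass" through this node
    int level;                      // depth of this node in the trie
    double entropyNextChar;         // entropy of the next character
    LinkedList<TrieNode> childList; // children: the possible next characters

    TrieNode(char content, int level) {
        this.content = content;
        this.level = level;
        this.childList = new LinkedList<>();
    }

    // Find the child that holds character c, or null if there is none.
    TrieNode subNode(char c) {
        for (TrieNode child : childList) {
            if (child.content == c) return child;
        }
        return null;
    }
}
```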

Now I would like to find the entropy of each level of the tree.

To do this I created the following method, which builds two data structures (*level* and *levelArray*). *levelArray* is an array of doubles containing the result: at index *i* it holds the entropy of level *i*. *level* is an ArrayList of ArrayLists; each inner list contains the weighted entropy of each node at that level.

If I run this method on the sample dictionary I get:

[lev 1] 10.355154029112995

[lev 2] 0.6557748405420764

[lev 3] 0.2127659574468085

[lev 4] 0.23925771271992619

[lev 5] 0.17744361708265158

[lev 6] 0.0

[lev 7] 0.0

[lev 8] 0.0

[lev 9] 0.0

[lev 10] 0.0

[lev 11] 0.0

[lev 12] 0.0

The problem is that I don't understand whether the result is plausible or completely wrong.

Another problem: I would like to calculate the entropy of the entire dictionary. To do so I thought I'd add up the values present in levelArray. Is a procedure like that right? If I do that, I obtain 11.64 as the entropy of the entire dictionary.
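One quick way to sanity-check values like these (an editorial aside, not from the thread): the Shannon entropy of a single-character distribution can never exceed log2 of the alphabet size, about 4.75 bits for a-z plus an end-of-word marker. A small helper (hypothetical name) makes the bound easy to check:

```java
public class EntropyCheck {
    // Shannon entropy (base 2) of a distribution given by raw counts.
    // For n possible outcomes the result is bounded by log2(n).
    public static double entropy(int[] counts) {
        double total = 0.0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Two equally likely children -> exactly 1 bit.
        System.out.println(entropy(new int[]{1, 1}));
        // A node with a single child -> 0 bits (no uncertainty).
        System.out.println(entropy(new int[]{5}));
        // Upper bound for 27 symbols (a-z plus end-of-word):
        System.out.println(Math.log(27) / Math.log(2));
    }
}
```

By this bound, level values like 10.36 (or 65.30 on the full dictionary) cannot be plain Shannon entropies of a single-character distribution, which suggests the per-node weights are not being normalised before summing.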

I need some advice. I don't need code; I would just like to know whether the solutions I proposed for the two problems are correct.

The example I proposed is very simple. In reality these methods must work on a real English dictionary of about 200800 words. And if I apply these methods to that dictionary I get numbers like these (in my opinion they are excessive).

Entropy for each level:

[lev 1] 65.30073504641602

[lev 2] 44.49825655981045

[lev 3] 37.812193162250765

[lev 4] 18.24599038562219

[lev 5] 7.943507700803994

[lev 6] 4.076715421729149

[lev 7] 1.5934893456776191

[lev 8] 0.7510203704630074

[lev 9] 0.33204345165280974

[lev 10] 0.18290941591943546

[lev 11] 0.10260282173581108

[lev 12] 0.056284946780556455

[lev 13] 0.030038717136269627

[lev 14] 0.014766733727532396

[lev 15] 0.007198162552512713

[lev 16] 0.003420610593927708

[lev 17] 0.0013019239303215001

[lev 18] 5.352246905990619E-4

[lev 19] 2.1483959981088307E-4

[lev 20] 8.270156797847352E-5

[lev 21] 7.327868866691726E-5

[lev 22] 2.848394217759738E-6

[lev 23] 6.6648152186416716E-6

[lev 24] 0.0

[lev 25] 8.545182653279214E-6

[lev 26] 0.0

[lev 27] 0.0

[lev 28] 0.0

[lev 29] 0.0

[lev 30] 0.0

[lev 31] 0.0

I think they are wrong. And the entropy of the entire dictionary cannot be calculated: the sum is such a long process that I am not able to see the result.

For this reason I would like to know whether the methods I have written are right.

I don't understand what is wrong.

Thanks a lot in advance

`aa abc aids aimed ami amo b browne brownfield brownie browser brut butcher casa cash cicca ciccio cicelies cicero cigar ciste conv cony crumply diarca diarchial dort eserine excursus foul gawkishly he insomniac mehuman occluding poverty pseud rumina skip slagging socklessness unensured waspily yes yoga z zoo`


Piet Souris

Rancher

Posts: 1635

36

posted 10 months ago

hi Alicia,

long time no see! How are you?

First time I see a problem like this, so please bear with me.

Have you tested your classes and methods on a very small dictionary, say with {'a', 'ab', 'bb', 'c', 'ca' }, where you can manually calculate some outcomes?

Second, an entropy of 0 at level 6 doesn't surprise me. If I read the picture correctly, then every node at level 6 has at most one child node, and so the entropy is:

- sum {pi * log_2(pi)} = - 1 * log_2(1) = 0

Well, it'll be tomorrow evening before I have time to have another look, so hopefully some information theorist could step in. If not, I'll be back.

Greetings,

Piet


posted 10 months ago

Hi Piet, all right, thanks, you? :-)

I created the dictionary you suggested and I think I noticed a possible error.

I had already created simple dictionaries before, but I had never noticed this error.

Here is the figure:

The nodes on the paths {*, A} and {*, C} should have entropy 0, not 0.5. Right? In that case there is no uncertainty, so the entropy should be 0.

Ok, I don't know how to fix it, but I'm "happy" to have found it (if this is the error).


Piet Souris

Rancher


posted 10 months ago

Hi Alicia,

I downloaded a long Wikipedia article this morning, so that’ll give me some reading for tonight.

Meanwhile, if I look at our tiny dictionary, I wonder:

Given an ‘a’ at level 1 (so in fact I have a word that starts with an ‘a’), and I ask what possible outcomes I have one level down, I see the following possibilities:

1) The ‘empty’ character, meaning end-of-word

2) ‘b’

Both with probability 0.5.

If I define my own notation, I’d say that the ‘entropy’, given an ‘a’ at level 1, is:

- { 0.5 * log_2(0.5) + 0.5 * log_2(0.5) } = - log_2(0.5) = - (-1) = 1

So H(a, 1, 1) (the entropy, given an ‘a’ at level 1 and going 1 level deep) = 1.

Now, intuitively, I’d say that the entropy of the whole dictionary, over all levels, is simply equal to

- Sum { 1/N * log_2(1/N) } = - log_2(1/N), with N being the number of words, AND assuming all words are equally likely.
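Under that assumption the sum collapses to log2(N); for the five-word example {a, ab, bb, c, ca} this gives about 2.32 bits (a worked check, not from the original post):

```java
public class WholeDictEntropy {
    public static void main(String[] args) {
        int n = 5;  // number of words in the tiny dictionary {a, ab, bb, c, ca}
        // -sum over N words of (1/N) * log2(1/N) = log2(N)
        // when all words are equally likely.
        double h = Math.log(n) / Math.log(2);
        System.out.println(h);  // about 2.32 bits
    }
}
```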

Well, so far is what I had time for. Can you tell me if what I wrote makes any sense?


posted 10 months ago


Piet Souris wrote: Given an ‘a’ at level 1 (so in fact I have a word that starts with an ‘a’), and I ask what possible outcomes I have one level down, I see the following possibilities:

1) The ‘empty’ character, meaning end-of-word

2) ‘b’

Both with probability 0.5.

Why do both have probability 0.5? According to me, the blank " " character has probability 0, while the "b" character has probability 1. In fact, once you are at the "*"->"a" node, you can only choose the character "b"; you have no other alternatives.

In the dictionary there is no " " character, only a-z letters, so it makes no sense to assign a probability to a character that does not exist.

Piet Souris wrote:

If I define my own notation, I’d say that the ‘entropy’, given an ‘a’ at level 1, is:

- { 0.5 * log_2(0.5) + 0.5 * log_2(0.5) } = - log_2(0.5) = - (-1) = 1

So H(a, 1, 1) (is entropy, given an ‘a’ at level 1 and going 1 level deep) = 1.

I would say that the entropy, given an "a" at level 1, is 0, because there is no uncertainty as to which character to choose, since I can only choose "b".

If the "*"->"a" node had two equally likely children, then the entropy would be 1.

posted 10 months ago


Alicia Perry wrote: I have a trie data structure that stores a sequence of English words. In each node I saved: the character that it represents, the level at which the node is located, a counter that indicates how many words "pass" through that node, and the entropy of the next character of the node.

Actually, you've stored a bit more than that, and I wonder if some of the things that you store aren't redundant.

Also: DON'T make instance variables `public`. *EVER*.

Now, bear in mind that I'm not an expert on tries, so I'm going somewhat by 'intuition' here; but let me try to explain myself:

You have a `Trie` class that contains a set of `TrieNode`s, each of which represents the placement of __one__ letter in a word, based on the words that the `Trie` has been supplied with so far. So, to me, the `Trie` itself IS the "dictionary". It just isn't stored as an array of words.

If I'm right, then empirically, a `TrieNode` needs only three things:

1. The letter it represents.

2. A list of letters (ie nodes) that come *after* it in any word that has been processed.

3. Something that indicates whether it represents the "end" of a word in the `Trie`, whether or not there are any *longer* words with the same prefix.

This could be done either with a flag or, as Piet appears to suggest, with some special "null" node included in the "child" list. I have a slight preference for the first (ie, the way you've done it), but there may be reasons for doing it as Piet suggests that I don't know.

There may also be good reasons for storing other values (eg, efficiency), but I doubt they're actually needed to represent *the structure*.

Now let me take the values you have one-by-one:

`char content;` - Absolutely required.

`boolean isEnd;` - Looks fine to me.

`double count;` - This strikes me as a "Trie" value, not one that needs to be stored in a `TrieNode` (more on that below).

`LinkedList<TrieNode> childList;` - Absolutely required - although I think I might make it an `ArrayList` to save space.

`public String path = "";` - This looks like a derived value, probably part of a search, so I wouldn't include it.

`double entropyNextChar;` - Like `count` above, this looks like a "Trie" value to me.

`int level;` - Possibly needed, but I suspect not. A `maxDepth` value in your `Trie` class might be a very good idea however.

So, assuming you've removed `count` and `entropy` and the rest from your `TrieNode`, how do you store them?

Well, Java has a very useful structure called an `IdentityHashMap`, which allows you to quickly retrieve values for specific objects. So, you could either set up one for each value, or you could create a `NodeInfo` class that contains all "non-essential" information about a `TrieNode`, and store *that* in an `IdentityHashMap<TrieNode, NodeInfo>`.

Then, whenever you encounter a `TrieNode` and you want to know its "count", you simply look it up in the Map.
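A minimal sketch of that idea, assuming a bare `TrieNode` stub (the `NodeInfo` name comes from the post; `TrieStats` and `infoFor` are hypothetical names):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Bare stub standing in for the thread's TrieNode.
class TrieNode {
    char content;
}

// Winston's suggestion: keep "non-essential" per-node data outside the
// node itself, keyed by object identity.
class NodeInfo {
    double count;           // words passing through the node
    double entropyNextChar; // entropy of the next character
}

class TrieStats {
    // IdentityHashMap compares keys with ==, not equals(), which is
    // exactly right for looking up distinct node objects.
    private final Map<TrieNode, NodeInfo> info = new IdentityHashMap<>();

    NodeInfo infoFor(TrieNode node) {
        return info.computeIfAbsent(node, n -> new NodeInfo());
    }
}
```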

Some of these values (eg, `count`) could also probably be calculated *while the `Trie` is being built*; although I have to admit that it "smells" like a value that might be better calculated via a method as required - but TBH I don't know.

Hope it helps ... and also that if I'm wrong about anything, someone flames me.

Winston

"Leadership is nature's way of removing morons from the productive flow" - Dogbert

Articles by Winston can be found here

posted 10 months ago


Winston Gutkowski wrote: If I'm right, then empirically, a TrieNode needs only three things: ...

Actually, if I'm being *really* picky, it probably only needs two: `content` and `childList` - since `isEnd` is related more to the `Trie` *containing* the node.

The only thing it's really needed for is if you want to convert the *structure* back to an array or `List` of "words", or to check that a supplied search string is, in fact, an actual *word* in the dictionary (it could just be a prefix). It really doesn't have any meaning to an individual node though.

Winston

Articles by Winston can be found here

Piet Souris

Rancher


posted 10 months ago

@Alicia

Not sure about your last reply. The way I look at it is what I already wrote. Given that I have an 'a' at level 1, then according to me there are two possibilities: either we're dealing with the word 'a' or we're dealing with the word 'ab'. Two possibilities, therefore, with an entropy of 1. Now, in order to get that entropy, we must calculate the p for every child, taking that child's 'count' into consideration.

And that's why I talked about a special kind of null character, so that the formula would only have to look at the children, and not also inspect the 'isEnd' variable.

(@Winston: that was my sole intention for such an element.)

But we do not necessarily have to invent some special character; we could also have one (and only one) special TrieNode, indicating an 'end-of-word'.

Well, it is just a thought.

Now, before looking at what data structure would be handy, let us first look at what kind of calculations we need to perform.

And by the look of it, the Trie class looks suitable to me.

So, given a character at some level, the chances would be: for each child j: j.childList.size() / sum of all child-list sizes (with possibly a correction depending on whether 'isEnd' is used). From this, the entropy can be deduced easily.

It therefore makes sense to have the variable 'count' as a member of a node, to avoid this summing at all the nodes.

But this is a 1-level-deep calculation. What is meant by 'the entropy of the whole dictionary'? And is this question also possible: given an 'a', what is its entropy (so, irrespective of the level of the 'a')?

@Alicia: can you tell us what kind of questions are to be expected? And how do you define the entropy of the whole dictionary? That Wikipedia article was not clear on that aspect.
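A sketch of that per-node calculation (count-based, with the end-of-word pseudo-child treated as just another entry; the names here are illustrative, not from the thread):

```java
import java.util.List;

public class ChildEntropy {
    // Entropy of the next character at a node: each child's probability
    // is its count divided by the parent's total, where a word ending at
    // the node is modelled as one extra "end-of-word" pseudo-child.
    public static double nodeEntropy(List<Integer> childCounts) {
        double total = 0.0;
        for (int c : childCounts) total += c;
        double h = 0.0;
        for (int c : childCounts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Three children with counts 1, 2 and 1 (one of them may be the
        // end-of-word pseudo-child): p = 1/4, 1/2, 1/4 -> 1.5 bits.
        System.out.println(nodeEntropy(List.of(1, 2, 1)));  // 1.5
    }
}
```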


posted 10 months ago


Winston Gutkowski wrote: Actually, if I'm being *really* picky, it probably only needs two: `content` and `childList` ...

I had a quick look at the Wiki pages on tries yesterday, and they seem to bear me out on this.

In fact, they include an even "simpler" structure where, instead of a List, each Node points to its "first" child; but that child may have "siblings".

Which suggests a structure like this, using a few words from your dictionary:

I have to admit, I like its elegance, but it does mean having to write all the "maintenance" logic yourself.

The advantage of using a `List` is that you can use *its* methods to, for example, find the child you want, or add or delete one; so my advice would be to stick with the way you're doing it right now.
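The first-child/next-sibling variant described above could be sketched like this (illustrative only; the original post showed it as a diagram):

```java
// Sketch of a first-child/next-sibling trie node, the "simpler" structure
// from the Wikipedia article: instead of keeping a child list, each node
// points to its first child, and children at the same position are
// chained through a sibling pointer.
class SiblingTrieNode {
    char content;
    SiblingTrieNode firstChild;   // first letter that can follow this one
    SiblingTrieNode nextSibling;  // next alternative letter at this position

    SiblingTrieNode(char content) {
        this.content = content;
    }

    // Find the child holding c by walking the sibling chain.
    SiblingTrieNode findChild(char c) {
        for (SiblingTrieNode n = firstChild; n != null; n = n.nextSibling) {
            if (n.content == c) return n;
        }
        return null;
    }
}
```

The trade-off is exactly the one noted above: the chained form is compact, but every list-like operation (find, add, delete) has to be hand-written.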

I just thought you might be interested in an alternative way of doing things.

HIH

Winston

Articles by Winston can be found here

Piet Souris

Rancher


posted 10 months ago

Alicia IS using a List, if I am correct.


Anyway, I've written my own version, according to what I understand of it.

I think it is close to Alicia's version, but I added my own 'TrieEndOfWord' class, to see if it works.

Alicia, I think your methods are basically correct. Can you send your complete code (well, the relevant parts) so that we can test it?

Here is my version, with, given the input, the following output:

`Finished! Entropies are:`

Trienode: char = *, count = 6, level = 0, children = 2, entropy = 0.918296

Trienode: char = a, count = 4, level = 1, children = 3, entropy = 1.500000

Trienode: char = , count = 1, level = 2, children = 0, entropy = -0.000000

Trienode: char = b, count = 2, level = 2, children = 2, entropy = 1.000000

Trienode: char = , count = 1, level = 3, children = 0, entropy = -0.000000

Trienode: char = x, count = 1, level = 3, children = 1, entropy = -0.000000

Trienode: char = , count = 1, level = 4, children = 0, entropy = -0.000000

Trienode: char = c, count = 1, level = 2, children = 1, entropy = -0.000000

Trienode: char = d, count = 1, level = 3, children = 1, entropy = -0.000000

Trienode: char = , count = 1, level = 4, children = 0, entropy = -0.000000

Trienode: char = b, count = 2, level = 1, children = 2, entropy = 1.000000

Trienode: char = p, count = 1, level = 2, children = 1, entropy = -0.000000

Trienode: char = , count = 1, level = 3, children = 0, entropy = -0.000000

Trienode: char = q, count = 1, level = 2, children = 1, entropy = -0.000000

Trienode: char = , count = 1, level = 3, children = 0, entropy = -0.000000

BUILD SUCCESSFUL (total time: 0 seconds)

Code:

posted 10 months ago

Thank you both for your help and I apologize for the delay in responding.


Winston Gutkowski wrote: Actually, you've stored a bit more than that, and I wonder if some of the things that you store aren't redundant.

True, but honestly in this case I don't care: it's not really a programming exercise; the important thing is to find the correct results. Let's say it is more of a "theoretical" exercise.

Winston Gutkowski wrote: Also: DON'T make instance variables `public`. *EVER*.

You're absolutely right, but for speed reasons I preferred to do it that way.

Winston Gutkowski wrote: Well, Java has a very useful structure called an `IdentityHashMap`, which allows you to quickly retrieve values for specific objects. So, you could either set up one for each value, or you could create a `NodeInfo` class that contains all "non-essential" information about a `TrieNode`, and store *that* in an `IdentityHashMap<TrieNode, NodeInfo>`. Then, whenever you encounter a `TrieNode` and you want to know its "count", you simply look it up in the Map. Some of these values (eg, `count`) could also probably be calculated *while the `Trie` is being built*; although I have to admit that it "smells" like a value that might be better calculated via a method as required - but TBH I don't know.

Probably I didn't use the most efficient data structure but, as I said before, the important thing is to find the correct results, and the trie data structure still allows me to work on the dictionary.

Piet Souris wrote:

Not sure about your last reply. The way I look at it is what I already wrote. Given that I have an 'a' at level 1, then according to me there are two possibilities: either we're dealing with the word 'a' or we are dealing with the word 'ab'. Two possibilities therefore, with an entropy of 1. Now, in order to get that entropy, we must calculate the p for every child, taking that childs 'count' into consideration.

And that's why I talked about a special kind of null-character, so that the formula would only have to look at the children, and not also to inspect the 'isEnd' variable.

I think you're right, you've convinced me.

Winston Gutkowski wrote:

In fact, they include an even "simpler" structure:

where, instead of a List, each Node points to its "first" child; but that child may have "siblings".

Thanks for the alternative, but I prefer to use my data structure. For me it's easier, and I don't have much time to rewrite all the code.

Piet Souris wrote: Anyway, I've written my own version, according to what I understand of it. I think it is close to Alicia's version, but I added my own 'TrieEndOfWord' class, to see if it works.

Thanks. Now I'll look at your code calmly.

By *entropy of the dictionary* I mean calculating it with the chain rule:
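Presumably the chain rule meant here is the standard decomposition of joint entropy, where X_i is the i-th character of a word (a standard identity, stated for completeness):

```latex
H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})
```

Summing properly weighted per-level *conditional* entropies would then correspond to summing the terms on the right-hand side.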
Piet Souris

Rancher


posted 10 months ago

hi Alicia,

I know my chaining rules, but that is not my main problem. My problem is this:

What I read in some Wikipedia articles is that if we have some stochastic variable X, with possible outcomes {Ui} and chances {Pi}, then the 'entropy' H(X) is simply equal to -Sum(Pi * log(Pi)).

Okay, I understand that.

But underlying any random variable there is a so-called 'probability space', where we are given the basic outcomes ("Sigma"), a set of events and some function P. Et cetera; well, the classical definitions of probability theory.

My question is: given this dictionary, what is my "Sigma", what "events" do I have, and what random variable are we talking about?

What I have done, in my example, was just coding all the words into some "handy" data structure, and what I defined was: "given some Node, what possible Nodes can I go to from there". And my 'X' followed suit.

For my 'entropy' of the whole dictionary, it simply boiled down to "select a word, what are the possible beginning letters, with weighted probabilities", and calculating the 'H'.

However, I can't imagine that this is any useful information. Therefore my question: what 'random variable' are we dealing with? If that is too much to explain briefly, can you point me to some articles about this?

And, finally, can you send me your work so far? I can then test what you have; that will greatly help me understand your code much better.

Greetings,

Piet
