• Post Reply Bookmark Topic Watch Topic
  • New Topic

Binary search tree with two values per node  RSS feed

 
Ranch Hand
Posts: 80
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hello,

I am looking to write a binary tree in which each node holds both a string and an int value. I want to sort via the string. The integer would strictly be used to record the number of times the string appears. Any suggestions on how to do this, or do you know of code that already exist similar to what I'm looking to do that cam be studied?

Thanks
 
Rancher
Posts: 1090
14
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Scotty Steven wrote:Hello,

I am looking to write a binary tree in which each node holds both a string and an int value. I want to sort via the string.

Thanks


You mean a 'String'? Remember Java is case-sensitive.

Scotty Steven wrote:

The integer would strictly be used to record the number of times the string appears.

Thanks


So that means the String's' can repeat and the tree can hold duplicates. You have not specified how you would handle inserts/updates/deletes of duplicate nodes. Does an equal node traverse the right subtree or does it traverse the left subtree of this node? How does a delete operation correct the count value in the affected nodes ( i.e for cases when you have count > 1 ). What is your algorithm ( let us know in plain text ) going to be?

Scotty Steven wrote:

Any suggestions on how to do this, or do you know of code that already exist similar to what I'm looking to do that cam be studied?

Thanks


What have you tried? Plus I think we require more details about your problem statement and we need to know your algorithm. Implementation comes much later.

Chan.



 
Scotty Steven
Ranch Hand
Posts: 80
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Chan wrote:

You mean a 'String'? Remember Java is case-sensitive.


Yes, I mean 'String'. I understand that Java is case-sensitive. The sentence I wrote was not in code, so I sometime fall back to my default English language which is also case-sensitive.

Chan wrote:

So that means the String's' can repeat and the tree can hold duplicates. You have not specified how you would handle inserts/updates/deletes of duplicate nodes.


No. I guess I wasn't clear enough when I said "The integer would strictly be used to record the number of times the string appears." If a duplicate String appears, instead of inserting the String that already exists, the count integer would advance one. So, the first time the word "programming" appears, it is inserted into the tree along with an integer which would hold the starting value of 1. The next time the word "programming" appears, when the search algorithm finds the duplication, it would trigger the count integer to advance to 2 and would not insert the duplicate into the tree. The compared Strings would be setup to not be case-sensitive.

Chan wrote:

Does an equal node traverse the right subtree or does it traverse the left subtree of this node? What is your algorithm ( let us know in plain text ) going to be?


Let's go with if greater than, go right. If lesser than, go left.

How does a delete operation correct the count value in the affected nodes ( i.e for cases when you have count > 1 ).


Fair enough. In this project, it will be insert only. This is strictly to keep track of the number of times a String appears in a text file that will be read by the program. No deletions will occur. However, I think a re-balancing of the tree should occur after a couple of insertions.

Scotty Steven wrote:


Any suggestions on how to do this, or do you know of code that already exist similar to what I'm looking to do that cam be studied?

Thanks

Chan wrote:

What have you tried? Plus I think we require more details about your problem statement and we need to know your algorithm. Implementation comes much later.

Chan
.

I haven't coded anything yet. I am trying to figure out how to make this work and looking for a starting point. I know how I'd like it to work, but can not quite figure out how to begin. As such, I decided to request help from people more knowledgeable then myself on the subject. I thought someone could point me towards my goal, or perhaps knew of existing code that I could use as a starting point, but would require reworking it to make it fit.
 
Chan Ag
Rancher
Posts: 1090
14
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Scotty Steven wrote:
Chan wrote:

So that means the String's' can repeat and the tree can hold duplicates. You have not specified how you would handle inserts/updates/deletes of duplicate nodes.


No. I guess I wasn't clear enough when I said "The integer would strictly be used to record the number of times the string appears." If a duplicate String appears, instead of inserting the String that already exists, the count integer would advance one. So, the first time the word "programming" appears, it is inserted into the tree along with an integer which would hold the starting value of 1. The next time the word "programming" appears, when the search algorithm finds the duplication, it would trigger the count integer to advance to 2 and would not insert the duplicate into the tree. The compared Strings would be setup to not be case-sensitive.



Now that makes sense. So with each insert, you only increment the count if the String is equal. With this the following would be moot.

Scotty Steven wrote:
Chan wrote:

Does an equal node traverse the right subtree or does it traverse the left subtree of this node? What is your algorithm ( let us know in plain text ) going to be?


Let's go with if greater than, go right. If lesser than, go left.




Cause you are simply incrementing the count, if you get an equal node. So you don't need to traverse any sub tree. Or is there something I haven't understood?

Scotty Steven wrote:
Chan Ag wrote:How does a delete operation correct the count value in the affected nodes ( i.e for cases when you have count > 1 ).


Fair enough. In this project, it will be insert only. This is strictly to keep track of the number of times a String appears in a text file that will be read by the program. No deletions will occur. However, I think a re-balancing of the tree should occur after a couple of insertions.




Cool. But delete ( regardless of whether you are going to decrement the count or delete the node altogether ) would also not be complicated cause the tree doesn't have duplicate nodes.

Scotty Steven wrote:
I haven't coded anything yet. I am trying to figure out how to make this work and looking for a starting point. I know how I'd like it to work, but can not quite figure out how to begin. As such, I decided to request help from people more knowledgeable then myself on the subject. I thought someone could point me towards my goal, or perhaps knew of existing code that I could use as a starting point, but would require reworking it to make it fit.


Have you tried this and this link already? Do you find them helpful?
 
Marshal
Posts: 56610
172
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Do you need a tree with Strings and counts? Or do you need a tree with Strings and counts of Strings?

I meant the two to be different. You may find a much easier solution if you work out that difference.
 
Scotty Steven
Ranch Hand
Posts: 80
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Here is the project in a nut shell. Some students have other students write papers for them. This practice has been going on for decades. I am theorizing that each writer has their own style and as such, repetitively use the same words over and over. What I am setting out to do is create a program that will count the writers use of each word in document, and compare that to other written papers by the same student. I'm theorizing that a pattern should start to appear and any deviation might indicate a change in authors.

So, as part of my research, I am looking at 100's or even thousands of papers, one at a time by multiple authors and extracting the word counts from each as part of the data collection stage of my thesis. As such, some of these papers will be ten's of thousands of words long (I will be starting with published authors to establish if the theory is worthy of further research). Therefore, searching the tree efficiently to populate the words and increment the word counts efficiently is important and therefore sorting is important.

Once the tree has been completely populated, the tree needs to be sorted from largest number of occurrences of a word to smallest.
 
Ranch Hand
Posts: 221
5
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
While I'm interested in the code implementation, I'd really like to see the results of your research. Sounds fascinating.
 
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You need to look up a data structure for holding your Strings/Counts. Have you looked at Java's Collections API to see which type you think best fits your needs? There are two different base-interfaces you can start from, depending on the answer to Campbell's question (if I understood it correctly). Once you have the base interface chosen, then you can narrow down an implementation which does what you need - I think there is already a good implementation for you in either of the two branches you choose, and there are variations if you are working in a multi-threaded environment.
 
Campbell Ritchie
Marshal
Posts: 56610
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Steve Luke wrote: . . . Have you looked at Java's Collections API . . . ? . . .
No. no, no, don't look there, because I happen to know one page in that link has a little application which counts words in text. It is really long and complicated: about ten lines.
 
Campbell Ritchie
Marshal
Posts: 56610
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Earlier, I wrote: . . . about ten lines.
No, I was mistaken. even if you miss out comments and blank lines, it comes to twelve lines.

Now I know what the application is for, I would suggest you do the counting first and the sorting afterwards.
 
Campbell Ritchie
Marshal
Posts: 56610
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
There are algorithms which people use to try and identify authors' styles. It is much more complicated than counting words, I am afraid.
 
Scotty Steven
Ranch Hand
Posts: 80
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
No. no, no, don't look there, because I happen to know one page in that link has a little application which counts words in text. It is really long and complicated: about ten lines.

Earlier, I wrote:
. . . about ten lines.
No, I was mistaken. even if you miss out comments and blank lines, it comes to twelve lines.


Okay. Got anything more specific? I have been educating myself on Binary Search trees, which seem to fit what I'm looking for. I've been teaching myself by playing with one that has to do with numbers and traversing the tree with pre-ordered,in-ordered, and post-ordered, but that is as far as I've gotten.

Now I know what the application is for, I would suggest you do the counting first and the sorting afterwards.


I agree.

There are algorithms which people use to try and identify authors' styles.


I am aware that there are more complicated was of doing this. They explore word use patterns, but ignore reuse of words. With students checking their work through grammar correction sites such as Grammerly.com, those style tests are becoming less reliable. My thesis is to show that a much simpler way exists; that the number of occurrences/some word count will also show the likelihood that a particular word belongs to an author.
 
Campbell Ritchie
Marshal
Posts: 56610
172
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Scotty Steven wrote: . . . Got anything more specific? . . .
Not unless you manage to find the little app called Freq or similar on the Map page.
. . . My thesis is to show that a much simpler way exists; that the number of occurrences/some word count will also show the likelihood that a particular word belongs to an author.
Oh, it's a thesis? Go for it!!

While I wasn't looking for the Map page, I noticed there is a sorted map interface, in the same “trail” of the Java Tutorials. I had quite forgotten about that; you might be able to use a sorted map to do the whole thing in one fell swoop. The more I read in the Java Tutorials, the easier the whole thing seems to be.
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!