• Post Reply Bookmark Topic Watch Topic
  • New Topic

searching a large list of words

 
John Wickerson
Greenhorn
Posts: 13
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi guys,

I have a large text file containing 216556 words, one per line, in alphabetical order.

I want to make a procedure that, given a word (in String or char[] form, I don't mind), returns true if that word is in the list, false otherwise.

Firstly, would it be feasible to store this in memory, perhaps as some sort of array, or maybe as some sort of tree? Or will I have to keep it on disk? And if I keep it as it is, on disk as a text file, is there a way to jump straight in to the 123rd line, or do I have to search from the beginning of the text file for each word I want to check?

Thanks guys - much appreciated!

John
 
Paul Clapham
Sheriff
Posts: 21875
36
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The simplest solution is to read the words in and store them in a HashSet of strings. You only have a couple of megabytes of data so there shouldn't be a memory problem. Then you can use the set's contains() method to find if another string is in the set or not.
 
Edwin Dalorzo
Ranch Hand
Posts: 961
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
If the file is not dynamically changing, and since the words are order, another option could be to use a binary search.

You already know how many words there are, do successive divisions until you find the word you are looking for.
 
John Wickerson
Greenhorn
Posts: 13
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
great, thanks for your suggestions guys.

Could I pick your brains further?

What if I now wanted to check if there were any words *beginning* in a certain string. For example, "AP" should give true (since APPLE is a word), while "XJ" should return false.

I've thought about storing the words in some sort of tree, with up to 26 branches per node, one per letter. But this seems a rather complicated approach, and I'm not sure I'm up for it!

Is there a method in HashSet perhaps, like contains(), that would be able to efficiently find words starting with a certain string?

Thanks ever so much,

John
 
Paul Clapham
Sheriff
Posts: 21875
36
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Nope. If you want "starts-with" then a Set isn't going to do it. The binary search idea would work, though, because the Arrays.binarySearch() method will return a pointer to where you would insert the string if it isn't already in the array. You would look at the entry after that to see if your string is a substring of it. Or a tree would work well too.

I'm not going to go into details because I dislike the development process wherein the user keeps the requirements secret and only deals them out after you have a design which isn't compatible with them. It's terribly inefficient. A better way to do this would be to start by putting down ALL your requirements, and then work on a design.
 
John Wickerson
Greenhorn
Posts: 13
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
thanks
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
For real speed, look into a Ternary Search Tree. The applet in this article takes a while to load, but it can find words as fast as you type. Pretty cool.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
[Paul C]: If you want "starts-with" then a Set isn't going to do it.

A plain Set won't, but a SortedSet (typically a TreeSet) can. The tailSet() method would return a sorted subset which begins with either the word you request, or the first word after that. So if you have a TreeSet filled with words from the dictionary, treeSet.tailSet("zebr") will return a SortedSet that (probably) begins with "zebra".

The ternary search tree which Stan linked to may well be a better option in the long run. But TreeSet does have the benefit of being already implemented as part of the JDK.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
For starts with: TST, TST, TST! Ok, enough cheeleading. Check it out.
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!