• Post Reply Bookmark Topic Watch Topic
  • New Topic

Search a document  RSS feed

 
Sid Scud
Ranch Hand
Posts: 32
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi All,
the problem i am facing is that, i have to search a word/text document for certain keywords. For every keyword, I am launching a new thread to search in the docuemnt. But the time consumed for a word docuemnt of size 103 KB is around 5 or 6 sec.
Please give some suggestions.
Thanks in advance
-sid
 
Peter den Haan
author
Ranch Hand
Posts: 3252
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The easiest thing to do is probably to check for inefficiencies in your implementation. For a start, since you're unlikely to be I/O bound, spawning threads will not help you. In fact, if the number of keywords to search for is large, it will slow you down. Also, examine carefully the number of intermediate objects (esp Strings) you create - your execution time could easily be dominated by the garbage collector.
Failing that, try this. For large amounts of text or large number of keywords, your algorithm is flawed. You are doing a simple linear search for each keyword. If you have N words in your text and M keywords, the time taken will be O(N*M). You can do better than that.
The idea is this. Put the keywords in a hash table. Creating the table takes O(M) time, while searching for a keyword in the hash table takes approximately a constant (O(1)) time. Then go through your text, chop it up in words, and search each word in the keyword hashtable. The total time spent is O(M) + O(N)*O(1) = O(M) + O(N), giving you far better scalability.
Beware though - it is probably best not to use a standard Hashtable or HashMap, as they would force you to create an intermediate String for every word in your text. With your own hash table, you don't have to physically extract each word from the document, but can simply determine its hash code instead. (This will probably read like gibberish unless you're familiar with hashing... sorry).
- Peter

[This message has been edited by Peter den Haan (edited May 20, 2001).]
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!