Finding Most Common Phrase Occurrence In String?

 
Greenhorn
Posts: 27
Hey guys, this is an interesting problem.
Let's say I have a string like this:


I want to pick out the two- or three-word phrases that occur most often and second most often in the string, while ignoring common words such as "I", "and", "is", etc. In the example string I provided, the method should return "coding algorithms" as the most common phrase and "love writing code" as the second most common phrase.

Any ideas / code samples on how to do this? I'm thinking first, remove the common words, then use some type of dictionary that keeps track of relative percentages for all consecutive phrases. Then pick the highest two percentages from the dictionary. Now, how can we actually turn that into Java code?
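As a starting point, the idea above could be sketched roughly like this. It is only a minimal sketch under some assumptions: the class name PhraseCounter, the stop-word list, and the tie-breaking rule are all made up for illustration, it counts raw occurrences rather than percentages, and it ignores sentence boundaries. It needs Java 9+ for Set.of.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PhraseCounter {
    // Hypothetical stop-word list (an assumption); extend it as needed.
    static final Set<String> COMMON_WORDS = Set.of("i", "and", "is", "the", "a");

    // Counts every run of 2 or 3 consecutive words that remains
    // after the common words have been dropped.
    static Map<String, Integer> countPhrases(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z']+")) {
            if (!w.isEmpty() && !COMMON_WORDS.contains(w)) {
                words.add(w);
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int size = 2; size <= 3; size++) {
            for (int i = 0; i + size <= words.size(); i++) {
                String phrase = String.join(" ", words.subList(i, i + size));
                counts.merge(phrase, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to n phrases, most frequent first.
    // Ties are broken alphabetically (an arbitrary choice for this sketch).
    static List<String> topPhrases(String text, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countPhrases(text).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample text, not the string from the post.
        String sample = "red fox runs. The red fox sleeps and the red fox runs";
        System.out.println(topPhrases(sample, 2));
    }
}
```

One thing to watch for: a three-word phrase can never occur more often than the two-word phrases inside it, so the shorter phrases tend to win unless the longer ones are given some preference.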
 
Justin Filmer
Greenhorn
Posts: 27
Thank you, Ulf Dittmer, for fixing my String! Does anyone have any ideas for the algorithm/coding aspect of the problem?
 
Marshal
Posts: 22450
If you'd SearchFirst you'd find a few similar threads. In the last one I encountered I suggested separating the problem into two sub-problems. In your case that would be three:
1) count the occurrences of each word
2) filter out some words (I, is, etc)
3) sort the remainder

1) is usually done by using a Map<String,Integer>, where the keys are the words and the values are the occurrences. Use a TreeMap with a case-insensitive Comparator (such as String.CASE_INSENSITIVE_ORDER) to ignore the case of the words.
2) can be done by having a Collection<String> (or Set<String>) with too-common words, then removing those from the map (map.keySet().removeAll(commonWords)).
3) can be done by adding all the Map.Entry objects into a List that you then sort using Collections.sort and a custom Comparator.

After those steps you can use the List to access the entries in the right order.
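The three steps could look roughly like the sketch below. Treat it as an illustration only: the class name WordFrequency, the method rankWords, and the splitting regex are assumptions, and it counts single words, so it would still need adapting for phrases. The common words are kept in a case-insensitive TreeSet so that removeAll matches "I" and "i" alike.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class WordFrequency {
    // Returns the map entries sorted by count, highest first.
    static List<Map.Entry<String, Integer>> rankWords(String text, Set<String> commonWords) {
        // 1) Count occurrences; the comparator makes "Code" and "code" the same key.
        Map<String, Integer> counts = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String word : text.split("[^A-Za-z']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // 2) Filter out the too-common words.
        counts.keySet().removeAll(commonWords);

        // 3) Sort the remaining entries with a custom Comparator.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return Integer.compare(b.getValue(), a.getValue());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        // Case-insensitive set, so it agrees with the map's key comparison.
        Set<String> common = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        common.addAll(Arrays.asList("I", "is", "and"));

        // Made-up sample text for illustration.
        String sample = "Code is fun and code is art. I love CODE";
        for (Map.Entry<String, Integer> e : rankWords(sample, common)) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Because Collections.sort is stable, words with equal counts stay in the TreeMap's alphabetical order.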