• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Tim Cooke
  • Campbell Ritchie
  • Ron McLeod
  • Liutauras Vilda
  • Jeanne Boyarsky
Sheriffs:
  • Junilu Lacar
  • Rob Spoor
  • Paul Clapham
Saloon Keepers:
  • Tim Holloway
  • Tim Moores
  • Jesse Silverman
  • Stephan van Hulst
  • Carey Brown
Bartenders:
  • Al Hobbs
  • Piet Souris
  • Frits Walraven

Interesting question - how would you do this in Java

 
Ranch Hand
Posts: 56
Tomcat Server Java Windows
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Question:
You have been given a large list of people’s names and have been asked to highlight all duplicates where the exact same name appears more than once, as well as highlight any possible duplicates where there may be misspellings of the same name. Describe an algorithm you could use that would produce the required output.

The first part is easy - I could easily find duplicates using a nested for loop but how would find duplicates where there MAY be misspellings??? Do you create hash values for the names and compare hash values - if the hash values are close (not sure how you check that) then there may be similarly but misspelled names?

Any help will be appreciated!
 
lowercase baba
Posts: 13018
66
Chrome Java Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
it would depend on what you consider 'similar' names. Is "Thomas" similar to "Tomas"? Is Fred similar to Frederick/Fredrick/Fredrik/Frederik?

I would say that perhaps you can define what is close enough...

For example,

1) if two names were identical but one letter off, as if the user made a typo:

Frederick vs. Fredrrick

2) Names that differ by having one letter removed:

Stacey vs Stacy

3) names that have two letters swapped (again, most likely a typo):

Brian vs Brain

4) you may also want to compile a list of 'aliases', and consider all equal:
Thomas = Tomas = Tom = Thom = Tommy

As to the first part, I wouldn't use a nested loop. I'd make a map with the name as the key and a counter as the value.
 
Author and all-around good cowpoke
Posts: 13078
6
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I got involved with phonetic ("sounds like") matching of names for a legal services client - names being transcribed from recordings get lots of different spellings.

For a demonstration of this kind of matching take a look at my demo.

The Apache organization's CODEC toolkit has the code for Metaphone and other phonetic matching algorithms

Bill
 
Ranch Hand
Posts: 1296
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Another option would be to compute the Levenshtein distance between the two words and test whether they are sufficiently close.



See Levenshtein distance
 
Rancher
Posts: 4686
7
Mac OS X VI Editor Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Garrett Rowe wrote:Another option would be to compute the Levenshtein distance


I agree, the Levenshtein algorithm works better for my real world cases than even Metaphone.
 
Ranch Hand
Posts: 75
Eclipse IDE Oracle Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I've done a bit of reading after checking out this post, and found out about the Damerau-Levenshtein distance which I believe would be better suited.
Besides the three operations used in computing the Levenshtein distance - insertion, deletion and substitution - this one comes up with a fourth, the transposition of two adjacent characters, particularly useful in case of misspellings.

Cheers
Claudiu
 
Ranch Hand
Posts: 334
2
Netbeans IDE Tomcat Server Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Michael Labuschagne wrote:The first part is easy - I could easily find duplicates using a nested for loop but ....



People already addressed the interesting part.

A more efficient way to find duplicates would be to use a HashSet or a TreeSet.
 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.
reply
    Bookmark Topic Watch Topic
  • New Topic