• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Ron McLeod
  • Paul Clapham
  • Devaka Cooray
  • Tim Cooke
Sheriffs:
  • Rob Spoor
  • Liutauras Vilda
  • paul wheaton
Saloon Keepers:
  • Tim Holloway
  • Tim Moores
  • Mikalai Zaikin
  • Carey Brown
  • Piet Souris
Bartenders:
  • Stephan van Hulst

Tough Question: Detect Partial Duplicates

 
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Let's say I have two Strings...


What I would like to do is detect what percentage of words match between the two... I would LOVE to do sequence detection (http://59.108.48.12/proceedings/sigir/sigir2010/docs/p675.pdf), but I think that's too difficult.

This method is severely flawed, but at least it'll give me a start. Let's first find the words that are similar between the two:

11 words match
s1 has 18 words
s2 has 15 words
Use smaller of the two...
% match = 11/15 = 74% match

Remove common words (my,is,I,to,and). Resulting words...
name, Justin, like, code, have, fun
% match = 6/15 = 40% match

Is there already a Java function that could give me some percentage of partial duplicates? If not, how can I write an algorithm that can compare two strings and figure out what percentage they have in common?

Here's my pseudocodo with the approach that I don't like...


I would really appreciate better ideas or code snippets... Thanks!
 
Marshal
Posts: 28304
95
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I was about to post the standard procedure, which would be to put the words from the two sentences into two sets, then find their intersection.

But then I noticed you said you didn't like that. Why not?
 
Justin Filmer
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Paul Clapham wrote:I was about to post the standard procedure, which would be to put the words from the two sentences into two sets, then find their intersection.

But then I noticed you said you didn't like that. Why not?



Please go ahead and post it - it'll definitely help - if not for this, then for something else.
The reason I don't like it is because my reasoning is somewhat flawed. If we're looking for partial duplicates between sentences or paragraphs (strings in general), we should be also looking at phrases and ORDER of phrases within the strings, not just the words. The algorigthm I proposed doesn't take this into account at all.

By my previous reasoning, these two sentences would be considered to have a 66% match, when in my mind they should have a 0% match:


It would be way better and much favorable in my mind to compare bunches of two or three word phrases between the two strings to see how many matches show up. However, that seems like a problem that is out of the scope of "beginning java". If that is indeed easily possible, please let me know how! :)
 
Paul Clapham
Marshal
Posts: 28304
95
Eclipse IDE Firefox Browser MySQL Database
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Well, I did post it. And you posted it too. The pseudocode, that is. That's what I was going to post too. (Except lines 3 and 4 of your pseudocode are unnecessary.)

But I think you have to stop now and figure out your requirements before you start looking for algorithms. And no, you don't really have your requirements firmed up yet. You provided an example where word order was significant. Here's another one:

I don't know what you would make of that, but at any rate it might help if you started generating all kinds of pairs and decided what the match was and also why. After a while you should be able to put pen to paper and write down your rules.
 
Justin Filmer
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Good point Paul. Thumbs up. I posted the pseudocode to show that I am thinking about how to code it - I just don't have the Java experience to actually turn it into real code.

Anyway, I've been thinking about it and think that this kind of algorithm may not be bad:


Would you be able to help me code this? I'm thinking maybe using subStrings or something would help here.
 
Evacuate the building! Here, take this tiny ad with you:
Smokeless wood heat with a rocket mass heater
https://woodheat.net
reply
    Bookmark Topic Watch Topic
  • New Topic