Win a copy of The Way of the Web Tester: A Beginner's Guide to Automating Tests this week in the Testing forum!

# Tough Question: Detect Partial Duplicates

Justin Filmer
Greenhorn
Posts: 27
Let's say I have two Strings...

What I would like to do is detect what percentage of words match between the two... I would LOVE to do sequence detection (http://59.108.48.12/proceedings/sigir/sigir2010/docs/p675.pdf), but I think that's too difficult.

This method is severely flawed, but at least it'll give me a start. Let's first find the words that are similar between the two:

11 words match
s1 has 18 words
s2 has 15 words
Use smaller of the two...
% match = 11/15 = 74% match

Remove common words (my,is,I,to,and). Resulting words...
name, Justin, like, code, have, fun
% match = 6/15 = 40% match

Is there already a Java function that could give me some percentage of partial duplicates? If not, how can I write an algorithm that can compare two strings and figure out what percentage they have in common?

Here's my pseudocodo with the approach that I don't like...

I would really appreciate better ideas or code snippets... Thanks!

Paul Clapham
Sheriff
Posts: 21443
33
I was about to post the standard procedure, which would be to put the words from the two sentences into two sets, then find their intersection.

But then I noticed you said you didn't like that. Why not?

Justin Filmer
Greenhorn
Posts: 27
Paul Clapham wrote:I was about to post the standard procedure, which would be to put the words from the two sentences into two sets, then find their intersection.

But then I noticed you said you didn't like that. Why not?

Please go ahead and post it - it'll definitely help - if not for this, then for something else.
The reason I don't like it is because my reasoning is somewhat flawed. If we're looking for partial duplicates between sentences or paragraphs (strings in general), we should be also looking at phrases and ORDER of phrases within the strings, not just the words. The algorigthm I proposed doesn't take this into account at all.

By my previous reasoning, these two sentences would be considered to have a 66% match, when in my mind they should have a 0% match:

It would be way better and much favorable in my mind to compare bunches of two or three word phrases between the two strings to see how many matches show up. However, that seems like a problem that is out of the scope of "beginning java". If that is indeed easily possible, please let me know how! :)

Paul Clapham
Sheriff
Posts: 21443
33
• 1
Well, I did post it. And you posted it too. The pseudocode, that is. That's what I was going to post too. (Except lines 3 and 4 of your pseudocode are unnecessary.)

But I think you have to stop now and figure out your requirements before you start looking for algorithms. And no, you don't really have your requirements firmed up yet. You provided an example where word order was significant. Here's another one:

I don't know what you would make of that, but at any rate it might help if you started generating all kinds of pairs and decided what the match was and also why. After a while you should be able to put pen to paper and write down your rules.

Justin Filmer
Greenhorn
Posts: 27
Good point Paul. Thumbs up. I posted the pseudocode to show that I am thinking about how to code it - I just don't have the Java experience to actually turn it into real code.

Anyway, I've been thinking about it and think that this kind of algorithm may not be bad:

Would you be able to help me code this? I'm thinking maybe using subStrings or something would help here.