Win a copy of The Little Book of Impediments (e-book only) this week in the Agile and Other Processes forum!
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

Comparison algorithm

 
Maxim Katcharov
Ranch Hand
Posts: 113
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I've frequently used the standard comparison tools that use the lcs algorithm or compare whole lines against each other, and have found them somewhat lacking. Moved sections aren't handled well at all, and the output is usually hard to navigate.

So I've implemented my own algorithm for comparing documents. My algorithm handles moved sections somewhat gracefully and can run a comparison in between logarithmic and linear time (leaning towards linear).

I'd like to get some feedback on if others see this as useful, or advice on how it may be improved. If anyone's interested in having a look, you can find it at:

http://katcharov.dyndns.org:8080/

(it's on port 8080).

Clicking on a blue section will take you to its pair. Up to 100000 characters may be compared right now.

I don't know if this is the best forum for this (perhaps programming diversions would have been more appropriate?), but it seems to belong here more than elsewhere.
 
Stefan Wagner
Ranch Hand
Posts: 1923
Linux Postgres Database Scala
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I don't think this is the best forum for the question, but do you know tkdiff?
It's nicely marking small differnces in lines.
I'm not sure how it handles missing sections.
 
Maxim Katcharov
Ranch Hand
Posts: 113
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thank you, but unfortunately I couldn't get that program to work. I'm on windows and do not have diff installed. Diff uses lcs, to the best of my knowledge, so it handles moves quite poorly. I would have liked to have seen the interface that they used for it though.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You're welcome to look at this code I munged from a Java translation from C. I'm doing comparing archived versions of web pages, fairly small, line at a time. Performance was not a high priority compared to output flexibility.

http://www.surfscranton.com/architecture/TextDiff.htm

Is your source available, too?
 
Maxim Katcharov
Ranch Hand
Posts: 113
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Interesting code. I also use a hashtable for a step of my program, but only one of my strings is hashed. That algorithm runs quickly, but isn't as detailed, it seems - I ran it on two versions of code that I had refactored, and while the changes were small (renamings), it picked up entire lines which amounted to quite a bit more than was changed.

Your zip was missing some of the files that should have been found in your .util package, but I got around that.

I'd like to release my code under a non-commercial license, but I'm unsure as to how novel the algorithm is (if at all). If it turns out to be, and is useful, I suppose I'd still have the chance to protect it against commercial use later on.

Any suggestions for a non-commercial license?


Oh, and I've updated that page ( http://katcharov.dyndns.org:8080/ ) to include an example of LCS output that may be compared against the output of my algorithm.
[ October 09, 2005: Message edited by: Maxim Katcharov ]
 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic