
File comparison

 
fahad siddiqui
Ranch Hand
Posts: 85
I have a couple of text files to compare. The files are not large, around 10 MB each.
Would it be faster to load the complete contents of both files into HashMaps and compare those, or to read the files line by line and compare as I go, or something along those lines?

What would be the best way to do the comparison? I need to generate a third file listing which lines would have to be removed and which would have to be added to turn the older file into the newer one.

Any suggestions?
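
For reference, the "load everything into a hash structure" idea from the question could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `SetLineDiff`, the method `linesMissingFrom`, and the `- ` / `+ ` prefixes in the output file are made up for this example, and the files are assumed to be UTF-8. It treats each file as a set of lines, so it ignores line order and duplicate lines, which may or may not be acceptable for the data in question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SetLineDiff {

    /** Returns the lines of {@code from} that do not occur anywhere in {@code other}. */
    public static List<String> linesMissingFrom(List<String> from, List<String> other) {
        Set<String> otherLines = new HashSet<>(other); // constant-time lookups on average
        List<String> result = new ArrayList<>();
        for (String line : from) {
            if (!otherLines.contains(line)) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: SetLineDiff <oldFile> <newFile> <diffFile>");
            return;
        }
        // Two files of about 10 MB fit comfortably in memory.
        List<String> oldLines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> newLines = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);

        List<String> report = new ArrayList<>();
        for (String line : linesMissingFrom(oldLines, newLines)) {
            report.add("- " + line); // only in the old file: would be removed
        }
        for (String line : linesMissingFrom(newLines, oldLines)) {
            report.add("+ " + line); // only in the new file: would be added
        }
        Files.write(Paths.get(args[2]), report, StandardCharsets.UTF_8);
    }
}
```

If the position of a line matters, or the same line can legitimately appear several times, a set-based comparison like this is not enough and an order-aware diff is needed.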
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Are these lines in order by any kind of key? I'm guessing not, from the description. Too bad, because the sorted case is easy.
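
To show why the sorted case is easy, here is a minimal sketch of a single merge-style pass over two sorted inputs. It assumes, purely for illustration, that the whole line is the key and that both files are sorted by `String.compareTo`; the class name `SortedLineDiff` and the `- ` / `+ ` prefixes are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedLineDiff {

    /**
     * Walks both sorted lists in step. Whichever side holds the smaller line
     * has a line the other side lacks, so it is reported and that side advances.
     */
    public static List<String> diffSorted(List<String> oldLines, List<String> newLines) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < oldLines.size() && j < newLines.size()) {
            int cmp = oldLines.get(i).compareTo(newLines.get(j));
            if (cmp == 0) {          // same line in both files
                i++;
                j++;
            } else if (cmp < 0) {    // only in the old file
                out.add("- " + oldLines.get(i++));
            } else {                 // only in the new file
                out.add("+ " + newLines.get(j++));
            }
        }
        while (i < oldLines.size()) {
            out.add("- " + oldLines.get(i++));
        }
        while (j < newLines.size()) {
            out.add("+ " + newLines.get(j++));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> oldLines = Arrays.asList("apple", "banana", "cherry");
        List<String> newLines = Arrays.asList("apple", "cherry", "date");
        diffSorted(oldLines, newLines).forEach(System.out::println);
        // prints "- banana" then "+ date"
    }
}
```

Because each line is visited once, the same loop works when reading both files through a `BufferedReader` instead of holding them in memory.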

There are many good tools to do this kind of thing for source code or similar text files. If you google for "textdiff" you'll find lots. Mine was the 2nd hit just now.
 
John Melton
Ranch Hand
Posts: 49
Actually, a good way to do this is to compute a cryptographic hash of each file. The java.security package lets you compute an SHA or MD5 hash of a file's contents. Recent research has shown that these algorithms are not quite as secure as we once thought, but for your purposes they should be just fine. Just google "MD5 hash file java" or "SHA hash file java", and that should get you there.
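
A minimal sketch of that approach, using `java.security.MessageDigest` and streaming the file so it never has to be held in memory at once. The class name `FileDigestCompare` and its two methods are hypothetical names for this example, and SHA-256 is an assumed choice of algorithm rather than anything the post prescribes. Note that comparing digests only answers whether the two files differ, not where.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileDigestCompare {

    /** Streams the file through the named digest algorithm, e.g. "MD5" or "SHA-256". */
    public static byte[] digest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read); // feed only the bytes actually read
            }
        }
        return md.digest();
    }

    /** True when both files produce the same SHA-256 digest. */
    public static boolean sameContent(Path a, Path b)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(digest(a, "SHA-256"), digest(b, "SHA-256"));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: FileDigestCompare <fileA> <fileB>");
            return;
        }
        boolean same = sameContent(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "files are identical" : "files differ");
    }
}
```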

http://uncc.dyndns.org
 
steve souza
Ranch Hand
Posts: 862
Hashing the whole file would tell you whether the files differ, but it wouldn't let you do a line-by-line comparison. If you have to do a line-by-line comparison anyway, you might as well skip the hash.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
This is an OLD problem in computing and not easily solved. Consider, for example, one file that has a couple of lines inserted and some deleted. When comparing line by line you know when you hit the start of the difference, but HOW do you resynchronize at the areas where the files match again?
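
One textbook answer to the resynchronization question is to compute a longest common subsequence (LCS) of the two line lists: lines on the LCS are the places where the files match, and everything else is an insertion or a deletion. The sketch below is a small illustration of that idea, not a recommendation for production use; the class name `LcsLineDiff` and the `- ` / `+ ` prefixes are invented for the example. Its table needs memory proportional to (lines in old) x (lines in new), which is too much for files with very many lines, so real diff tools use more memory-efficient algorithms such as Myers' O(ND) algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LcsLineDiff {

    /** Returns "- line" for deletions and "+ line" for insertions, in file order. */
    public static List<String> diff(List<String> a, List<String> b) {
        int n = a.size();
        int m = b.size();
        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a.get(i).equals(b.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (a.get(i).equals(b.get(j))) {
                i++;                               // matching line: both files stay in sync
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a.get(i++));        // dropping the old line loses no matches
            } else {
                out.add("+ " + b.get(j++));        // otherwise the new line was inserted
            }
        }
        while (i < n) {
            out.add("- " + a.get(i++));
        }
        while (j < m) {
            out.add("+ " + b.get(j++));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> oldLines = Arrays.asList("a", "b", "c", "d");
        List<String> newLines = Arrays.asList("a", "x", "c", "d", "e");
        diff(oldLines, newLines).forEach(System.out::println);
        // prints "- b", "+ x", "+ e"
    }
}
```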

A google search for "open source file difference" will get you lots of other people's solutions to the problem.

Bill
 