MD5 Generation of large files

 
Mark Mescher
Ranch Hand
Posts: 34
Hi out there,
I have to test whether two files on two different systems are identical, and I thought MD5 would be the right tool for this.
It all works fine, but generating an MD5 hash for very large files is slow (you can drink a lot of coffee during that operation :-)).
So my little question is: if I generate the hash only for, say, the first 1024 (or 4096 or ...) bytes of a file, do you think that hash would still be unique?
Thanks a lot
Bye
Mark
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
Don't be silly, you can never be SURE the files are identical unless you digest the entire file.
However, you can efficiently decide that the files are NOT identical if digests of selected parts are NOT identical. Which parts to choose depends on where the files come from.
Naturally you are comparing the file lengths first, right?
Bill
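
To illustrate Bill's point, here is a minimal sketch of such a quick rejection test, assuming Java 11 or later (for `InputStream.readNBytes(int)`). The class name `QuickFileCheck`, the methods `prefixDigest` and `definitelyDifferent`, and the 4096-byte prefix size are made up for illustration and are not from the thread. Each system would compute the length and prefix digest of its own file and exchange only those values. A mismatch proves the files differ; a match proves nothing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class QuickFileCheck {
    // Assumed prefix size; which part of the file to sample depends on your data.
    static final int PREFIX_BYTES = 4096;

    /** MD5 over at most the first {@code limit} bytes of the file. */
    static byte[] prefixDigest(Path file, int limit)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes keeps reading until 'limit' bytes or end of file.
            md.update(in.readNBytes(limit));
        }
        return md.digest();
    }

    /**
     * Returns true if the files are certainly different.
     * Returns false if the cheap checks could not tell them apart,
     * which does NOT mean the files are identical.
     */
    static boolean definitelyDifferent(long lengthA, byte[] prefixA,
                                       long lengthB, byte[] prefixB) {
        return lengthA != lengthB || !Arrays.equals(prefixA, prefixB);
    }

    // Prints the two values one system would send to the other.
    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        byte[] digest = prefixDigest(file, PREFIX_BYTES);
        System.out.println(Files.size(file) + " "
                + String.format("%032x", new BigInteger(1, digest)));
    }
}
```

If `definitelyDifferent` returns false, you still need a digest of the whole file (or a full comparison) before you can treat the files as equal.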
 
Mark Mescher
Ranch Hand
Posts: 34
Hi,
sure, I compare filename and length first, and after that the MD5. As I see it, a hash of the first 1024 bytes would be enough to be nearly unique, or not?
Mark
 
Ilja Preuss
author
Sheriff
Posts: 14112
A hash is *never* unique. MD5 was designed to make it infeasible to *deliberately* create two files with the same hash (a goal it no longer meets, since practical collisions have been demonstrated), but in any case there can't be a guarantee that two different files will have different hashes. After all, there are far more possible file contents than the 2^128 possible MD5 values.

So the only *reliable* way to compare two files is byte by byte. Comparing hashes first makes sense only if calculating and comparing a hash is much faster (because the bytes would have to be sent over a slow network, for example); in that case, do the byte-by-byte compare only if the hashes are equal.

If you are working locally, a byte-by-byte compare will be faster than computing and comparing hashes anyway: hashing has to read every byte of both files too, while a direct compare can stop at the first difference.
[ October 25, 2004: Message edited by: Ilja Preuss ]
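
For reference, a rough sketch of both approaches from this post, assuming Java 9 or later (for `readNBytes` and the range form of `Arrays.equals`). The class name `FileCompare`, the method names `sameContent` and `md5`, and the 64 KB buffer size are invented for this example. `sameContent` is the local byte-by-byte compare; `md5` streams the whole file through a `MessageDigest` so that even very large files never have to fit in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class FileCompare {
    private static final int BUFFER_SIZE = 64 * 1024; // assumed, tune as needed

    /** Byte-by-byte compare of two local files; stops at the first difference. */
    static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // cheapest check first
        }
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            byte[] bufA = new byte[BUFFER_SIZE];
            byte[] bufB = new byte[BUFFER_SIZE];
            while (true) {
                // readNBytes fills the buffer unless the stream ends; 0 means EOF.
                int nA = inA.readNBytes(bufA, 0, bufA.length);
                int nB = inB.readNBytes(bufB, 0, bufB.length);
                if (nA != nB || !Arrays.equals(bufA, 0, nA, bufB, 0, nB)) {
                    return false;
                }
                if (nA == 0) {
                    return true; // both streams ended without a mismatch
                }
            }
        }
    }

    /** MD5 of the entire file, read in chunks. */
    static byte[] md5(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[BUFFER_SIZE];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sameContent(Path.of(args[0]), Path.of(args[1])));
    }
}
```

For files on two different systems, each side would run `md5` locally and only the 16-byte results would cross the network; equal digests make identical content very likely but, as said above, not certain.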
 