
MD5sum performance

 
Arjun Shastry
Ranch Hand
Posts: 1899
Hi,
The md5sum utility built into Linux gives a 32-hex-digit hash. We are planning to use it to check whether a file is corrupt, i.e. to compare the same file on two machines via its MD5 hash. But md5sum appears to take a long time, and the program needs to compare almost 1.5 million files. Is there a more efficient way to compare files for equality?
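For reference, this is roughly what computing an MD5 hash of a file looks like in Java, streaming the file through `MessageDigest` so memory use stays constant regardless of file size. This is a minimal sketch, not the poster's actual code; the class and method names are my own.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.security.MessageDigest;

public class Md5OfFile {

    // Stream the file through MessageDigest in 8 KB chunks and
    // return the digest as the familiar 32-hex-digit string.
    static String md5Hex(File f) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("md5demo", ".txt");
        try (FileWriter w = new FileWriter(f)) {
            w.write("hello");
        }
        System.out.println(md5Hex(f)); // 5d41402abc4b2a76b9719d911017c592
        f.delete();
    }
}
```

Note that the cost here is dominated by reading the file from disk, which is why doing this for 1.5 million files is slow no matter how the hashing itself is tuned.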
 
Tony Docherty
Bartender
Posts: 2989
It really depends on how accurately you want to compare them for equality. A very rough but very quick approach would be to compare the file sizes whereas a very accurate but very slow approach would be to compare each byte.

Note: if two files have the same MD5 hash, it does not guarantee the files are the same; it just means both files hash to the same value. Having said that, it's very unlikely you'll have two different files with the same hash value by accident, although it is possible to deliberately generate a file that hashes to the same value as an existing file.
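The size-check-then-byte-compare idea above can be sketched like this: comparing lengths is essentially free (a filesystem metadata lookup), and the byte-by-byte pass only runs when the sizes already match. This is an illustrative sketch, with my own class and method names.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;

public class FileCompare {

    // Cheap check first: different sizes mean different files.
    // Only if sizes match do we pay for reading both files.
    static boolean sameContents(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        try (InputStream ia = new BufferedInputStream(new FileInputStream(a));
             InputStream ib = new BufferedInputStream(new FileInputStream(b))) {
            int ca, cb;
            do {
                ca = ia.read();
                cb = ib.read();
                if (ca != cb) {
                    return false; // first mismatching byte ends the comparison
                }
            } while (ca != -1);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        File a = File.createTempFile("cmp", ".a");
        File b = File.createTempFile("cmp", ".b");
        try (FileWriter w = new FileWriter(a)) { w.write("hello"); }
        try (FileWriter w = new FileWriter(b)) { w.write("hello"); }
        System.out.println(sameContents(a, b)); // true
        a.delete();
        b.delete();
    }
}
```

A nice property of the byte-by-byte approach is that it bails out at the first difference, whereas hashing always reads both files to the end.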
 
Dieter Quickfend
Bartender
Posts: 543
You mean you're generating your MD5s at compare time? If you had generated them when saving the files, you wouldn't be in this mess. Change your save code to store a checksum, run a script to save checksums for all current files, and then just compare the checksums at runtime, and you're fine.

Anyway, try spawning multiple threads to compare files in parallel; as long as your garbage collector isn't taking a bigger timeslice than your worker threads, you can improve speed. The more detail you want in the comparison, the slower it will get.

As Tony said, comparing file sizes would be quite fast. If you compare sizes first and only compare bytes on same-sized files, that would likely be a very fast way to go.
 
Tony Docherty
Bartender
Posts: 2989
Dieter Quickfend wrote: Anyway, try spawning multiple threads to compare files in parallel; as long as your garbage collector isn't taking a bigger timeslice than your worker threads, you can improve speed. The more detail you want in the comparison, the slower it will get.

I wouldn't go mad creating loads and loads of threads; you are likely to become bogged down in I/O blocking very quickly unless the files are spread over many different drives. Do some serious timing tests to see what number of threads is optimal for your system before just throwing threads at the problem.
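A bounded thread pool is the usual way to follow this advice: instead of one thread per file, submit all the jobs to a fixed-size `ExecutorService` and tune the pool size from timing tests. The sketch below uses trivial stand-in tasks (squaring an integer) in place of real "hash one file" jobs, and the pool size of 4 is an arbitrary placeholder, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledWork {

    // Run n tasks on a pool of the given size and combine the results.
    // Each task here stands in for "hash or compare one file".
    static int runTasks(int n, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            final int id = i;
            results.add(pool.submit(() -> id * id)); // placeholder work
        }
        int sum = 0;
        for (Future<Integer> f : results) {
            sum += f.get(); // blocks until that task completes
        }
        pool.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // 0 + 1 + 4 + ... + 81
        System.out.println(runTasks(10, 4)); // 285
    }
}
```

The point of the fixed pool is exactly Tony's: the disk is the bottleneck, so beyond a small number of threads the extra ones just queue up behind I/O.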
 
Arjun Shastry
Ranch Hand
Posts: 1899
Thanks. Creating many threads is one option, but I need to see its effect on performance. Storing the MD5 sum and total byte count in a cache or a table and comparing those at run time is another option I'm considering.
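That cache idea might look something like the sketch below: a map from file path to a precomputed (size, MD5) entry, so a runtime comparison never has to re-read the file. The class, field, and path names are all hypothetical, and in practice the map would be persisted to a file or database table rather than held only in memory.

```java
import java.util.HashMap;
import java.util.Map;

public class ChecksumCache {

    // Precomputed facts about one file: its length and its MD5 hash.
    static class Entry {
        final long size;
        final String md5;

        Entry(long size, String md5) {
            this.size = size;
            this.md5 = md5;
        }
    }

    // path -> precomputed entry; would be loaded from persistent storage
    static final Map<String, Entry> cache = new HashMap<>();

    // Compare a cached file against a (size, md5) pair reported by the
    // other machine. The size check rules out most mismatches without
    // ever touching the hash.
    static boolean sameAs(String path, long otherSize, String otherMd5) {
        Entry e = cache.get(path);
        return e != null && e.size == otherSize && e.md5.equals(otherMd5);
    }

    public static void main(String[] args) {
        cache.put("/data/a.bin",
                new Entry(5, "5d41402abc4b2a76b9719d911017c592"));
        System.out.println(sameAs("/data/a.bin", 5,
                "5d41402abc4b2a76b9719d911017c592")); // true
        System.out.println(sameAs("/data/a.bin", 6, "deadbeef")); // false
    }
}
```

With checksums precomputed at save time, as Dieter suggested, the 1.5-million-file comparison becomes a lookup-and-compare over small strings instead of 1.5 million full file reads.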
 