
Fastest way of calculating MD5

 
adi arrab
Greenhorn
Posts: 8
I want to calculate MD5 values for large files. I am using the following code:

public String checker(File fi) throws NoSuchAlgorithmException,
        FileNotFoundException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    StopWatch stopWatch = new StopWatch();
    InputStream is = new FileInputStream(fi);
    byte[] buffer = new byte[4000];
    int read;
    try {
        stopWatch.start();
        // Feed the file through the digest one buffer-load at a time.
        while ((read = is.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        byte[] md5sum = md.digest();
        // Pad to 32 hex digits: BigInteger.toString(16) would silently
        // drop leading zeros from the hash.
        String output = String.format("%032x", new BigInteger(1, md5sum));
        System.out.println("MD5 : " + output);
        stopWatch.stop();
        System.out.println("MD5 Time taken: " + stopWatch.getTime());
        return output;
    } catch (IOException e) {
        throw new RuntimeException("Unable to process file for MD5", e);
    } finally {
        try {
            is.close();
        } catch (IOException e) {
            throw new RuntimeException(
                    "Unable to close input stream for MD5 calculation", e);
        }
    }
}

For a million-record file it takes 55 seconds.


How can I increase the performance (i.e. decrease the processing time)?

Any suggestions or code would help.

Thanks, take care.
 
Joe Ess
Bartender
Posts: 9406
Processing a million of anything will take some time.
How long does it take to run the file through md5sum? I'd think that's pretty much as fast as you'll get.
 
steve souza
Ranch Hand
Posts: 862
How much time is the IO taking, and how much time is the digest taking? You should time the IO separately and see how long it takes on its own. I am not that familiar with IO in Java, but you should make sure your IO is buffered, and also ensure that you are using the fastest IO classes available.

Also, I'm not sure what you are doing with the message digest once you have it. If you simply want to compare the file to other files you receive, there may be cheaper tests you can run on the input files first, computing the message digest only when those tests don't settle the question (for example: do the byte counts match? do the leading rows match the original?).
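
A minimal sketch of that separate IO measurement, assuming the same File fi and the 4000-byte buffer from the original post:

import java.io.*;

// Hypothetical helper: time the read loop alone, with no digest work.
public static long timeRawRead(File fi) throws IOException {
    long start = System.currentTimeMillis();
    InputStream is = new FileInputStream(fi);
    try {
        byte[] buffer = new byte[4000];   // same buffer size as the original code
        while (is.read(buffer) != -1) {
            // discard the bytes; whatever time this loop takes is pure IO
        }
    } finally {
        is.close();
    }
    return System.currentTimeMillis() - start;
}

Comparing this number against the full 55 seconds shows how much of the time is IO and how much is the digest.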
 
Ilja Preuss
author
Sheriff
Posts: 14112
Yes, try wrapping your FileInputStream in a BufferedInputStream.
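
For example (a minimal sketch; BufferedInputStream defaults to an 8 KB internal buffer, and a second constructor argument selects another size):

// The only change to the original code: serve reads from an internal buffer.
InputStream is = new BufferedInputStream(new FileInputStream(fi));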
 
Pat Farrell
Rancher
Posts: 4678
You can just comment out the actual MD5 calculation; what's left is the time it takes to read the file.

In general, MD5 and SHA can be calculated much faster than the IO can deliver the data.

Follow the buffering suggestions mentioned upthread.
 
Peter Chase
Ranch Hand
Posts: 1970
Recommending buffering the input stream seems wrong here, to me. The input is already being read in biggish chunks, using a 4000-byte buffer.

Experimenting with the size of this buffer would make more sense. I would hazard a guess that a bigger buffer might give slightly better performance. But measurement is the only way to know for sure.

Putting an additional buffer in the way, as with BufferedInputStream, seems to me unlikely to help. More likely, it will make it very slightly slower.
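
A rough sketch of such a measurement, reusing the digest loop from the original post; the sizes here are arbitrary picks for illustration (the original 4000 plus two larger ones):

import java.io.*;
import java.security.*;

// Hypothetical harness: digest the same file with several buffer sizes.
public static void compareBufferSizes(File fi) throws Exception {
    int[] sizes = { 4000, 16384, 65536 };
    for (int size : sizes) {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream is = new FileInputStream(fi);
        long start = System.currentTimeMillis();
        try {
            byte[] buffer = new byte[size];
            int read;
            while ((read = is.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            md.digest();
        } finally {
            is.close();
        }
        System.out.println(size + "-byte buffer: "
                + (System.currentTimeMillis() - start) + " ms");
    }
}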
 
rajesh bala
Ranch Hand
Posts: 66
You mentioned you were reading a million records. Even if a record is only around 0.5 KB, that is almost 500 MB.

So I tried reading a 500 MB file, and adding a simple BufferedInputStream with a 16 KB buffer brought the response time down by a third.

InputStream is = new BufferedInputStream(new FileInputStream(fi), 16000);
byte[] buffer = new byte[16000];

~Rajesh.B
 
Joe Ess
Bartender
Posts: 9406
Originally posted by rajesh bala:
So I tried reading a 500 MB file, and adding a simple BufferedInputStream with a 16 KB buffer brought the response time down by a third.

Did you try just reading 16k at a time from a plain InputStream?
My guess is the times would be pretty close. As Peter suggested earlier, the performance improvement comes from the size of the buffer; all BufferedInputStream adds here is a duplicate layer of chunked reading.
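
The comparison being suggested is a two-line change to rajesh's snippet (a sketch only; 16384 bytes standing in for "16 KB"):

// Plain FileInputStream with a 16 KB read buffer, no BufferedInputStream.
InputStream is = new FileInputStream(fi);
byte[] buffer = new byte[16384];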
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
I suggest you use a buffer size that matches the file system allocation blocks, just to simplify. These are always powers of two; try 4096, for example.

Timing tests to determine the optimum get tricky because the operating system, and possibly the hard drive itself, will be buffering large chunks of data.
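
One way to see that cache effect is simply to repeat the measurement; the first (cold-cache) run usually stands out. A sketch, assuming the checker method from the first post:

// Repeat the timing: run 1 reads from disk, later runs are served
// largely from the operating system's file cache.
for (int run = 1; run <= 5; run++) {
    long start = System.currentTimeMillis();
    checker(fi);   // the MD5 method from the first post
    System.out.println("run " + run + ": "
            + (System.currentTimeMillis() - start) + " ms");
}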

Bill
 
Michael Bond
Greenhorn
Posts: 4
I've actually heard of a way faster than 55 seconds.
 