
Duplicate file checking (content of the file, not the name!)

 
Biswajit Paria
Ranch Hand
Posts: 46
Hi All,
Can you please help me find a solution for checking duplicate file content?
i.e. if a file called one.txt and another file called two.txt contain the same data, how can we detect that?
Any algorithm? Any suggestion?
Please respond.

Regards,
Biswajit.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
I'd probably use a stream of some sort to read byte arrays and compare one byte at a time. Check that the file length is equal first.

If you want to ignore the difference between Unix & Windows newlines you could use a reader instead of a stream and read lines, or just skip all \n and \r when comparing bytes. If you ignore these, the length may not match.

If high speed is a requirement, try both and see which is faster in your own environment. JDK version, OS, disk hardware may make a difference.
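The approach above can be sketched in plain java.io. This is a minimal illustration, not code from this thread; the class and method names are made up for the example. It checks the file lengths first, then streams both files and compares one byte at a time:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileComparator {

    // Returns true if both files contain exactly the same bytes.
    public static boolean sameContent(File a, File b) throws IOException {
        // Cheap test first: different lengths mean different content.
        if (a.length() != b.length()) {
            return false;
        }
        InputStream in1 = null;
        InputStream in2 = null;
        try {
            in1 = new BufferedInputStream(new FileInputStream(a));
            in2 = new BufferedInputStream(new FileInputStream(b));
            int c1;
            int c2;
            do {
                c1 = in1.read();
                c2 = in2.read();
                if (c1 != c2) {
                    return false;  // first mismatching byte
                }
            } while (c1 != -1);    // -1 means both streams ended together
            return true;
        } finally {
            if (in1 != null) in1.close();
            if (in2 != null) in2.close();
        }
    }
}
```

The BufferedInputStream matters here: without it, each read() would be a separate call to the underlying file stream, which is very slow for byte-at-a-time reads.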
 
Biswajit Paria
Ranch Hand
Posts: 46
Thanks, Stan James.
Now I will describe the scenario in which I have to check for duplicate files.
I upload a file on the client side (Internet Explorer) and send it to the server; later, when I upload any file that contains the same data, the server should apply some validation so that the newly uploaded file cannot be processed further.

One solution that I have found does not seem appropriate.
The solution is as follows.
I generate a hash key for the uploaded file with a one-way hash algorithm and store the hash key in the database; the next time I upload a file, I generate its hash key and compare it against all the hash keys stored in the database to confirm that no previously uploaded file contains the same data.
But this seems inappropriate, as two different files may generate the same hash key.

So can anyone please give some suggestions on this issue?

Regards,
Biswajit
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I believe using a hash as you describe is appropriate, but it's not the complete solution. When you upload a new file, you need to check it against all the existing files to see if it duplicates any of them. Using a hash lets you quickly and efficiently eliminate the vast majority of files without having to reread them and compare bytes one at a time. If two files have different hashes, they are different, period. However, if two hashes are identical, you then have to compare the bytes of those two files. I would probably use NIO for efficiency:
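(The code example that originally followed this post has not been preserved. A minimal sketch of an NIO-based comparison in the spirit described, with hypothetical class and method names, might look like this: read both files through FileChannels into ByteBuffers and compare buffer contents chunk by chunk.)

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class NioCompare {

    // Fills the buffer from the channel until it is full or EOF is reached;
    // returns the number of bytes read.
    private static int readFully(FileChannel ch, ByteBuffer buf) throws IOException {
        int total = 0;
        while (buf.hasRemaining()) {
            int n = ch.read(buf);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }

    // Returns true if both files contain exactly the same bytes,
    // comparing them 8 KB at a time through NIO channels.
    public static boolean sameContent(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        FileChannel ca = null;
        FileChannel cb = null;
        try {
            ca = new FileInputStream(a).getChannel();
            cb = new FileInputStream(b).getChannel();
            ByteBuffer ba = ByteBuffer.allocate(8192);
            ByteBuffer bb = ByteBuffer.allocate(8192);
            while (true) {
                ba.clear();
                bb.clear();
                int na = readFully(ca, ba);
                int nb = readFully(cb, bb);
                if (na != nb) {
                    return false;       // should not happen when lengths match
                }
                if (na == 0) {
                    return true;        // both files exhausted, no mismatch found
                }
                ba.flip();
                bb.flip();
                if (!ba.equals(bb)) {   // compares the bytes between position and limit
                    return false;
                }
            }
        } finally {
            if (ca != null) ca.close();
            if (cb != null) cb.close();
        }
    }
}
```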

[ November 18, 2004: Message edited by: Jim Yingst ]
 
Joseph Maddison
Ranch Hand
Posts: 53
This is what MD5 (Message Digest) is for. See http://java.sun.com/j2se/1.4.2/docs/api/java/security/MessageDigest.html for more details, but in a nutshell, it generates a 16-byte array based on the contents of the data passed into it. With 2^128 possible digest values, the chance of two different files accidentally producing the same MD5 is astronomically small.

If you are working with a small set of files, a Hashtable (or better - a HashMap) should work just fine, using the digest as the key and the file (or its name) as the value.

Hope this helps,
Joseph

P.S. Be sure to convert the array of bytes to a String and use that for the key, so that the Hashtable's hashing works properly.
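Computing the digest and converting it to a string key, as suggested above, can be sketched like this. The class and method names are made up for the example; the digest is rendered as a hex string, which is convenient both as a map key and as a database column value:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Util {

    // Computes the MD5 digest of a file and returns it as a 32-character
    // hex string, suitable for use as a HashMap key or database value.
    public static String md5Hex(String path)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);   // feed the file through the digest
            }
        } finally {
            in.close();
        }
        byte[] digest = md.digest();    // 16 bytes for MD5
        StringBuilder sb = new StringBuilder(32);
        for (int i = 0; i < digest.length; i++) {
            sb.append(String.format("%02x", digest[i]));
        }
        return sb.toString();
    }
}
```

Typical usage for the upload scenario would be `map.put(Md5Util.md5Hex(path), path)` on each upload, rejecting the file when the key is already present; on a key collision, fall back to a byte-by-byte comparison as described earlier in the thread.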

[ December 03, 2004: Message edited by: Joseph Maddison ]
 