
Duplicate Files Remover

 
Greenhorn
Posts: 21
First of all, LOL at this forum.


Anyway, I'm interested in GUI tweaking, mainly user-friendliness and related principles.


I need to demonstrate this with a duplicate-file remover/scanner program.


I've done some work on it. Some information you need to know if you want to help me:
The program needs to be a Java application.
It will run in a JRE.
It's a standalone program.

OK, let me go through the intended or expected end-user phases:
User loads the program; it comes on screen.
User clicks Scan; the application then scans the local disk.
Duplicate files are deleted automatically (and hopefully displayed).

Please let me in on some information about how to better incorporate an MD5 checksum into a GUI.

I'd ideally like to learn from an example or snippet; nothing too complicated please, because I want to know exactly what each chunk of code does.

thanks

 
Sheriff
Posts: 22781
131
Eclipse IDE Spring VI Editor Chrome Java Windows
The first thing is determining how you can figure out file equality. A basic approach is to check the file length first; if the lengths are equal, compare the full contents.

You can use File.listFiles for browsing through your hard disk.


Now I've done something similar, and here's my approach:
- use a Map<Long, List<File>> that stores the unique files per file size.
- when you encounter a file, get the List<File> for its length.
- compare the file with all elements of that List<File> (if it's not null).
- if it is equal to any of them, process it; otherwise add it to the list (creating the list first if it was null).

The latter part in pseudo code:
 
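A minimal Java sketch of those steps (assuming Java 12+, whose Files.mismatch handles the full-content comparison; the class and method names here are mine, not from the post):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateFinder {
    // Unique files seen so far, grouped by file length.
    private final Map<Long, List<File>> filesBySize = new HashMap<>();

    /** Returns true if an identical file was seen before; otherwise records this one. */
    public boolean isDuplicate(File file) throws IOException {
        List<File> sameSize =
                filesBySize.computeIfAbsent(file.length(), length -> new ArrayList<>());
        for (File candidate : sameSize) {
            // Same length, so compare the full contents; -1 means no mismatch found.
            if (Files.mismatch(candidate.toPath(), file.toPath()) == -1) {
                return true; // duplicate: the caller can delete or report it
            }
        }
        sameSize.add(file); // first file seen with this exact content
        return false;
    }
}
```

The length-keyed map means the expensive content comparison only ever runs against files that already match in size.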
Author and all-around good cowpoke
Posts: 13078
6

Please let me in on some information about how to better incorporate a Md5 checksum to a GUI



I don't see what the use of MD5 or any other checksum to determine file equality has to do with a GUI. Perhaps you could have a dialog that gives a choice of equality-checking methods, but that's about it. Surely nobody needs to see the numeric value.

Bill
 
Bartender
Posts: 1166
17
Netbeans IDE Java Linux
Many years ago I wrote a program to deal with duplicate files on a disk, and I still use it. My basic approach is to process a collection of 'roots' that will be scanned looking for duplicates. Starting at the roots, I recursively visit the file tree, creating a map that uses the SHA-1 digest (MD5 will do just as well) of each file's content as the key, with a Set of file names as the value. Files with the same content will produce the same digest. Of course, it is possible that two files with different content will also have the same digest, but in the 10 years or so I have used the program it has never found two different files with the same digest.
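A hedged sketch of that digest-keyed scan in modern Java (Java 17+ for HexFormat and Stream.toList; all names are mine, not the poster's program):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

public class DigestIndex {
    /** Recursively maps each content digest to the set of files that share it. */
    public static Map<String, Set<Path>> index(Path root) throws IOException {
        Map<String, Set<Path>> byDigest = new HashMap<>();
        try (Stream<Path> tree = Files.walk(root)) {
            for (Path file : tree.filter(Files::isRegularFile).toList()) {
                byDigest.computeIfAbsent(digest(file), d -> new TreeSet<>()).add(file);
            }
        }
        return byDigest; // entries whose Set has more than one member are duplicates
    }

    private static String digest(Path file) throws IOException {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            return HexFormat.of().formatHex(sha1.digest(Files.readAllBytes(file)));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("every JDK ships SHA-1", e);
        }
    }
}
```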

In my first version I too wanted to automatically delete duplicates, but I soon found that there are difficulties with doing this. First, given two or more files that have the same content, which one(s) do you delete? Second, especially with HTML, it is frequently better to have duplicate files than to try to cross-link the HTML sources. To get round these problems I allow the user to specify the minimum file length to consider; the user is then presented with a list of duplicate files and selects which one(s), if any, to delete.

Obviously, taking the SHA-1 digest means one in principle has to read the whole file. I found this to be unnecessary and ended up taking the digest of only the first 1,000,000 bytes. I do have a paranoid check: if I find two files with the same digest, I then check for absolute equality of the whole of the file content. This content comparison normally takes very little time, since one first checks that the lengths match and reads the contents only if they do.
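A sketch of that partial digest under the same assumptions (SHA-1 via MessageDigest, HexFormat from Java 17+; the one-million-byte cap comes from the post, the class and method names are mine):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class PartialDigest {
    private static final int LIMIT = 1_000_000; // hash only the first million bytes

    /** SHA-1 of at most the first LIMIT bytes of the file. */
    public static String firstBytesDigest(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] buffer = new byte[8192];
            int remaining = LIMIT;
            int read;
            while (remaining > 0
                    && (read = in.read(buffer, 0, Math.min(buffer.length, remaining))) != -1) {
                sha1.update(buffer, 0, read);
                remaining -= read;
            }
            return HexFormat.of().formatHex(sha1.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}
```

Two large files that differ only after the first million bytes will collide here, which is exactly why the "paranoid" full-content check is still needed.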

The GUI is not complicated: just a file-selection system to select the roots, a JComboBox size selector, and a JTable to present the results. On selecting a duplicate, one is given the option to delete it or move it to a backup area.

 
Saloon Keeper
Posts: 10687
85
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
Basing your comparison on checksums has two problems: 1) it is possible (though unlikely) that two files with the same checksum are not identical, and 2) computing checksums can take far longer than doing a byte-by-byte comparison, because a byte-by-byte comparison can bail out as soon as two bytes differ; it doesn't (usually) need to read the entire file like the checksum approach would.
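A minimal sketch of that early-exit comparison (plain java.io; on Java 12+ the one-liner Files.mismatch(a, b) == -1 does the same job):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ByteCompare {
    /** True if both files have identical contents; stops at the first difference. */
    public static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // cheapest check first: different lengths can never match
        }
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false; // bail out on the first mismatching byte
                }
            }
            return true; // same length and no mismatches: identical
        }
    }
}
```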

I created two utilities for myself: one where you specify two directory roots (one for comparison and one for deletion), and another that takes a list of one or more roots but presents the list of duplicates to the user to identify which ones to delete.

One of the GUI parameters in both programs is the path requirement: (none) don't care what the path is; (file) the file names must match; (dir/file) both the file name and its parent directory name must match; and so on.
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Carey Brown wrote:Basing your comparison on checksums has two problems: 1) it is possible (though remotely) that two files with the same checksum are not identical, and 2) computing checksums takes far longer than doing a byte-by-byte comparison...


Actually, not by much, since they generally only involve maths or binary operations on the bytes/characters in sequence, which is likely to be far quicker than actually reading them.

And I'd add a third problem: It's tough to know if anyone else is updating one of your files while you're doing the check. If Java has that sort of capability, I've never heard about it - or used it.

It may also be worth pointing out that this thread is from 2009; so I suspect Elvis has left the building.

Winston
 
Rancher
Posts: 1044
6
Some time ago I also wrote such a program, and I'm still using it.
Mine has no GUI but a CUI: it takes the directory names to peruse as arguments.
It uses an MD5 digest, and before deleting a file it does a byte-for-byte check against the one to keep.

The question of which one to keep arose naturally.
The program keeps the files on the file system Z:, which is the NAS device mounted on all the computers in my modest local network,
and favors keeping the file with the longer path name, that is, the one in a more elaborate directory.

Once, wifey's computer's hard disk became defective and unusable by the operating system, and good old TestDisk and PhotoRec
(merci beaucoup, Christophe GRENIER!) saved the files themselves, but without the directory structure.
That's when I learned that she had moved the files into an elaborate directory scheme, the loss of which was painful for her,
even though the files were rescued. The whole operation took several days (including nights for the program to run).
 