No BSD License  

Rating: 5.0 (1 rating) | Downloads (last 30 days): 10 | File Size: 4.28 KB | File ID: #5393

Using the MD5 Hash for Duplicate File Deletion

by Michael Kleder

 

01 Jul 2004 (Updated 15 Dec 2004)

This function uses an MD5 hash to rapidly detect and delete duplicate files in a directory.


File Information
Description

This function rapidly compares large numbers of files for identical content by computing the MD5 hash of each file and detecting duplicate hashes. The probability of two non-identical files having the same MD5 hash, even in a hypothetical directory containing as many as a million files, is exceedingly remote: if the 128-bit hashes behave like uniformly random values, the birthday bound puts it at roughly 1.5e-27, provided the files were not deliberately constructed to collide. Since hashes rather than file contents are compared, the process of detecting duplicates is greatly accelerated.
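
The idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the MATLAB code of this submission: the names md5_of_file and find_duplicates are hypothetical, and the sketch only looks at files directly inside one directory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Group the regular files directly inside `directory` by MD5 digest.

    Returns a list of groups. Each group is a list of paths (in name order)
    that share a digest and has at least two members. Nothing is deleted.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            by_digest[md5_of_file(path)].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Each file is read once and only the 32-character digests are compared, which is where the speedup over pairwise content comparison comes from.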

You must have the file md5DLL.dll on your MATLAB path to use this function. That DLL is available on the MATLAB Central File Exchange as file #3784 and was written by Hans-Peter Suter. The URL for the download site is:
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=3784&objectType=file

This function is intended for MS Windows operating systems. MATLAB requires on the order of 0.1 seconds to execute an operating system command to delete each file, but it can quickly write an operating system batch file and run it once, which performs the file deletions much faster. Since I use MATLAB on a Windows PC, this function creates a batch file for that platform. Furthermore, the md5DLL.dll file is specific to Windows.
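
The batch-file approach can be illustrated roughly as follows. This is a sketch in Python rather than the submission's MATLAB code; the function name write_delete_batch, the default file name delete_duplicates.bat and the use of `del /q` are assumptions made for the example, and the batch file is only written, not run.

```python
def write_delete_batch(paths, batch_path="delete_duplicates.bat"):
    """Write a Windows batch file containing one `del` command per path.

    Assumptions of this sketch: `%` is doubled because a single `%` starts
    variable expansion inside a batch file, and paths containing a double
    quote are rejected because they cannot be quoted safely.
    """
    lines = ["@echo off"]
    for path in paths:
        text = str(path)
        if '"' in text:
            raise ValueError(f"cannot quote path safely: {text!r}")
        lines.append('del /q "{}"'.format(text.replace("%", "%%")))
    # newline="\r\n" makes Python write Windows line endings on any platform.
    with open(batch_path, "w", newline="\r\n") as handle:
        handle.write("\n".join(lines) + "\n")
    return batch_path
```

Running the resulting file once replaces one operating system call per file with a single call, which is the saving described above.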

Acknowledgements
This submission has inspired the following:
Directory Traversal and Duplicate File Deletion using the SHA-256 Hash
MATLAB release: MATLAB 5.3.1 (R11.1)
Other requirements: Windows Operating Systems
Comments and Ratings (2)
05 Nov 2005 A. Belani

-It may help to mention in both the webpage and the function description that files with the same size are checked first, before their hashes are computed.
-It may help to optionally just provide a list of the duplicates along with the original files, instead of deleting them.
-It may help to provide an option where the duplicate file with the older date gets deleted.
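
The three suggestions above could look something like the following sketch. It is written in Python for illustration and is not part of the submission: the name plan_duplicates is hypothetical, "older" is taken to mean the older modification time, and each candidate file is read into memory in one go for brevity.

```python
import hashlib
import os
from collections import defaultdict


def plan_duplicates(directory):
    """Return a list of (keep, duplicates) pairs without deleting anything.

    Files are grouped by size first, so a file whose size is unique in the
    directory is never hashed. Within each set of identical files the most
    recently modified one is kept and the older ones are listed.
    """
    by_size = defaultdict(list)
    for entry in os.scandir(directory):
        if entry.is_file():
            by_size[entry.stat().st_size].append(entry.path)

    by_content = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size, cannot have a duplicate
        for path in paths:
            with open(path, "rb") as handle:
                digest = hashlib.md5(handle.read()).hexdigest()
            by_content[(size, digest)].append(path)

    plan = []
    for paths in by_content.values():
        if len(paths) > 1:
            newest_first = sorted(paths, key=os.path.getmtime, reverse=True)
            plan.append((newest_first[0], newest_first[1:]))
    return plan
```

The caller can print the returned plan as a report, or pass the listed duplicates to whatever deletion step it prefers.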

23 Aug 2008 Jakob Kleinbach

Thanks for this solution, which I use more and more often.

I'd like to share another favourite to retrieve duplicate files (via their MD5 sum) on the DOS command line:

First, you need these helpers from the GNU GnuWin32 Packages: find, md5sum, sort, uniq, sed, xargs (http://gnuwin32.sourceforge.net/packages.html). Put these somewhere, e.g. C:\msys\bin.

Secondly, create a *.bat file with the following:
@echo off
if "%1"=="" goto Usage
set MSYSDIR=C:\msys\bin
PUSHD %MSYSDIR%
.\find %1 -type f -print0 | .\xargs -0 -n1 .\md5sum | .\sort -k 1,32 | .\uniq -w 32 -d -D | .\sed -r 's/^[\\][0-9a-f]*( )*(\*)/\.\/find "/;s/[\]+/\\\/g;s/\//\\\/g;s/$/" -printf "%%s\\\t%%f"/' | cmd.exe /Q /K | .\sed -r 's/.*^>//' | .\sort -n
POPD
goto End
:Usage
echo Usage: %0 PATH
echo Path must not contain blanks
:End

Assuming your batch file is named fd.bat, you can check c:\windows recursively for duplicate files with

C:\> fd.bat c:\windows

This is adapted from some bash scripts I found and works pretty well for me.
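
For readers unfamiliar with the `sort | uniq -w 32 -d -D` stage of the pipeline above, here is a rough Python equivalent. It is only an illustration of that one stage, with a hypothetical function name, and assumes each input line starts with a 32-character MD5 digest as printed by md5sum.

```python
from collections import defaultdict


def repeated_digest_lines(md5sum_lines, width=32):
    """Return, in sorted order, every line whose first `width` characters
    also begin at least one other line.

    This mirrors sorting the md5sum output and then keeping all members of
    each group of lines that share the same leading digest.
    """
    groups = defaultdict(list)
    for line in md5sum_lines:
        groups[line[:width]].append(line)
    result = []
    for key in sorted(groups):
        if len(groups[key]) > 1:
            result.extend(sorted(groups[key]))
    return result
```

Lines with a digest that occurs only once are dropped, so what remains is exactly the set of files that have at least one duplicate.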

Updates
15 Dec 2004

Fixed a bug caught by Peter Blumen. Thanks, Peter!

