DataHash

version 1.7.1 (12 KB) by Jan
MD5 or SHA hash for array, struct, cell or file

67 Downloads

Updated 19 May 2019

DATAHASH - Hash for Matlab array, struct, cell or file
Hash = DataHash(Data, Opts, ...)
Data: Array of built-in types (U)INT8/16/32/64, SINGLE, DOUBLE (real or complex),
      CHAR, LOGICAL, CELL, STRUCT (scalar or array, nested), function_handle.
Options: List of char vectors:
  Hashing method: 'SHA-1', 'SHA-256', 'SHA-384', 'SHA-512', 'MD2', 'MD5'.
  Output format: 'hex', 'HEX', 'double', 'uint8', 'base64'
  Input type:
    'array': The contents, type and size of the input [Data] are
             considered for the creation of the hash. Nested CELLs
             and STRUCT arrays are parsed recursively. Empty arrays of
             different types yield different hashes.
    'file':  [Data] is treated as a file name and the hash is calculated
             for the file's contents.
    'bin':   [Data] is a numerical, LOGICAL or CHAR array. Only the
             binary contents of the array are considered, such that
             e.g. empty arrays of different types yield the same hash.
    'ascii': Same as 'bin', but only the 8-bit ASCII part of the 16-bit
             Matlab CHARs is considered.
Hash: String or numeric vector.
EXAMPLES:
Default: MD5, hex:
DataHash([]) % 7de5637fd217d0e44e0082f4d79b3e73

SHA-1, Base64:
S.a = uint8([]);
S.b = {{1:10}, struct('q', uint64(415))};
DataHash(S, 'base64', 'SHA-1') % ZMe4eUAp0G9TDrvSW0/Qc0gQ9/A

Comparison with standard hash programs using ASCII strings:
DataHash('abc', 'SHA-256', 'ascii')

Michael Kleder's "Compute Hash" works similarly, but does not accept structs, cells or files:
http://www.mathworks.com/matlabcentral/fileexchange/8944

"GetMD5" is 2 to 100 times faster, but it returns only MD5 hashes and a C compiler is required:
http://www.mathworks.com/matlabcentral/fileexchange/25921

Tested: Matlab 7.7, 7.8, 7.13, 8.6, 9.1, 9.5, Win7&10/64, Java: 1.3, 1.6, 1.7

Bug reports and enhancement requests are welcome. Feel free to ask me about a version for Matlab 6.5.

PS. MD5 and SHA-1 are considered "broken": it is possible to construct different data sets that share the same hash. But for checking the integrity of files or for recognizing a set of variables, both methods are reliable.

Cite As

Jan (2019). DataHash (https://www.mathworks.com/matlabcentral/fileexchange/31272-datahash), MATLAB Central File Exchange. Retrieved .

Comments and Ratings (69)

Jan

@mschiffn: Thanks for this useful idea. As with ISSTRING, the command ISENUM() was introduced only in R2015a and is therefore not backward compatible. In addition, it is not clear whether the element 'on' of the enumeration {'on', 'off'} should get the same hash as 'on' of the enumeration {'on', 'in', 'an'}. Therefore CELLSTR might not be the desired behavior for all users.

mschiffn

Excellent work and very useful, Jan!

I noticed, however, that the function does not parse enumerations correctly.
For example:

classdef status
    enumeration
        on, off
    end
end

result_on = DataHash(status.on);
result_off = DataHash(status.off);

Here, result_on and result_off are identical.

As a fix, I added the lines

elseif isenum(Data)
Engine = CoreHash(cellstr(Data), Engine);

to the function CoreHash.

mschiffn

Jan

Thanks, zmi zmi. This bug is fixed now.

zmi zmi

good job.

Minor note: hasGetMD5 is not defined in uTest_DataHash if GetMD5 does not exist.

i.e. the code should be changed to

if ~isempty(getmd5_M) && ~isempty(getmd5_X)
    Str = fileread(getmd5_M);
    hasGetMD5 = any(strfind(Str, 'Author: Jan Simon'));
else
    hasGetMD5 = false;
end

Jan

@Ondrei Tichacek: It depends on the needs of the user whether two anonymous functions with different workspaces should get different or equal hash values. If you need equal hashes, either modify the subfunction ConvertFuncHandle(), or provide a cleaned input manually:
function H = DataHashAnonFcn(f)
    fClean = functions(f);
    fClean.workspace = []; % Or: fClean = rmfield(fClean, 'workspace');
    H = DataHash(fClean);
end

Or:
function FuncKey = ConvertFuncHandle(FuncH)
    FuncKey = functions(FuncH);
    if strcmpi(FuncKey.type, 'anonymous')
        FuncKey.workspace = [];
    end
    ... etc

Whether the field ".file" should be considered for anonymous functions is also questionable. I assume that hashes for (anonymous) function handles are hard to define in a way that satisfies different use cases.

Great contribution. The anonymous functions don't always behave as one would expect (I know they are extremely difficult to implement). Consider the following example

>> A = struct('a', 1, 'b', 0); f = @(x) A.a * x; DataHash(f)

ans =

'b0b356350494d93b62eea35415a96633'

>> A = struct('a', 1, 'b', 1); f = @(x) A.a * x; DataHash(f)

ans =

'fd662728458e7b5895a69cf47ea17a14'

It happens because the whole structure A is in the anonymous function workspace and therefore gets hashed with rest of the information. I wonder if there is a clever way how to (recursively) filter out unneeded fields of the structure.

Keith Ma

Thanks for this contribution - cryptographic hashes should definitely be a core MATLAB feature!

Jan

@Mohammad Asadi: The hash functions SHA-x and MDx cannot be unique if the bit size of the original data exceeds that of the hash. Therefore, by design, the hashing cannot be reverted in general. It is possible to revert the hashing for small inputs, but because this can be used to crack passwords, I will not support this in any way.

You have done a great job. Just one question, please: what about the reverse process of getting the data back?

ruilin guo

Thank you very much!
From china!

Jan

Ian Harrison suggested isstring() to check for strings. Unfortunately this function did not exist in older versions of Matlab, so using it would break backward compatibility. Therefore DataHash contains a dedicated function "myIsString()", which calls isstring() when it exists, and returns FALSE for all Matlab versions before R2016b.
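
A minimal sketch of such a backward-compatible check might look like the following. The variable names and the use of which() for detection are assumptions made for illustration; the actual myIsString() in DataHash may be implemented differently.

```matlab
% Hypothetical sketch, not the code shipped with DataHash.
% Detect whether this Matlab release provides isstring().
hasIsString = ~isempty(which('isstring'));

S = 'plain char vector';        % example input
if hasIsString
    T = isstring(S);            % true only for objects of the string class
else
    T = false;                  % older releases have no string class at all
end
```

For a char vector as in this example, T is false on every release, because a char vector is not a string object.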

Jan

@Ledoux Laboratory: DataHash is case-sensitive in the fields of the input Opt. You have to define the field 'Input' with an uppercase I to define the mode:
fileMethod = struct('Input','file'); % not 'input'
arrayMethod= struct('Input','array');
Now both comparisons are true.
I take this as an enhancement request, and the version published today is no longer case-sensitive in the Opt struct. In addition, the String class now works as well.
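
A case-insensitive lookup of an option field could be sketched as follows. This is only an illustration under stated assumptions: the variable names are hypothetical, and the fallback to 'array' is assumed here rather than taken from the DataHash source.

```matlab
% Hypothetical sketch: find the option 'Input' regardless of its case.
Opt   = struct('input', 'file');          % lowercase field name, as in the report
Names = fieldnames(Opt);
Match = strcmpi(Names, 'Input');          % case-insensitive comparison
if any(Match)
    InputMode = Opt.(Names{find(Match, 1)});
else
    InputMode = 'array';                  % assumed fallback for this sketch
end
```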

Hello:

There appears to be an issue with the 'file' input mode-- files with the exact same contents appear to give different hashes. Consider the following code:

%% create some random data
randomness = char( 255*rand(1,1e4))
F1 = 'file1.txt'; F2 = 'file2.txt'
% write this data to two files
fid=fopen(F1,'wt'); fwrite(fid, randomness); fclose(fid);
fid=fopen(F2,'wt'); fwrite(fid, randomness); fclose(fid);
fileMethod = struct('input','file');
arrayMethod= struct('input','array');
% compare hashes using 'array' input mode-- this returns true as expected, since file contents are same
isequal( DataHash( fileread(F1),arrayMethod), DataHash( fileread(F2), arrayMethod)) % returns true, since file contents are same
% compare hashes using 'file' input mode -- this returns false, but I'm not sure why
isequal( DataHash(F1, fileMethod), DataHash(F2, fileMethod))

Based on the function documentation, the 'file' mode only considers the contents of the file, not the metadata (filename, modification date, etc.), correct? Or am I misinterpreting how the 'file' mode is meant to be used?

Thanks

thank you!

Will Roe

Very useful!

Upvote for Ian Harrison's post below. I ran into that issue when I switched from 2016a to 2018a.

Will Roe

Justin

Firstly, I love this function.

Secondly, I've added the following to the multi if in the function Engine = CoreHash(Data, Engine)

elseif isstring(Data)
Engine = CoreHash(char(Data), Engine); %if string type cast to char and rerun

I've put it between the elseif isnumeric(Data) and the elseif islogical(Data).

The reason was that I have a struct that contains empty strings. These fell through to the final else and were reported by the try/catch as a BadDataType.

This has fixed my issue, just thought I'd let you know :D

Thank you Jan Simon

Kai Zhang

Chris

Thank you very much.

Keith Ma

Thanks for this submission.

It just works, and the author is also quick and competent in user support. Thanks!

Jan

@Christian: Thank you for finding this bug. It is fixed now.

Christian

Hey this is a great function! I've run into a problem though. I found if the file is empty (size of 0 bytes), it throws an error:

Error using java.security.MessageDigest$Delegate/update
A Non-scalar value was passed for a scalar argument

Error in DataHash (line 231)
Engine.update(Data);

Perhaps it should return whatever the hash is for an empty stream or an empty string?
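
The failure can be avoided by not calling update() with empty data; the digest of an engine that received no input is the hash of the empty message. The following is a minimal sketch using Java's MessageDigest directly, assuming a JVM is available. It is not the fix applied inside DataHash, and the file name is a temporary one created only for this example.

```matlab
% Hypothetical sketch: hash a file's bytes with an explicit guard for empty files.
File = [tempname, '.bin'];
fid  = fopen(File, 'w');                  % create a file of 0 bytes
fclose(fid);

fid  = fopen(File, 'r');
Data = fread(fid, Inf, '*uint8');         % empty uint8 array for an empty file
fclose(fid);
delete(File);

Engine = java.security.MessageDigest.getInstance('MD5');
if ~isempty(Data)                         % skip update() when there is nothing to feed
    Engine.update(Data);
end
Hash = sprintf('%02x', typecast(Engine.digest(), 'uint8'));
```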

Awesome function! It would be great if other encodings (such as base32 and base36) were supported as well. I implemented base32 as an extension of your base64 (though I am not 100% sure it is correct) but have no clue how to get base36.
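
For reference, unpadded Base32 with the RFC 4648 alphabet can be sketched in a few lines. This is an illustration only, not part of DataHash; the variable names are hypothetical and the trailing '=' padding of RFC 4648 is deliberately omitted.

```matlab
% Hypothetical sketch of unpadded RFC 4648 Base32 for a uint8 vector.
Bytes    = uint8('foo');                  % example input; a digest would go here
Alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

Bits   = reshape(dec2bin(double(Bytes(:)), 8).', 1, []) - '0';  % bit stream, MSB first
Bits   = [Bits, zeros(1, mod(-numel(Bits), 5))];                % pad to a multiple of 5 bits
Index  = [16, 8, 4, 2, 1] * reshape(Bits, 5, []);               % value of each 5-bit group
Base32 = Alphabet(Index + 1);                                   % 'MZXW6' for 'foo'
```

Base36 is not aligned to a whole number of bits per character, so this grouping approach does not carry over to it.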

Jan

@Uilke: Wow, this is a bug in the documentation. 240f... is the expected output. The results have been changed in the version submitted on 30-Mar-2015, but I forgot to update the help text accordingly.

I tried DataHash([]) on Win7 64-bit with Matlab 7.13 (R2011b) and 8.6 (R2015b), and with both I get 240f5f01f052bd89f38da2165dcf25c7 instead of the 7de5637fd217d0e44e0082f4d79b3e73 mentioned in the help.
Any ideas why?

Matt Raum

@Matt Raum: I found a workaround using the undocumented serialization function getByteStreamFromArray

DataHash(getByteStreamFromArray(MyClassObj))

gets me what I need.

Matt Raum

Jan -- Great work on this, very handy! I'd like to be able to use this to hash an instance of a custom class but I'm getting the error below:

Warning: Type of variable not considered: MyClass
> In DataHash>CoreHash_ at 336
In DataHash at 253

Can you help?

Matt Raum

Jan

@Ivan Cordon: The effect is explained in NOTES in the help section and was discussed at 26 May 2012:
To get the same result as in the online generator, only the binary contents of the data must be considered, not the Matlab class, and only the 8-bit ASCII part of the 16-bit Matlab CHAR:
Opt.Method='SHA-256'; Opt.Input='bin';
DataHash(uint8('ivan'), Opt)
Then you get cd0b9452... also.

Or a little bit shorter with the newest version:
Opt.Method='SHA-256'; Opt.Input='ascii';
DataHash('ivan', Opt)
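
The difference between the 8-bit and the 16-bit view of the same text can be made visible without DataHash. The sketch below feeds both byte streams to Java's MessageDigest, assuming a JVM is available. It does not try to reproduce the 'array' mode result, which additionally depends on type and size information.

```matlab
% Sketch: the same text yields different byte streams and hence different hashes.
Text    = 'ivan';
Bytes8  = uint8(Text);                        % 4 bytes: one per character
Bytes16 = typecast(uint16(Text), 'uint8');    % 8 bytes: Matlab's 16-bit CHARs

Engine = java.security.MessageDigest.getInstance('SHA-256');
Engine.update(Bytes8);
Hash8  = sprintf('%02x', typecast(Engine.digest(), 'uint8'));   % digest() resets the engine
Engine.update(Bytes16);
Hash16 = sprintf('%02x', typecast(Engine.digest(), 'uint8'));
% Hash8 corresponds to what a tool hashing the 4 ASCII bytes computes.
% Hash16 is a different value, because twice as many bytes were hashed.
```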

Ivan Cordon

Dear Jan,

The program works for me, but I seem to be getting a wrong result. I am passing the Opt struct with this parameter:

Method: 'SHA-256'

and calling the function as follows:

Hash = DataHash('ivan', Opt)

the result i am getting is:

53607f405e61c5b4bd7796d5ef1ce7cc123559094123c314726c68aac1e21dab

whereas checking on the different online generators (http://www.xorbin.com/tools/sha256-hash-calculator) i get this:

cd0b9452fc376fc4c35a60087b366f70d883fc901524daf1f122fbd319384f6a

Do you know why I am getting this difference?

Thanks!!

Jan

@Haitam: Do you want to obtain the clear text from the hash value? This is possible only if the input has not more bits than the hash value. For longer messages the result is not unique. And even for short messages the hashing algorithms are not designed for a reverse analysis. It is possible e.g. by a brute force attack, but this problem is not covered by my submission.

Haitham

Dear simon , the code is working great , i would like to know how i can recompute the hash value?
The data hash generated 558f68181d2b0c9d57d41ce7aa36b71d9
i would like to get the original value before hashing it.
thanks a lot...

Jan

@Haitham: Look at the examples for a valid method to call this function. The error message tells you that you did not provide input arguments, or too many of them. Check the command or contact me under the address given in the help section (THISYEAR is "2010", sorry, a typo).

Haitham

Dear jan
thanks for the file , but i am getting error while running the code it self.

Error using DataHash>Error_L (line 466)
*** DataHash: 1 or 2 inputs required.

Error in DataHash (line 177)
Error_L('BadNInput', '1 or 2 inputs required.');

Great submission, Jan. Consider putting the code on GitHub so that users can more easily comment and contribute.

Jan

Unfortunately there are some collisions (same hash for different data). E.g., the scalar 0 and the empty double array [1 x 1 x 0] get the same hash. I'm going to publish a new version which fixes these bugs, but produces different hashes as a consequence. In addition, sparse arrays are considered.

The submission CalcMD5 is updated also, such that (nested) cells and structs are considered also. Because the overhead of calling Java is omitted, this is dramatically faster than DataHash, I get a speedup factor > 100 for nested structs.
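
One plausible way a collision like the one between the scalar 0 and a [1 x 1 x 0] array can arise is a stream that lists the size followed by the data, without stating how many dimensions there are. The sketch below illustrates this idea only; it is an assumption, not a description of DataHash's earlier or current internals.

```matlab
% Hypothetical illustration of an ambiguous versus an unambiguous stream.
A = 0;                 % scalar double
B = zeros(1, 1, 0);    % empty double array of size [1 x 1 x 0]

% Naive stream: size followed by data. Both inputs give [1 1 0].
NaiveA = [size(A), A(:).'];
NaiveB = [size(B), B(:).'];

% Prepending the number of dimensions makes the streams distinguishable.
SafeA = [ndims(A), size(A), A(:).'];   % [2 1 1 0]
SafeB = [ndims(B), size(B), B(:).'];   % [3 1 1 0]
```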

Well written, good running time

Giorgio

Well written!

Clemens

Nath

A function handle referring to a struct that contains the function itself creates an infinite loop. Is there any workaround?

Example:

d= dynamicprops();
addprop(d,'f');
d.f= @(varargin) struct2cell(d);
DataHash(d.f) % infinite loop

Arindam Bose

Hi, does anyone have any idea how to decrypt an MD5 code?

Igor

David

Hi, I'm just wondering if this function works for p-files? Line 144-147 tells the function to stop if the file is not an m-file. When I comment these lines out, it seems to provide the correct hash for the p-file.

Jan

Thanks, Jan, a good idea. I will include this in the next version.

Great function! Just as a little improvement, I added support for Java and MATLAB objects by calling their hashCode function. Just insert

elseif (isobject(Data) || isjava(Data)) ...
      && ismethod(Data, 'hashCode')
   Engine = CoreHash(Data.hashCode, Engine);

into the CoreHash() and CoreHash_() functions (in the main if branches).

Regards, Jan

Jan

The examples use 8-bit CHAR strings, while Matlab uses 16 bits per CHAR. In addition DataHash includes information about the type and dimension of the input, as described in the help text. Because e.g. Mido asked for a version, which considers the actual data only last September, I've included this feature last year. Note the conversion to 8-bits:
Opt.Input = 'bin'; Opt.Method = 'MD5';
DataHash(uint8('The quick brown fox jumps over the lazy dog.'), Opt)
% >> e4d909c290d0fb1ca068ffaddf22cbd0
as expected.
But DataHash fails for the empty string and binary input! Thanks for finding this bug, it will be fixed soon.

It would be nice to be able to check the returned values against some public results. For example Wikipedia mentions MD5("") = d41d8cd98f00b204e9800998ecf8427e and MD5("The quick brown fox jumps over the lazy dog.") = e4d909c290d0fb1ca068ffaddf22cbd0, and I don't know how to get the same from DataHash to check it.
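
Both public test vectors can also be reproduced independently of DataHash by calling Java's MessageDigest on the 8-bit codes of the text. This is a minimal sketch that assumes a JVM is available; the variable names are chosen for this example only.

```matlab
% Sketch: reproduce two public MD5 test vectors with Java's MessageDigest.
Texts  = {'', 'The quick brown fox jumps over the lazy dog.'};
Hashes = cell(size(Texts));
for k = 1:numel(Texts)
    Engine = java.security.MessageDigest.getInstance('MD5');
    Bytes  = uint8(Texts{k});             % 8-bit codes, not Matlab's 16-bit CHARs
    if ~isempty(Bytes)                    % nothing to feed for the empty string
        Engine.update(Bytes);
    end
    Hashes{k} = sprintf('%02x', typecast(Engine.digest(), 'uint8'));
end
% Hashes{1}: d41d8cd98f00b204e9800998ecf8427e
% Hashes{2}: e4d909c290d0fb1ca068ffaddf22cbd0
```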

Jan

@Oyvind: I have also posted an MD5 calculator as a MEX function. It does not allow the direct accumulation of the hash for nested cells or structs, but you could XOR the partial hashes. This will not be trivial, though.

Oyvind

Hi,
Is it possible to run DataHash without Java? (I am stuck with the limitation of Excel 2003 using a compiled version of DataHash that requires the JVM.)
Oyvind

Jan

@Mido: Now you can use the Opt.Input='bin' method to create the hash for the raw data.

Jan

@Mido: DataHash considers the class and dimensions of the inputs, otherwise UINT8([0,0]) and UINT16(0) would have the same hash.
I'm going to add the option, that only the contents of the input is considered. But even then the MD5 of 'I am not happy' will not be '59b4...', because Matlab uses 16 bits for a CHAR - you look for UINT8('I am not happy').
You can find a fast Md5 tool here: http://www.mathworks.com/matlabcentral/fileexchange/25921-calcmd5

Mido Mido

It does not work with me!!
When i tried:

Opt.Format = 'HEX';
Opt.Method = 'MD5';
DataHash('I am not happy', Opt)

it gives me : "7C23124A8F69D72A65C0E86A4B9075CF"
although the correct is:
"59b469ea3ffbe72cf4983facf13cbe1f"

The same also for SHA-1!!

Could any one help?

Martin

Aaron

Extremely well made function.
Very easy to use.
Works correctly.

Very useful for my simulations parameters sets !

Thank you Jan,

Regards,

François.

Updates

1.7.1

Bugfix in unit test function: GetMD5 caught correctly now.

1.7

Improved handling of strings. Accept inputs as list of char vectors.

1.6

Opt struct not case-sensitive. String class accepted as data.

1.5.0.0

Only the description has been changed. The code was not touched.

1.5.0.0

Bugfix: Failed for empty files.

1.5.0.0

No need to use TYPECASTX in modern Matlab versions. Hashes have changed for 'Array' mode to reply the same output as GetMD5. Speed improvements for STRUCT arrays.

1.4.0.0

In the version 30-Mar-2015 the results have been changed, but the help section has not been adjusted. This is fixed now.
Structs arrays are processed much faster now and the checksum differs from earlier versions.

1.4.0.0

Fixed bugs: Strings and empty array for "binary" mode. For "array" mode [] and zeros(1,1,0) had the same hash before.

1.3.0.0

Accept empty input for binary mode now.

1.2.0.0

Binary mode added to consider only the contents of the data.

MATLAB Release Compatibility
Created with R2016b
Compatible with any release
Platform Compatibility
Windows macOS Linux