Why does Matlab transpose hdf5 data?
Kameron Harris
on 20 Oct 2016
Edited: Daniel Döhring
on 24 May 2019
There is an apparent bug in Matlab's HDF5 read/write utilities that breaks interoperability with other code: simple array datasets are read and written as the transpose of their actual shape. I imagine this is because Matlab uses column-major (Fortran-style) order, whereas the HDF5 standard uses row-major (C-style) order.
Minimal example that illustrates the problem:
h5create('test.h5', '/dataset', [2,3]);
h5write('test.h5', '/dataset', reshape(1:6,[2,3]))
Running the HDF5 utility h5ls on the output reveals the problem:
$ h5ls test.h5
dataset Dataset {3, 2}
This is not evident if only using the HDF5 tools from within Matlab, since reading the dataset in also transposes it back.
>> h5read('test.h5', '/dataset')
ans =
1 3 5
2 4 6
Matlab should either fix this in future versions or mention the convention in the documentation, since people mostly choose HDF5 for interoperability with other systems, and this can be a tricky bug to find.
In versions:
- h5ls: Version 1.8.14
- Matlab 8.6.0.267246 (R2015b) GLNXA64
1 Comment
Daniel Döhring
on 24 May 2019
Edited: Daniel Döhring
on 24 May 2019
Actually this bug seems to be still around. In my case, a (pseudo) multiarray of dimensions [...] is in Matlab internally permuted to [...]. As a consequence, it is impossible to write back a multiarray in dimensions [...], since Matlab does not represent matrices in [...] manner.
Accepted Answer
James Tursa
on 20 Oct 2016
Edited: Walter Roberson
on 21 Oct 2016
In the following link:
I read the following under Data Layout:
"Contiguous: The array is stored in one contiguous area of the file. This layout requires that the size of the array be constant"
"The offset of an element from the beginning of the storage area is computed as in a C array."
"The first dimension stored in the list of dimensions is the slowest changing dimension and the last dimension stored is the fastest changing dimension."
So yes, it appears clear that the data storage order in the file follows the "C" array convention, and I can find no option that allows a "Fortran" array convention.
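To make the two conventions concrete, here is a small sketch in plain MATLAB (no HDF5 calls) of how the same six values in memory can be described either as a 2x3 column-major array or as a {3, 2} row-major array. The variable names are illustrative only and do not come from the HDF5 documentation.

```matlab
% A 2x3 MATLAB array; MATLAB stores it column by column: 1 2 3 4 5 6
A = reshape(1:6, [2, 3]);
memOrder = A(:).';                 % elements in MATLAB's storage order

% A row-major reader that is told the shape is {3, 2} fills each row
% first. Expressed in MATLAB, that is a reshape to [2, 3] followed by
% a transpose, which yields a 3x2 matrix.
asSeenFromC = reshape(memOrder, [2, 3]).';

disp(asSeenFromC);                 % rows: [1 2], [3 4], [5 6]
```

The result equals `A.'`, which matches the `{3, 2}` shape that h5ls reports for the dataset in the question.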
That being said, the dimensions that apparently got stored in the file appear to be correct. I.e., the slowest changing dimension (3) did in fact get stored in the file first, followed by the fastest changing dimension (2). This assumes of course that the data was written into the file in the order 1, 2, 3, 4, 5, 6. So the data appears to be written to the file correctly as far as that goes (i.e., the dimensions stored in the file match the data order in the file). It just didn't get written out in the order you expected. So it looks like you would need to manually transpose for 2D (or permute for nD) on the MATLAB side, as you suggested, if you want the data in the file to have the "same" dimensions as the MATLAB variable.
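A minimal sketch of that manual permute, assuming the goal is for row-major tools such as h5ls to report the shape MATLAB shows. The file name `test_perm.h5` is hypothetical; the calls used are the same `h5create`, `h5write` and `h5read` as in the question.

```matlab
% Sketch: reverse the dimension order before writing so that row-major
% tools (h5ls, C, Python) see the shape MATLAB shows.
% 'test_perm.h5' is a hypothetical file name for this example.
fname = 'test_perm.h5';
if exist(fname, 'file'), delete(fname); end

A = reshape(1:6, [2, 3]);            % 2x3 in MATLAB
B = permute(A, ndims(A):-1:1);       % 3x2; same as A.' in the 2-D case

h5create(fname, '/dataset', size(B));
h5write(fname, '/dataset', B);

% Reading back within MATLAB returns B, so undo the permutation.
C = permute(h5read(fname, '/dataset'), ndims(B):-1:1);
```

Given the behavior shown in the question, h5ls should then report `{2, 3}` for this dataset. The cost is an extra copy of the data on every read and write.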
Maybe submit a bug report and see what TMW has to say about all this. I don't know if I would classify this as a "bug" per se, since the dimensions and data storage in the file appear to match each other. What I might expect is that MATLAB would match whatever the official Fortran HDF5 interface subroutines do. If the official Fortran API routines do the same thing as MATLAB, then I would say MATLAB did it correctly (but should document this behavior). But if the official Fortran API routines permute the data into "C" array storage order, then MATLAB is out of step with this and I might call it a bug even though the file is written correctly (it just didn't match the apparent expectation of the HDF Group). (Maybe contact the HDF Group and ask them that question.)
More Answers (3)
Kameron Harris
on 20 Oct 2016
Edited: Kameron Harris
on 20 Oct 2016
1 Comment
James Tursa
on 20 Oct 2016
The HDF Group intent seems to be that applications should be able to write to the file in a native storage order. This seems reasonable to me, especially from a speed standpoint. Why cripple column-ordered languages (Fortran, MATLAB) with a hard requirement to permute the data each time you read/write?
Kameron Harris
on 20 Oct 2016
Edited: Kameron Harris
on 20 Oct 2016
2 Comments
James Tursa
on 20 Oct 2016
Well, so this pretty much answers the question. The HDF Group intended the various applications (Fortran, MATLAB, C, C++, Python, etc) to be able to write to the file in a native storage order and simply list the dimensions of the data in the file in a specified order (slowest changing first ... fastest changing last). It is then incumbent on the user to know what storage order his/her applications use if they are to share data through this file format ... and permute the data accordingly if necessary.
So given this language in the HDF doc, I would say MATLAB is doing everything correctly (but maybe could help the user out with some documentation about interoperability with other languages/applications).
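For the opposite direction, here is a hedged sketch of what "permute the data accordingly" might look like when MATLAB reads a dataset that a row-major program (C, Python) wrote. The helper name `reverseDims` is made up for this example, and `raw` stands in for the result of an `h5read` call so that the snippet runs without a file.

```matlab
% Hypothetical helper: reverse the dimension order of any N-D array.
reverseDims = @(X) permute(X, ndims(X):-1:1);

% Suppose a C program wrote a {2, 3} dataset with rows [1 2 3] and
% [4 5 6], i.e. the values 1..6 in file order. In MATLAB,
%   raw = h5read('from_c.h5', '/dataset');   % hypothetical file name
% would come back as a 3x2 array holding those values column by column.
raw = reshape(1:6, [3, 2]);        % stand-in for the h5read result

data = reverseDims(raw);           % 2x3, laid out as the C program saw it
disp(data);                        % rows: [1 2 3], [4 5 6]
```

The same helper can be applied before `h5write` when the file is meant for a row-major consumer.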