Why does Matlab transpose hdf5 data?

There is an apparent bug in the Matlab HDF5 read/write utilities that breaks interoperability with other code. Simple array datasets are read and written as the transpose of their actual shape. I imagine this is because Matlab uses column-major (Fortran-style) order, whereas the HDF5 standard uses row-major (C-style) order.
Minimal example that illustrates the problem:
h5create('test.h5', '/dataset', [2,3]);
h5write('test.h5', '/dataset', reshape(1:6,[2,3]))
Running the HDF5 utility h5ls on the output reveals the problem:
$ h5ls test.h5
dataset Dataset {3, 2}
This is not evident if only using the HDF5 tools from within Matlab, since reading the dataset in also transposes it back.
>> h5read('test.h5', '/dataset')
ans =
1 3 5
2 4 6
Matlab should either fix this in future versions or mention the convention in the documentation, since people mostly choose HDF5 for interoperability with other systems, and this can be a tricky bug to find.
In versions:
  • h5ls: Version 1.8.14
  • Matlab 8.6.0.267246 (R2015b) GLNXA64
  1 Comment
Daniel Döhring on 24 May 2019
Edited: Daniel Döhring on 24 May 2019
Actually this bug seems to be still around. In my case, a (pseudo) multiarray of dimensions is in Matlab internally permuted to . As a consequence, it is impossible to write back a multiarray in dimensions , since Matlab does not represent matrices in manner.


Accepted Answer

James Tursa on 20 Oct 2016
Edited: Walter Roberson on 21 Oct 2016
In the following link:
I read the following under Data Layout:
"Contiguous: The array is stored in one contiguous area of the file. This layout requires that the size of the array be constant"
"The offset of an element from the beginning of the storage area is computed as in a C array."
"The first dimension stored in the list of dimensions is the slowest changing dimension and the last dimension stored is the fastest changing dimension."
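As a rough illustration of the quoted rule, the following sketch computes the element offset "as in a C array", with the first dimension changing slowest and the last changing fastest. The helper name `row_major_offset` is made up for this example and is not part of any HDF5 API.

```python
def row_major_offset(index, dims):
    """Element offset of `index` in an array of shape `dims` (C convention).

    Hypothetical helper for illustration only: the last dimension is the
    fastest changing one, the first is the slowest.
    """
    offset = 0
    for i, d in zip(index, dims):
        if not 0 <= i < d:
            raise IndexError("index out of range for the given dims")
        offset = offset * d + i
    return offset


# Dimensions as h5ls reports them for the dataset in the question.
dims = (3, 2)
offsets = [row_major_offset((i, j), dims)
           for i in range(dims[0]) for j in range(dims[1])]
print(offsets)  # [0, 1, 2, 3, 4, 5]
```

Stepping the last index by one moves one element forward in storage, while stepping the first index moves forward by a whole row of 2 elements.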
So yes, it appears clear that the data storage order in the file follows the "C" array convention, and I can find no option that allows a "Fortran" array convention.
That being said, the dimensions that apparently got stored in the file appear to be correct. That is, the slowest changing dimension (3) did in fact get stored in the file first, followed by the fastest changing dimension (2). This assumes of course that the data was written into the file in the order 1, 2, 3, 4, 5, 6. So the data appears to be written to the file correctly as far as that goes (i.e., the dimensions stored in the file match the data order in the file). It just didn't get written out in the order you expected. So it looks like you would need to manually transpose for 2D (or permute for nD) on the MATLAB side, as you suggested, if you want the data in the file to have the "same" dimensions as the MATLAB variable.
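A small sketch of that reasoning, in plain Python rather than MATLAB or the HDF5 library: it assumes, as above, that the elements land in the file in MATLAB's memory order 1 through 6 and that the file records the dimensions as {3, 2}. The variable names are illustrative only.

```python
# reshape(1:6, [2,3]) as MATLAB displays it: 2 rows, 3 columns.
matlab_matrix = [[1, 3, 5],
                 [2, 4, 6]]
rows, cols = 2, 3

# Column-major flatten: walk each column top to bottom, which is the
# order MATLAB keeps the elements in memory.
buffer = [matlab_matrix[r][c] for c in range(cols) for r in range(rows)]

# A row-major (C-style) reader sees the same buffer with dims {3, 2}.
file_dims = (cols, rows)
c_view = [buffer[i * file_dims[1]:(i + 1) * file_dims[1]]
          for i in range(file_dims[0])]

print(buffer)  # [1, 2, 3, 4, 5, 6]
print(c_view)  # [[1, 2], [3, 4], [5, 6]]
```

The buffer is never reordered. Only the interpretation of the dimensions changes, which is why a C or Python reader sees the transpose of the MATLAB variable.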
Maybe submit a bug report and see what TMW has to say about all this. I don't know if I would classify this as a "bug" per se, since the dimensions and data storage in the file appear to match each other. What I might expect is that MATLAB would match whatever the official Fortran HDF5 interface subroutines do. If the official Fortran API routines do the same thing as MATLAB, then I would say MATLAB did it correctly (but should document this behavior). But if the official Fortran API routines permute the data into "C" array storage order, then MATLAB is out of step with this, and I might call it a bug even though the file is written correctly (it just didn't match the apparent expectation of the HDF Group). Maybe contact the HDF Group and ask them that question.

More Answers (3)

Kameron Harris on 20 Oct 2016
Edited: Kameron Harris on 20 Oct 2016
Starting in v1.5, HDF5 allows the dimension permutation (C- or FORTRAN-order) to be specified in the file, presumably for interoperability: https://support.hdfgroup.org/HDF5/doc1.8/H5.intro.html

Kameron Harris on 20 Oct 2016
Edited: Kameron Harris on 20 Oct 2016
Effectively, it is a bug for me. If I read my matrix into Python (using the h5py library), C++, etc., then the matrices are returned to me with the shape reported by h5ls.
I was able to figure it out and transpose when reading/writing HDF5 to interface with my other programs.
It's weird that HDF5 references this in terms of "fastest changing dimension", which is not consistent across programming languages. When people want to store/access their data, they want it to come in a single format that doesn't depend on implementation.
Thanks for your response.
  1 Comment
James Tursa on 20 Oct 2016
The HDF Group intent seems to be that applications should be able to write to the file in a native storage order. This seems reasonable to me, especially from a speed standpoint. Why cripple column-ordered languages (Fortran, MATLAB) with a hard requirement to permute the data each time you read/write?



Kameron Harris on 20 Oct 2016
Edited: Kameron Harris on 20 Oct 2016
Interesting. From the Fortran HDF5 library documentation, https://support.hdfgroup.org/HDF5/doc1.8/fortran/index.html:
"When a C application reads data stored from a Fortran program, the data will appear to be transposed due to the difference in the C and Fortran storage orders. For example, if Fortran writes a 4x6 two-dimensional dataset to the file, a C program will read it as a 6x4 two-dimensional dataset into memory. The HDF5 C utilities h5dump and h5ls will also display transposed data, if data is written from a Fortran program."
  2 Comments
James Tursa on 20 Oct 2016
Well, this pretty much answers the question. The HDF Group intended the various applications (Fortran, MATLAB, C, C++, Python, etc.) to be able to write to the file in their native storage order and simply list the dimensions of the data in the file in a specified order (slowest changing first, fastest changing last). It is then incumbent on the user to know what storage order his/her applications use if they are to share data through this file format, and to permute the data accordingly if necessary.
So given this language in the HDF doc, I would say MATLAB is doing everything correctly (but maybe could help the user out with some documentation about interoperability with other languages/applications).
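For arrays with more than two dimensions, the same idea can be checked numerically. The sketch below (plain Python, hypothetical helper names, dimensions chosen arbitrarily) verifies that the column-major offset of an index equals the row-major offset of the reversed index under reversed dimensions, which is why a full reversal of the dimension order is the permutation needed when sharing data between column-major and row-major applications.

```python
from itertools import product


def column_major_offset(index, dims):
    """Offset with the FIRST dimension changing fastest (Fortran/MATLAB)."""
    offset, stride = 0, 1
    for i, d in zip(index, dims):
        offset += i * stride
        stride *= d
    return offset


def row_major_offset(index, dims):
    """Offset with the LAST dimension changing fastest (C)."""
    offset = 0
    for i, d in zip(index, dims):
        offset = offset * d + i
    return offset


# Arbitrary example shape; any shape should behave the same way.
dims = (4, 6, 2)
matches = all(
    column_major_offset(idx, dims) == row_major_offset(idx[::-1], dims[::-1])
    for idx in product(*(range(d) for d in dims))
)
print(matches)  # True
```

In other words, an element at index (i, j, k) in the column-major application sits at index (k, j, i) as seen by a row-major reader of the same file.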

