NetCDF or HDF5 or XYZ to provide time series data at the fingertips of the user

Asked by per isakson on 5 May 2012

Question: Have I done my homework well enough to choose HDF5 and stop thinking about alternatives?

One more question: What are the problems with HDF5 that I have overlooked? Will I face unpleasant surprises?

Currently, I store time series data from building automation systems (BAS) in large structures, often named X, in mat-files. Each time series is stored in one field. I use the denomination Qty for these time series. A typical X has 1000 fields and is 100 MB or larger. I have used that "format" for more than ten years. However, I am searching for something better.

Goals: The user of a visualization tool shall have a huge amount of time series data at their fingertips. It shall also be possible to read and write the data files with non-Matlab applications.

What I have done so far:

  1. Experimented with and used a system based on 128 KB memmapfiles. Each time series is stored in a series of memmapfiles. Some metadata is embedded in the filename. It required too much coding and I failed to make it fast enough. Skipped!
  2. Studied some FEX contributions: Waterloo File and Matrix Utilities, HDS-Toolbox (RNEL-DB), and ... . I share their description of the problem and the goal, but ... and a bit too clever for my capacity.
  3. Googled for NetCDF and HDF; decided to try NetCDF; ran an experiment with Matlab's high-level API (ncwrite, ncread, ...); experienced very poor performance or worse.
  4. Searched in FEX for NetCDF and HDF5. There are 21 and 13 hits, respectively.
  5. A performance test. I used a *structure, X, with 1346 fields each holding a <66528x1 double> time series*. The total size of X is 0.7 GB. R2012a, Windows 7, 64 bit. The test included writing the data of the X-structure to the file in question (with X2hdf) and reading the data back to a structure (with hdf2X). The corresponding functions for NetCDF are nearly identical, with "h5" replaced by "nc". With NetCDF, I used the format netcdf4_classic and "'Dimensions', { 'qty', len_time }", i.e. a fixed and limited length. (A sketch for constructing a synthetic X of this shape follows the notes below.)
    Execution time in seconds
    --------------------------------------
    Method              write       read       
    HDF5                32.6        2.8
    NetCDF(1)           inf         inf
    save,load(2)        24.4        7.3
    fwrite,fread(3)     3.8         1.3
    read_hdf (FEX)                  3.3
    read_netcdf (FEX)               8.1
    matfile(4)          74          196
    --------------------------------------
  1. the result with NetCDF is strange. "inf" stands for two orders of magnitude longer than the corresponding values for HDF5. NetCDF uses ...\netcdflib.mexw64 and read_netcdf uses unidata.ucar.edu/pub/netcdf-java/v4.1/netcdf-4.1.jar. The only problem with NetCDF is the performance; there were no other indications such as warnings or anything. The profiler says that the time is spent in the mex-file. It behaves somewhat like a loop with a growing variable.
  2. "tic, save, toc" just to get something to compare to
  3. writes and reads 1345 simple files to a folder; provides a lower limit for the execution time
  4. matfile stands for the matlab.io.MatFile class. The results are unbelievable! How come it is so slow? The resulting file is not compressed. The profiler says 99.7% of the time is spent in built-ins, overhead, etc. when reading the mat-file.
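
For reference, a minimal sketch (not part of the original test) of how a synthetic X of this shape can be built; the field names Qty0001, Qty0002, ... are made up, but the sdn and QtyList fields are the ones X2hdf and X2matfile below rely on:

    % Build a synthetic X: 1346 fields, each a 66528x1 double, plus the
    % timestamp vector sdn and the name list QtyList used by the functions below.
    n_qty    = 1346;
    len_time = 66528;
    X.sdn     = now - ( len_time-1 : -1 : 0 )' / 96;    % 15-minute timestamps as datenums
    X.QtyList = cell( 1, n_qty );
    for ii = 1 : n_qty
        qty             = sprintf( 'Qty%04d', ii );     % hypothetical field name
        X.QtyList{1,ii} = qty;
        X.(qty)         = rand( len_time, 1 );
    end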

The speed of read_netcdf (java) is remarkable in comparison with NetCDF (ncread).

Next, I made another test with NetCDF. Instead of "nccreate & ncwrite" I recovered the schema from the nc-data-file and did "ncinfo, ncwriteschema & ncwrite". That didn't improve the speed. I might be doing something terribly wrong, but the problem is that Matlab doesn't provide the slightest hint to help me out.
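
For reference, a rough outline of that schema-based variant; the file names old.nc and new.nc are placeholders, and X is the structure from the test above:

    % Recover the schema from an existing nc-file, create the new file from it
    % in one go, and then write the data with ncwrite only (no nccreate calls).
    finfo = ncinfo( 'old.nc' );
    ncwriteschema( 'new.nc', finfo );
    for ii = 1 : numel( finfo.Variables )
        qty = finfo.Variables(ii).Name;
        ncwrite( 'new.nc', qty, X.(qty) );
    end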

2012-05-11: I have added the result of another HDF5 test, with variants, to the question "Matlab slows down when writing to large file".

--- Why not fwrite and fread ---

After I dropped my memmap-experiment I did a similar experiment with fwrite and fread. Initially I was rather enthusiastic.

  1. One time series is stored in a few files, each holding at most three months' worth of data.
  2. The files are written and read as single precision. The first 512 B is metadata cast as single. The rest is divided between timestamps and time series data (also single). Timestamps are whole seconds. (A minimal sketch of this layout follows the list.)
  3. Some metadata is embedded in the file name.
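
A minimal sketch of that layout; the function names are made up, and the equal split of the remainder between timestamps and values follows from there being one timestamp per sample:

    function write_qty_file( FileSpec, meta, timestamp, data )
    % Write one quantity file: 512 B (128 singles) of metadata, then the
    % timestamps (whole seconds) and the time series values, all as single.
        fid = fopen( FileSpec, 'w' );
        fwrite( fid, single( meta(1:128) ), 'single' );
        fwrite( fid, single( timestamp  ), 'single' );
        fwrite( fid, single( data       ), 'single' );
        fclose( fid );
    end
    function [ meta, timestamp, data ] = read_qty_file( FileSpec )
    % Read one quantity file written by write_qty_file.
        fid  = fopen( FileSpec, 'r' );
        meta = fread( fid, 128, 'single' );
        rest = fread( fid, inf, 'single' );
        fclose( fid );
        n         = numel( rest ) / 2;      % half timestamps, half values
        timestamp = rest( 1 : n );
        data      = rest( n+1 : end );
    end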

Lessons learned:

  1. It is doable and performance is good.
  2. More code than I anticipated is needed to make it work. And that code will need to be maintained.
  3. Scalability might be a problem, since a lot of files can choke Windows.

@Oleg (2012-05-09), 1000 series per building, 3 years of data, 24 buildings. Each file holds 3 months (or less) of data for one series, i.e. at least 12 files per series, roughly 12,000 files per building, and close to 300,000 files in total. This way, and others like it, is how I have choked Windows in the past. With HDF5 I hoped for one file per building.

--- Why not MySQL ---

I participated in the development of a database system for trend data from new office buildings. The system has been in operation for a few years now. The development was done by a database pro.

The performance of that system is far from good enough to feed data to a visualization tool and give the user the feeling of "data at the fingertips".

@Oleg (2012-05-07), we used a star schema with observations in the fact table and a few dimensions - nothing fancy. Thus, one scalar value of a time series per row in the fact table. The index was larger than the "data tables" and grew faster with size than the data. During development and testing we ran the database on a low-end Dell server. The reading performance decreased with increasing size of the database. We planned to use partitioning, which was a new feature in MySQL at that time (ver 5.1?). We intended to partition with regard to the time dimension. The size of the test database was at most a few GB. The reading performance was nowhere near 20 GB per minute. However, performance obviously depends on the schema, the indexes, and not least the hardware. (The queries were straightforward.)

One could store chunks of time series in blobs in the database. That would change performance dramatically.

--- Bottom line ---

HDF5 provides a lot of valuable features for free. The performance seems to be good enough. Furthermore, it seems easy enough to use. What are the problems?

Dimitris’s Blog says that one of the major problems is that HDF5 is open and self-describing: everybody may look inside.

--- To be continued ---

@Oleg (2012-05-16),

---------------

    function    timing = X2hdf( X )
    % Write each time series in X to its own dataset in one HDF5 file and
    % log the elapsed time after every 10th dataset.
        FileSpec    = 'c:\MyData\ES\Nereus\mat\hdf5.hdf';
        len_time    = length( X.sdn );
        n_qty       = length( X.QtyList );      % number of time series in X
        timing  = nan( 2, ceil( n_qty / 10 ) );
        jj      = 0;
        ticID   = tic;
        h5create(   FileSpec                ...
                ,   '/sdn'                  ...
                ,   [ len_time, 1 ]         ...
                ,   'Datatype', 'double'    ...
                )
        h5write( FileSpec, '/sdn', X.sdn )
        for ii = 1 : n_qty
            if rem( ii, 10 ) == 0
                jj = jj+1;
                timing(:,jj) = [ ii; toc( ticID ) ];
            end
            qty = char( X.QtyList( 1, ii ) );
            h5create(   FileSpec                ...
                    ,   ['/',qty]               ...
                    ,   [ len_time, 1 ]         ...
                    ,   'Datatype', 'double'    ...
                    )
            h5write(    FileSpec,  ['/',qty], X.(qty) )
        end
    end
    function    [ timing, X ] = hdf2X()
    % Read every dataset of the HDF5 file back into fields of a structure X
    % and log the elapsed time after every 10th dataset.
        FileSpec = 'c:\MyData\ES\Nereus\mat\hdf5.hdf';
        info = h5info( FileSpec );
        n_qty = numel( info.Datasets );
        timing  = nan( 2, ceil( n_qty / 10 ) );
        jj      = 0;
        ticID   = tic;
        for ii = 1 : n_qty
            if rem( ii, 10 ) == 0
                jj = jj+1;
                timing(:,jj) = [ ii; toc( ticID ) ];
            end
            qty = info.Datasets(ii).Name;
            X.(qty) =  h5read( FileSpec, ['/',qty] );
        end
    end
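
For completeness, roughly how the two functions are invoked in the test; each timing array holds [ dataset index; elapsed seconds ], sampled every tenth dataset:

    timing_write       = X2hdf( X );    % write all fields of X to the HDF5 file
    [ timing_read, Y ] = hdf2X();       % read every dataset back into a structure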

--- Added 2012-05-09 ---

The m-files below test the matlab.io.MatFile class.

  1. X is a structure with 1346 fields, each holding a <66528x1 double> time series.
  2. Nereus_matfile.h5 is a mat-file containing 1346 <66528x1 double> vectors.

I failed to read Nereus_matfile.h5 with read_hdf.

Cell contents reference from a non-cell array object.
Error in read_hdf5 (line 35) 


    function    timing = X2matfile( X )
    % Write each time series in X to its own variable in a mat-file via the
    % matlab.io.MatFile class and log the elapsed time after every 10th variable.
        FileSpec = 'c:\MyData\ES\Nereus\mat\Nereus_matfile.h5';
        % length of list of names of time series
        n_qty = length( X.QtyList );
        timing  = nan( 2, ceil( n_qty / 10 ) );
        jj      = 0;
        ticID   = tic;
        moh = matfile( FileSpec );
        moh.sdn = X.sdn;
        for ii = 1 : n_qty
            if rem( ii, 10 ) == 0
                jj = jj+1;
                timing(:,jj) = [ ii; toc( ticID ) ];
            end
            qty = char( X.QtyList( 1, ii ) );
            moh.(qty)   = X.(qty);
        end
    end
    function    [ X, timing ] = matfile2X()
    % Read every variable of the mat-file back into fields of a structure X
    % and log the elapsed time after every 10th variable.
        FileSpec = 'c:\MyData\ES\Nereus\mat\Nereus_matfile.h5';
        moh = matfile( FileSpec );
        mow = whos( moh );
        n_qty = size( mow, 1 );
        timing  = nan( 2, ceil( n_qty / 10 ) );
        jj      = 0;
        ticID   = tic;
        for ii = 1 : n_qty
            if rem( ii, 10 ) == 0
                jj = jj+1;
                timing(:,jj) = [ ii; toc( ticID ) ];
            end
            X.(mow(ii).name) =  moh.(mow(ii).name);
        end
    end

6 Comments

Oleg Komarov on 6 May 2012

Actually, if you like, we could work together on an fread/fwrite solution.

Sean de Wolski on 9 May 2012

That's why it is slow. Using a structure, the _entire_ structure has to be read into memory.

per isakson on 9 May 2012

Loading the structure to memory takes 7.3 seconds. However, that is not included in the test of matfile. The structure is loaded beforehand and passed to the function X2matfile.


3 Answers

Answer by Sean de Wolski on 7 May 2012

Have you looked at the MATFILE class in newer ML releases? It allows you to access variables, and pieces of variables, of a mat-file (hdf5).

This would require creating many variables to be efficient, i.e. each time series would be its own variable; you could store the metadata in the variable name as you described above. I know this is typically frowned upon (a1, a2, ..., an) but it would give you quick and easy access to what you need.

Just a thought, I may be completely off base and I apologize if I am.
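
A minimal sketch of that idea; the file name and the variable name are made up:

    m = matfile( 'bas.mat', 'Writable', true );
    m.TA01_GT401 = rand( 66528, 1 );        % each time series is its own variable
    chunk = m.TA01_GT401( 1:1000, 1 );      % read a piece without loading the rest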

6 Comments

Sean de Wolski on 8 May 2012

@Per, what tests are you putting matfile through? It will be slow if you are indexing into cells or structs since the entire file must be pulled into memory. Make sure you try it with individual variables to be indexed.

Oleg Komarov on 8 May 2012

Or you can create an m-by-2 matrix where you concatenate several time series vertically. Then store a master file which records the start and end of each time series. This is basically the approach I would also use with fread/fwrite.

Sean de Wolski on 8 May 2012

Yes, Oleg's approach would work well; pad with NaNs for values you don't have. Store the metadata in a separate matfile or cell array.
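
A rough sketch of that layout, with toy data and made-up file names:

    % Two toy series stacked into one m-by-2 matrix [ timestamp, value ],
    % plus a master index recording the row range of each series.
    t1 = (1:10)';   v1 = rand(10,1);
    t2 = (1:20)';   v2 = rand(20,1);
    data  = [ t1, v1 ; t2, v2 ];
    index = struct( 'name', { 'qtyA', 'qtyB' }, ...
                    'rows', { [1, 10], [11, 30] } );
    save( 'master.mat', 'index' );
    save( 'data.mat', 'data', '-v7.3' );     % v7.3 so matfile can index into it
    % Read back only the second series without loading the whole matrix:
    m    = matfile( 'data.mat' );
    r    = index(2).rows;
    qtyB = m.data( r(1):r(2), : );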

Answer by T. on 16 Jan 2013
Edited by T. on 16 Jan 2013

I have also done a lot of experiments with the performance of netCDF within Matlab. Some findings:

  • The Matlab high-level functions ncread and ncwrite have some performance issues by design: every command requires Matlab to read the header of the netCDF file in order to determine the commands to pass to the low-level functions netcdf.getVar, netcdf.putVar, etc.
  • The time it takes to read the header of a netCDF file is much greater for netCDF4 (which is HDF5) than for netCDF3, as netCDF3 is much simpler. Also, the complexity of the header increases with the number of variables in a file; tens is usually workable, hundreds gives very poor performance.

So, to improve netCDF performance, try using version 3 if you can. Otherwise, try calling the low-level functions netcdf.xxx instead of the high-level functions.
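
For illustration, a sketch of the low-level route; the file stays open across many reads (file and variable names follow the example further down):

    ncid = netcdf.open( 'test.nc', 'NOWRITE' );         % open the file once ...
    for ii = 1:100
        varid = netcdf.inqVarID( ncid, sprintf('var%03.0f',ii) );
        data  = netcdf.getVar( ncid, varid );           % ... read many variables ...
    end
    netcdf.close( ncid );                               % ... close once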

What Matlab would need (IMHO) is a high-level, built-in, object-oriented function to deal with netCDF files, in which the netCDF file stays open and the header is cached.

Here is some example code to illustrate the problem:

for format = {'classic','netcdf4'}
    fprintf(1,'\nFormat = %s\n',format{:});
    if exist('test.nc','file')
        delete('test.nc')
    end
    nVars = 100;
    for jj = 0:4
        fprintf(1,'\nvariables = %d\n',nVars * (jj+1));
        for ii = (1:nVars)+nVars * jj
            nccreate('test.nc',sprintf('var%03.0f',ii),...
                'Dimensions',{'r' 400 'c' 1},...
                'Format',format{:});
        end
        for ii = (1:50:nVars)+nVars * jj
            ncwrite('test.nc',sprintf('var%03.0f',ii),reshape(peaks(20),[],1));
        end
        for ii = (1:50:nVars)+nVars * jj
            tic
            ncread('test.nc',sprintf('var%03.0f',ii));
            toc
        end
    end
end

0 Comments

Answer by Malcolm Lidierth on 3 Mar 2013

@Per

I suspect some of the problems with memmapfile might be related to using multiple 128 KB memmapfiles. Each requires system resources. The Waterloo File Utilities grew out of the sigTOOL project, where I had a similar issue. In that case, each channel was represented by a memmapfile object, but there might be many hundreds of channels. The "trick" I used was to dynamically instantiate the memmapfile instances only on demand (not when the file was first accessed) and to destroy them when not needed. That has allowed sigTOOL users to work with files of many GB.
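
A minimal sketch of that on-demand pattern; the offset and length arguments are placeholders for whatever the channel index supplies:

    function data = read_channel( FileSpec, offset, nSamples )
    % Map one channel only when its data are requested; the mapping is
    % released as soon as the memmapfile object goes out of scope.
        m = memmapfile( FileSpec, 'Offset', offset, ...
                        'Format', { 'double', [ nSamples, 1 ], 'x' } );
        data = m.Data.x;
    end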

With an HDF5 file, you can still use memory mapping by retrieving the byte offset to your data if:

  1. The data are not chunked
  2. The data are not compressed

This is a limitation of the API rather than the file format, I believe. You could use external mechanisms to break up large data files into separate components, leaving HDF5 unaware of the "chunking" internally, and use external compression before writing the data.
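
A sketch of that route with Matlab's low-level HDF5 interface, assuming the contiguous, uncompressed /sdn dataset from the question and using a shortened placeholder for the file name:

    fid    = H5F.open( 'hdf5.hdf', 'H5F_ACC_RDONLY', 'H5P_DEFAULT' );
    dset   = H5D.open( fid, '/sdn' );
    offset = H5D.get_offset( dset );     % byte offset of the raw, contiguous data
    H5D.close( dset );
    H5F.close( fid );
    m   = memmapfile( 'hdf5.hdf', 'Offset', double( offset ), ...
                      'Format', { 'double', [ 66528, 1 ], 'sdn' } );
    sdn = m.Data.sdn;                    % the dataset, read via the memory map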

My solution in the dev version of sigTOOL is to use a folder, not a file, for the data. Each folder has a few cross-referenced files, allowing me to mix *.mat, *.bin, *.hdf5, *.xml, etc. It's ugly perhaps, and raises sync issues, but it allows me to take advantage of the best format for different data sets without being tied to their limitations.

Regards ML

0 Comments
