Matfile runs incredibly slowly on large files--what might be the problem?

Question

Michael on 5 Jul 2013


https://www.mathworks.com/matlabcentral/answers/81232-matfile-runs-incredibly-slowly-on-large-files-what-might-be-the-problem

Edited: per isakson on 14 Sep 2021

I have a matlab file that contains one variable, a 64000x31250 array of singles. I use matfile to pull single columns out of that array. I've done similar operations on smaller (say 7000x31250) arrays and had it work fine. However, with this matrix, each column read takes 20!!!! seconds. In the profiler, essentially all of the time is taken on matfile.m's line 460:

[varargout{1:nargout}] = internal.matlab.language.partialLoad(obj.Properties.Source, varSubset, '-mat');

All of this work (saving, matfile'ing, etc.) was done in R2012b with the 7.3 file format.

To set the performance scale, reading in the entire variable with a load command takes 127 seconds (i.e., less than the time matfile takes to read 7 of the 31250 columns).

Edit: a few details I should have included: 24 GB RAM, Windows 7 x64, CPU is an i7-950 (4 cores, 8 with hyperthreading). Disk activity is very, very low during this process, but a single core is running at max speed (i.e., one MATLAB process is using 13% CPU on the "8 core" CPU throughout).

Any ideas why matfile is choking so badly?

4 Comments

per isakson on 25 Jul 2013

Edited: per isakson on 25 Jul 2013

"column-major" does not apply to the matlab.io.MatFile class when it comes to reading speed. See the answer. Reading rows is MUCH faster.

Isaac Asimov on 25 Jan 2018


I have also encountered this problem in R2017a.

Reading the data by row or by column does not change the performance noticeably.

I located the lines that take the most time:

(Line 459 ~ 463 in file: `InstallPath\MATLAB\toolbox\matlab\iofun\+matlab\+io\MatFile.m`)

 if obj.Properties.SupportsPartialAccess
     [varargout{1:nargout}] = matlab.internal.language.partialLoad(obj.Properties.Source, varSubset, '-mat');
 else
     [varargout{1:nargout}] = inefficientPartialLoad(obj, indexingStruct, varName);
 end

MAT-files saved with '-v7.3' take the upper branch, and MAT-files saved in other versions ('-v7', '-v6') take the lower branch.

But I do not know why this line takes so much time.

If you comment out the upper line `matlab.internal.language.partialLoad(...)` and replace it with the lower line `inefficientPartialLoad(...)`, the performance does not change much.

It seems that the upper function is no more efficient than the lower `inefficientPartialLoad`.

I hope that the development team at The MathWorks can improve this function.

After all, it is impractical to `load` large data into the workspace just to speed up the program. (Besides, `load` on MAT-files is also much slower than we would expect.)


Answer 1

per isakson on 6 Jul 2013


Edited: per isakson on 25 Jul 2013


Summary: "column-major" does not apply to the matlab.io.MatFile class when it comes to reading speed.

---

Column-major or row-major?

Doc on hdf5read says:

    [...]HDF5 describes data set dimensions in row-major order; MATLAB stores 
    data in column-major order.  However, permuting these dimensions may not 
    correctly reflect the intent of the data and may invalidate metadata. When 
    BOOL is false (the default), the data dimensions correctly reflect the data 
    ordering as it is written in the file — each dimension in the output variable 
    matches the same dimension in the file.

Matlab uses column-major order and HDF5 uses row-major order. The MAT-file 7.3 file format "is" HDF5.

The following test (R2012a 64-bit, 8 GB, Windows 7) shows that for a <5000x5000 single>:

- reading one column takes approximately half the time of reading the full matrix
- reading one row is approximately 20 times faster than reading one column.

In this case the matrix is so small that my 8GB should not be a bottleneck.

    N = 5e3;
    filespec = 'matfile_test.mat';
    mat = rand( N, 'single' );
    save( filespec, 'mat', '-v7.3' )
    obj = matfile( filespec );
    tic, mfm = obj.mat; toc
    tic, h5m = h5read( filespec, '/mat' ); toc
    dfm  = mfm-mat;
    d5m  = h5m-mat;
    max(abs(dfm(:)))
    max(abs(d5m(:)))
    tic, mfm = obj.mat( :, 1 ); toc
    tic, h5m = h5read( filespec, '/mat', [1,1], [N,1] ); toc
    dfm  = mfm-mat( :, 1 );
    d5m  = h5m-mat( :, 1 );
    max(abs(dfm(:)))
    max(abs(d5m(:)))
    tic, mfm = obj.mat( 1, : ); toc
    tic, h5m = h5read( filespec, '/mat', [1,1], [1,N] ); toc
    dfm  = mfm-mat( 1, : );
    d5m  = h5m-mat( 1, : );
    max(abs(dfm(:)))
    max(abs(d5m(:)))

returns

    Elapsed time is 1.955082 seconds.
    Elapsed time is 1.674106 seconds.
    ans =
         0
    ans =
         0
    Elapsed time is 0.984833 seconds.
    Elapsed time is 0.822843 seconds.
    ans =
         0
    ans =
         0
    Elapsed time is 0.056097 seconds.
    Elapsed time is 0.029657 seconds.
    ans =
         0
    ans =
         0
    >>


2013-07-24: Test with R2013a 64-bit, 8 GB, Windows 7; same computer, same OS, and a new MATLAB release. The results below are from the third run of the script after restarting the computer and MATLAB. There is a small improvement in speed. However, it is nothing comparable to the row-reading result that Matt J reports in the comments.

    >> matfile_h5_script
    Elapsed time is 2.626919 seconds.
    Elapsed time is 1.219851 seconds.
    ans =
         0
    ans =
         0
    Elapsed time is 0.809362 seconds.
    Elapsed time is 0.765147 seconds.
    ans =
         0
    ans =
         0
    Elapsed time is 0.049908 seconds.
    Elapsed time is 0.020192 seconds.
    ans =
         0
    ans =
         0
    >>
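
One way to look into why rows and columns behave so differently is to inspect how the variable is laid out inside the HDF5 file. The sketch below reuses the file name from the test above with a much smaller matrix. It assumes that `save` with '-v7.3' stores `mat` as a dataset at '/mat' (as the `h5read` calls above do); the chunk shape it prints depends on the MATLAB release, so treat the output as something to check on your own system rather than a known result.

```matlab
% Sketch: inspect the HDF5 layout of a variable saved with -v7.3.
N = 500;                                   % small size, for illustration only
filespec = 'matfile_test.mat';
mat = rand( N, 'single' );
save( filespec, 'mat', '-v7.3' )

info = h5info( filespec, '/mat' );         % dataset-level metadata
disp( info.Dataspace.Size )                % dimensions as h5read reports them
disp( info.ChunkSize )                     % empty if the dataset is not chunked
disp( info.Datatype.Class )                % storage class of the dataset
```

If the reported chunks are long in one dimension and short in the other, a read along the short dimension has to touch many more chunks, which would be consistent with the timings above.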

4 Comments

Matt J on 24 Jul 2013

Edited: Matt J on 24 Jul 2013

Windows 7 Pro (64-bit), R2012a. I have a pretty fast SSD, too. Don't know if that's a contributor.

per isakson on 24 Jul 2013

Edited: per isakson on 24 Jul 2013

I assume that you obtained the numbers after running the script a few times in a row. That is, the data are available in the system cache.

The SSD should make a large difference when running the script for the first time after restarting Windows. (I know of no other way to "clear" the system cache.)

There is a difference between our results that I cannot explain. I have to leave it at that.


Answer 2

Isaac Asimov on 25 Jan 2018


Fortunately I have finally found a practical solution to this problem, that is:

----------------------------------------------------------

Put your variable in cells and then save it to a MAT-file.

Do not save it directly as a double array/matrix!

----------------------------------------------------------

I have tested this method in my computer, and the result is amazing:

tic;
m1 = matfile('var_as_matrix.mat.mat');
x1 = m1.trs_sample(1,:);
toc;
% Elapsed time is 8.686759 seconds.
tic;
m2 = matfile('var_as_cell.mat.mat');
x2 = m2.trs_sample(1,:);
y2 = x2{:};
toc;
% Elapsed time is 0.295925 seconds!

Let me explain.

I have a large data set (not sparse) and I read it as a matrix. Its size is 1000x250000 (int8).

When I save the matrix directly as a MAT-file ('-v7.3'):

- MATLAB automatically changes the data type to `double`. (You can check this yourself: call matfile('YourMatFilePath') and the console will show its properties.)
- Its size is about 400 MB.
- It takes about 8.7 s to assign the first row of the matrix to a variable.

When I put each row into a cell and then save it (with the same settings):

- I get a 1000x1 cell array, and the data type stays `int8`.
- Its size is about 200 MB. The size is reduced by half!
- It takes about 0.3 s to fetch the first cell and then assign its contents to a variable. That is almost 30x faster than the first approach! (I have repeated the test several more times, and the speedup stays between 15x and 30x.)

----------------------------------------------------------

But why does this method work?

My guess is that when you restructure the matrix into cells, the saved MAT-file is "better structured", because you build a higher level of hierarchy above the original matrix.

Thus, when you read the "better structured" MAT-file from disk, MATLAB can parse and read the data more efficiently. I think that is why the method improves the performance so much.

----------------------------------------------------------

I hope my method helps.

Apart from my solution, I have not yet found any other useful suggestion for this problem. This is strange, because the problem is so common and critical. Even four years after the question was asked, there is no feasible solution.

If anyone knows more details, please share them and post your answer here. It would be a great help to people who run into a similar problem.

5 Comments

Isaac Asimov on 26 Jan 2018

Edited: Isaac Asimov on 26 Jan 2018

@Peter Cook

Like you, I also thought that MATLAB would not convert everything to double, but when I checked the type in the directly saved MAT-file, I found that the type had changed. When the matrix was put into cells, the type did not change. Seeing is believing.

-----------------------------------------

I agree with your points on the indexing issue, since MATLAB documents this in the tips for the matfile function:

Using the end keyword as part of an index causes MATLAB to load the entire variable into memory. For very large variables, this load operation results in Out of Memory errors. Rather than using end, determine the extent of a variable, myVar, with the size method ...

But indexing may not be the key point of this problem. Even indexing a small part of a large, directly saved matrix can take a long time. If you are not convinced, you can test this and show your code and results.

I think the key point is the structure of the saved variables.
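
The documented tip quoted above can be sketched as follows. This is a minimal illustration with a small, made-up variable and a hypothetical file name (`size_demo.mat`); it only shows the indexing pattern and says nothing about speed on large files.

```matlab
% Sketch: read the last row of a variable without using the end keyword.
myVar = rand( 200, 300 );                  % hypothetical data
save( 'size_demo.mat', 'myVar', '-v7.3' )  % hypothetical file name

matObj = matfile( 'size_demo.mat' );
[nRows, nCols] = size( matObj, 'myVar' );  % query the extent without loading myVar
lastRow = matObj.myVar( nRows, 1:nCols );  % explicit indices instead of end
```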

Peter Cook on 27 Jan 2018

Edited: Peter Cook on 27 Jan 2018


This conversation got me curious, so I did a few benchmarks on my machine.

    clear,clc,close all
    d=dir('phaseVariance.mat');
    fprintf('File Size: %d bytes\n',d.bytes);
    matObj = matfile('phaseVariance.mat');
    [mRow,nCol] = size(matObj,'phaseVariance');
    %lets load a subset of the data 4 different ways and compare elapsed time
    try
        t1 = tic;
        A = matObj.phaseVariance(:,1000001:9999999);
        dt1 = toc(t1);
        clearvars A
        fprintf('A = matObj.phaseVariance(:,1000001:9999999);\n %0.1f s\n',dt1);
    catch MERR
        fprintf('A = matObj.phaseVariance(:,1000001:9999999);\n');
        disp(MERR.identifier)
        disp(MERR.message)
    end
    try
        t2 = tic;
        A = matObj.phaseVariance(1:end,1000001:9999999);
        dt2 = toc(t2);
        clearvars A
        fprintf('A = matObj.phaseVariance(1:end,1000001:9999999);\n %0.1f s\n',dt2);
    catch MERR
        fprintf('A = matObj.phaseVariance(1:end,1000001:9999999);\n');
        disp(MERR.identifier)
        disp(MERR.message)
    end
    try
        t3 = tic;
        A = matObj.phaseVariance(1:mRow,1000001:9999999);
        dt3 = toc(t3);
        clearvars A
        fprintf('A = matObj.phaseVariance(1:mRow,1000001:9999999);\n %0.1f s\n',dt3);
    catch MERR
        fprintf('A = matObj.phaseVariance(1:mRow,1000001:9999999);\n');
        disp(MERR.identifier)
        disp(MERR.message)
    end
    try
        t4 = tic;
        load('phaseVariance.mat'); A = phaseVariance(:,1000001:9999999);
        dt4 = toc(t4);
        clearvars A
        fprintf('load(''phaseVariance.mat''); A = phaseVariance(:,1000001:9999999);\n %0.1f s\n',dt4);
    catch MERR
        fprintf('load(''phaseVariance.mat''); A = phaseVariance(:,1000001:9999999);\n');
        disp(MERR.identifier)
        disp(MERR.message)
    end

File Size: 11181693932 bytes

Method 1:

A = matObj.phaseVariance(:,1000001:9999999);
54.8 s

Method 2:

A = matObj.phaseVariance(1:end,1000001:9999999);
MATLAB:array:SizeLimitExceeded
Requested 1134x25307520 (26.7GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.

Method 3:

A = matObj.phaseVariance(1:mRow,1000001:9999999);
54.9 s

Method 4:

load('phaseVariance.mat'); A = phaseVariance(:,1000001:9999999);
MATLAB:array:SizeLimitExceeded
Requested 1134x25307520 (26.7GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.

Did MATLAB convert my array to double precision?

>> class(A)
ans =
uint8
>> mRow
mRow =
          1134
>> nCol
nCol =
      25307520
>> sprintf('%0.0f Bytes',mRow*nCol)
ans =
28698727680 Bytes
>> d=dir('phaseVariance.mat');
>> sprintf('Size on Disk %0.0f Bytes',d.bytes)
ans =
Size on Disk 11181693932 Bytes

If I am using 8 bits per array element, I would need 28,698,727,680 bytes in memory; through some HDF5 witchcraft it only occupies 11,181,693,932 bytes on disk. Based on the size in RAM and on disk, it doesn't look like it's been converted to a double precision number at any step, but let's double check. Since a v7.3 MAT-file is actually an HDF5 file, I can use the Python h5py library to open and check it:

import h5py
fptr = h5py.File('phaseVariance.mat','r')
fptr.items()
Out[3]: 
[(u'phaseVariance',
  <HDF5 dataset "phaseVariance": shape (25307520, 1134), type "|u1">)]

u1 is python-ese for a 1-byte unsigned integer, so it's not being stored on disk as a double.

But since I've already got IPython open, let's see how fast it reads that slice of my MAT-file compared to MATLAB.

%timeit fptr['phaseVariance'][1000000:9999998,range(1134)]
1 loop, best of 3: 56.1 s per loop
%timeit fptr['phaseVariance'][1000000:9999998,:]
1 loop, best of 3: 1min 19s per loop
#maybe a python range object will be more efficient than a list for the larger index?
%timeit fptr['phaseVariance'][range(1000000,9999999),0:1133]
1 loop, best of 3: 1min 58s per loop

Seems the best I can load that array chunk in either MATLAB or python is about 56 seconds.

Isaac Asimov on 30 Jan 2018

Edited: Isaac Asimov on 30 Jan 2018


@Peter Cook

Excellent tests and results! It's amazing that you tested such a large data set in so many ways. The Python code is also illuminating. Your patience and care impress me!

------------------------------------

On the issue of type conversion, you are absolutely right. I made a mistake before:

I did not notice that my data had been read with fread() before being saved, and fread() returns double values by default.

I overlooked this and did not include that part in my earlier code. By comparing your code with mine, I finally found the cause. My claim was wrong.

Here is the test:

 fid = fopen('test_chars.txt');
 tmp = fread(fid,1,'uint8');
 class(tmp)

I get:

 ans =
    'double'

The precision 'uint8' that I pass to fread() only affects how MATLAB parses the data in the file, not the class of the values it returns.
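
For readers who want to keep the integer class, fread() accepts a 'source=>output' precision string. The sketch below writes a small hypothetical file (`fread_demo.bin`) and reads it back both ways; the file name and the data are made up for illustration.

```matlab
% Sketch: keep the integer class when reading with fread.
fname = 'fread_demo.bin';                     % hypothetical file name
fid = fopen( fname, 'w' );
fwrite( fid, uint8( 1:10 ), 'uint8' );
fclose( fid );

fid = fopen( fname, 'r' );
asDouble = fread( fid, 5, 'uint8' );          % parsed as uint8, returned as double
asUint8  = fread( fid, 5, 'uint8=>uint8' );   % parsed and returned as uint8
fclose( fid );

class( asDouble )
class( asUint8 )
```

Saving `asUint8` rather than `asDouble` avoids the eightfold growth in element size that I mistook for a conversion done by save.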

------------------------------------

Here is a more interesting question I mentioned before:

Have you ever tried to put each row of the matrix into a cell and then read only one row?

This time I will explain with code.

Firstly, create a big matrix consisting of random numbers:

randMatrix = rand(100,500000);

Secondly, put each row of the matrix into a cell:

randCell = cell(100,1);  
for i = 1:size(randMatrix,1)
    randCell{i} = randMatrix(i,:);
end
randCell

We get a [100 x 1] cell array and each cell is a [1 x 500000] double.

randCell =
  100×1 cell array
    [1×500000 double]
    [1×500000 double]
    [1×500000 double]
    ...
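
As a side note, the loop above can probably be replaced by a single call to num2cell. The sketch below uses a small stand-in matrix (`smallMatrix` is a made-up name) so that it runs quickly; the second argument tells num2cell to keep dimension 2 together, so each cell holds one full row.

```matlab
% Sketch: split a matrix into one cell per row without an explicit loop.
smallMatrix = rand( 4, 6 );                % small stand-in for randMatrix
rowCells = num2cell( smallMatrix, 2 );     % 4x1 cell array, each cell is 1x6
isequal( rowCells{3}, smallMatrix( 3, : ) )
```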

Thirdly, save the two variables into MAT-files:

save('randMatrix.mat','randMatrix','-v7.3');
save('randCell.mat','randCell','-v7.3');

Then read the first row of each file and compare the time cost:

First file (saving matrix directly):

tic;
mfMatrix = matfile('randMatrix.mat');
tmp1 = mfMatrix.randMatrix(1,:);
class(tmp1)
size(tmp1)
toc;

And it takes a lot of time:

ans =
    'double'
ans =
           1      500000
Elapsed time is 2.930694 seconds.

Second file (saved as a cell array):

tic;
mfCell = matfile('randCell.mat');
tmp  = mfCell.randCell(1,1); % Get the first cell containing a row vector
tmp2 = tmp{1,:};             % Assign to a row vector
class(tmp2)
size(tmp2)
toc;

And it takes much less time:

ans =
    'double'
ans =
           1      500000
Elapsed time is 0.055028 seconds.

The absolute time cost may change among different tests, but the performance differences are always huge.

OK, let's check whether what we get in two ways are the same:

isequal(tmp1,tmp2)

Of course yes:

ans =
  logical
  1

This is what I really wanted to say:

A matrix saved as a cell array of rows can be read faster than the same matrix saved directly.

I did not explain my idea well in my previous post.

I would be glad to hear your opinion on this result, and I hope you can point out anything I may have overlooked.

Mitchell Tillman on 27 Aug 2021


This method worked for me too! But I'm thinking maybe it's only true for numeric array variables, not arrays within structs? I have just checked this method with a 8GB mat file comprised of one struct, and it increased my file size by 30% and the load time too. The struct is originally formatted in a matrix as below. Then, I stored each row of each data stream as a cell. Of course, I do have a small amount of metadata/other data also stored inside this struct.

Does anyone have any insight as to why @Isaac Asimov's method wouldn't work to save space/speed up loading arrays inside of a struct?

for subNum=1:10; % 10 subjects
    for trialNum=1:50; % 50 trials per subject
        for dataStreamNum=1:50; % 50 data streams per subject
            dataMatrix=rand(3,3000); % Each data stream is 3x3000
            structName.Subject(subNum).Trial(trialNum).Data(dataStreamNum).Matrix=dataMatrix; % Data in matrix form
            structName.Subject(subNum).Trial(trialNum).Data(dataStreamNum).Cell{1,1}=dataMatrix(1,:); % Cell row 1 of datastream
            structName.Subject(subNum).Trial(trialNum).Data(dataStreamNum).Cell{1,2}=dataMatrix(2,:); % Cell row 2 of datastream
            structName.Subject(subNum).Trial(trialNum).Data(dataStreamNum).Cell{1,3}=dataMatrix(3,:); % Cell row 3 of datastream
        end
    end
end


Answer 3

per isakson on 5 Jul 2013


Edited: per isakson on 14 Sep 2021


I'm neither surprised nor shocked.

Which OS, and how much RAM installed?

Your matrix is large

    >> 64000*31250*4/1e9
    ans =
         8

that is 8GB.

7.3 file format "is" HDF5.

A little experiment to show that RAM is important. (R2012a 64bit, 8GB, Windows 7)

    N = 1e4;
    filespec = 'matfile_test.mat';
    mat = rand( N, 'single' );
    save( filespec, 'mat', '-v7.3' )
    tic, 
    h5m = h5read( filespec, '/mat', [1,1], [N,1] ); toc
    tic,
    obj = matfile( filespec );
    mfm = obj.mat( :, 1 );
    toc
    d5m  = h5m-mat( :, 1 );
    dfm  = mfm-mat( :, 1 );
    max(abs(d5m(:)))
    max(abs(dfm(:)))

returns

    Elapsed time is 3.214658 seconds.
    Elapsed time is 3.499495 seconds.
    ans =
         0
    ans =
         0

Create a variable just to use RAM

>> buf = zeros( N );

and rerun the script, which now returns

    Elapsed time is 52.967529 seconds.
    Elapsed time is 52.730371 seconds.
    ans =
         0
    ans =
         0

Watch the Windows Task Manager|Performance during the reading.

9 Comments

Matt J on 6 Jul 2013

I seem to remember that there were ways to make the hard disk act as "fake RAM" under 32-bit MATLAB.

Michael on 23 Jul 2013

It's 64-bit MATLAB.


Answer 4

Jason Climer on 11 Apr 2018


Edited: per isakson on 11 Apr 2018

I was also having problems with loading large files and traced the problem to the genericWho function in matlab.io.MatFile.

https://www.mathworks.com/matlabcentral/answers/394201-matfile-loads-variables-very-slowly#answer_314588


Answer 5

Thomas Richner on 8 Jul 2019


Try Tim Holy's savefast

https://www.mathworks.com/matlabcentral/fileexchange/39721-save-mat-files-more-quickly

And as a replacement for matfile, try using h5create and h5write directly. They are lower level, which is annoying, but they are faster. You can specify the block (chunk) size and compression using h5create, which lets you pick your trade-off between column and row access. I did some benchmarking and found that a block of [64 64] gives reasonable performance for 2D arrays when you later need to read back rows or columns.
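
A minimal sketch of this approach, under stated assumptions: the file name (`chunk_demo.h5`), dataset name (`/data`), array size, and compression level are all made up for illustration; only the [64 64] block size comes from the benchmarking mentioned above. Timings will depend on your data and disk, so none are claimed here.

```matlab
% Sketch: write a 2D array to a chunked, compressed HDF5 dataset and read
% back one row and one column. File and dataset names are hypothetical.
fname = 'chunk_demo.h5';
dset  = '/data';
A = rand( 512, 768, 'single' );

if exist( fname, 'file' ), delete( fname ); end   % h5create errors if the dataset exists
h5create( fname, dset, size( A ), ...
          'Datatype', 'single', ...
          'ChunkSize', [64 64], ...               % block size suggested above
          'Deflate', 1 );                         % light gzip compression (assumed level)
h5write( fname, dset, A );

oneRow = h5read( fname, dset, [10 1], [1 size( A, 2 )] );   % row 10
oneCol = h5read( fname, dset, [1 10], [size( A, 1 ) 1] );   % column 10
```

With square chunks, a row read and a column read each touch a similar number of chunks, which is the trade-off described above. Note that a file written this way is a plain HDF5 file, so it is read back with h5read rather than with matfile or load.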
