How to operate with large arrays of structs

Hello, I am doing a Markov Chain Monte Carlo Simulation where I want to store many sampled states. I have the following data structure:
state(1) = struct('dim', 3 ,'coords',rand(3,1), 'vals', rand(3,1));
state(10000) = struct('dim', [], 'coords', [], 'vals', []);
for i = 2:10000
state(i) = generateNewState(state(i-1));
end
How can I store my generated state data and proceed with the next 10000 states, then append them to the existing .mat file, and continue until I have generated, say, 1e10 states? And how can I then use the data for calculations? My problem is that the dimension of the struct (up to 10000) is not fixed. The other problem is that I don't want to load the whole .mat file into memory, since it wouldn't fit. I would like to process the data in chunks. By processing I mean calculating the mean, variance, covariance, max, and min, extracting every 100th sample, creating a histogram without knowing the domain, etc.
I already tried the map-reduce formalism, but there I had to limit myself to a maximum dimension and pad every struct of smaller dimension with NaNs in order to store the structs as a table in a CSV file. This can't be the right way to do it, because I might need only 10 dimensions while 10000 are theoretically possible, so I would end up with a really sparse table. It just depends on the data, which I don't know in advance. Does anybody have a good idea how to solve this?
Thanks in advance!

4 Comments

Note that using large arrays of structures (or a cell array of numerous matrices) is extremely memory-inefficient in MATLAB. Each field of each element of the structure carries the overhead of a matrix, so with 10,000 elements and 3 fields, you have 30,000 times that overhead.
If the corresponding fields of each struct array element are always the same, you're better off using a big matrix for each field and a scalar struct.
%on R2015b
coord = rand(3, 10000); %one big 3x10000 array
state = struct('coord', num2cell(rand(3, 10000), 1)); %a 1x10000 struct with a 3x1 field
whos coord state
  Name       Size            Bytes  Class     Attributes
  coord      3x10000        240000  double
  state      1x10000       1360064  struct
The structure uses nearly 6 times the amount of memory for the same content. Note that whos does not show the overhead for plain matrices, so coord actually uses a bit more memory than reported.
I think the large overhead comes from the cell structure, which I am not using. I compared the following data structure:
state(15000) = struct('x',[],'y',[],'z',[]);
for i = 1:15000
state(i) = struct('x',rand(1,10000),'y',rand(1,10000),'z',rand(1,10000));
end
This took about 3.36 GB of RAM, which is OK for me if I can write this data to a file on my hard drive and proceed with the next states. In comparison, the matrix representation:
coord = rand(15000,3*10000);
which took about 3.35 GB of RAM. So there is an overhead of only a few MB, which I don't really care about, since the struct makes my code a lot more readable. I also had the feeling that there was no performance issue: the two methods were roughly equally fast. So I am not really considering changing my data structure, since I don't see an advantage in it. In fact, the struct version was about 15% slower than the matrix approach, but this doesn't bother me too much.
I was just wondering whether I can use sparse matrices within a struct since many matrix entries would be 0.
I did the following more realistic comparison:
state(10000) = struct('x',[],'y',[],'z',[]);
for i = 1:10000
state(i) = struct('x',rand(1,10000),'y',rand(1,10000),'z',rand(1,10000));
end
vs.
coord(10000,30000) = 0;
for i = 1:10000
coord(i,:) = rand(1,3*10000);
end
My result was a bit unexpected, because the matrix version was a lot slower! So I will just stick to my struct version. I also experienced some weird memory behaviour. When I preallocated my matrix with "coord(10000,30000) = 0", I would see a linear increase in memory during the for loop. But when I preallocated with "coord = zeros(10000,30000)", I wouldn't see an instant increase in memory usage, and it would stay constant during the for loop. Also, the first option takes longer than the second one. So what happens internally?
The overhead has nothing to do with the cell. It's simply due to the fact that you allocate 15000x3 matrices for your structure, all of which need memory to track their size, type, etc.
With your example, the structure uses about 5 MB more (5,040,192 bytes exactly in 2015b) than the matrix.
But, yes, if the data you store takes over 3 GB, 5 MB becomes less significant.
You can of course store sparse matrices in a struct, but the overhead of the sparse matrices may be more than you save.
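As a rough illustration of that trade-off, here is a minimal sketch that stores the same mostly-zero vector in a struct as a full vector, as a sparse column, and as a sparse row. The field names and the 1% fill are assumptions made for the example. MATLAB stores sparse matrices column-wise and keeps one pointer per column, so a sparse 1x10000 row pays for 10000 column pointers no matter how few nonzeros it holds.

```matlab
% Assumption: a 10000-element vector with about 1% nonzero entries.
n = 10000;
v = zeros(n, 1);
v(1:100:end) = 1;                 % 100 nonzeros

s.dense     = v;                  % full column vector
s.sparseCol = sparse(v);          % sparse n-by-1 column
s.sparseRow = sparse(v.');        % sparse 1-by-n row

% Copy the fields to plain variables so whos can report their data size.
dense = s.dense; sparseCol = s.sparseCol; sparseRow = s.sparseRow;
infoDense = whos('dense');
infoCol   = whos('sparseCol');
infoRow   = whos('sparseRow');
fprintf('dense: %d, sparse column: %d, sparse row: %d bytes\n', ...
    infoDense.bytes, infoCol.bytes, infoRow.bytes);
```

If the fields in the examples above (1x10000 rows) were converted to sparse as they are, the column pointers alone would likely cancel the saving, so storing them as columns is worth trying first.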


Answers (1)

I do not understand the question. How can you store the data? The code shown works, doesn't it? So is the first question solved already? You proceed with the next 10'000 states by simply calling your code again. You can store the state variables in a cell array. There are different methods to append this to an existing MAT file, but a binary file seems to be more efficient in this case, especially if you want to read it only partially.
A compact and efficient file format could be:
number of dimensions as uint64
coordinates as double vector
vals as double vector
This can be read by a simple loop. You can skip a variable, or read as many variables as fit into memory. Using powerful MAT files for this job is far too complicated.
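To make the "skip a variable" idea concrete, here is a minimal sketch that keeps only every 100th record and jumps over the others with fseek. It assumes the record layout listed above (dim as uint64, then coords and vals as dim doubles each) with no file header. The demo file, the record count, and the variable names are made up for illustration.

```matlab
% Hypothetical demo file with 1000 records of varying dimension.
fileName = [tempname '.bin'];
fid = fopen(fileName, 'w');
if fid == -1, error('Cannot open file: %s', fileName); end
for k = 1:1000
    d = randi(5);
    fwrite(fid, d, 'uint64');
    fwrite(fid, rand(d, 1), 'double');   % coords
    fwrite(fid, rand(d, 1), 'double');   % vals
end
fclose(fid);

% Read every 100th record; skip the payload of all others.
step = 100;
kept = struct('dim', {}, 'coords', {}, 'vals', {});
fid = fopen(fileName, 'r');
if fid == -1, error('Cannot open file: %s', fileName); end
k = 0;
while true
    d = fread(fid, 1, 'uint64');
    if isempty(d), break; end            % end of file reached
    k = k + 1;
    if mod(k, step) == 0
        rec.dim    = d;
        rec.coords = fread(fid, d, 'double');
        rec.vals   = fread(fid, d, 'double');
        kept(end+1) = rec; %#ok<AGROW>
    else
        fseek(fid, 2 * d * 8, 'cof');    % two vectors of d doubles, 8 bytes each
    end
end
fclose(fid);
delete(fileName);
```

Only the small dim value of each skipped record is read, so memory use depends on the number of kept records rather than on the file size.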

2 Comments

Could you please give me some details on this? Are there any routines for saving a struct in a binary file? And especially for reading data from a binary file back into memory?
There are no standard functions for your specific job. But they are easy to write using fwrite and fread:
% For writing the array:
fid = fopen(FileName, 'w');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: Total number of elements:
fwrite(fid, numel(state), 'uint64');
for k = 1:numel(state)
    fwrite(fid, state(k).dim, 'uint64');
    fwrite(fid, state(k).coords, 'double');
    fwrite(fid, state(k).vals, 'double');
end
fclose(fid);
% For reading:
fid = fopen(FileName, 'r');
if fid == -1, error('Cannot open file: %s', FileName); end
% First value: Total number of elements (read exactly one value,
% otherwise fread consumes the whole file):
num = fread(fid, 1, 'uint64');
% Pre-allocate:
state(num) = struct('dim', [], 'coords', [], 'vals', []);
for k = 1:num
    % Each record starts with its dimension, followed by dim doubles
    % for coords and dim doubles for vals:
    dim = fread(fid, 1, 'uint64');
    state(k).dim = dim;
    state(k).coords = fread(fid, dim, 'double');
    state(k).vals = fread(fid, dim, 'double');
end
fclose(fid);
I cannot debug this, because I cannot run MATLAB currently. I think the strategy is clear, so please adjust this to your needs.
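For the chunked workflow asked about in the question, one possible variant (an assumption, not part of the code above) is to drop the leading element count, append each chunk with fopen(..., 'a'), and read until the end of the file. The sketch below does that and accumulates count, min, max, mean, and variance in a single pass using Welford's update. The random states stand in for generateNewState, and the tracked quantity x = vals(1) is a hypothetical choice; any scalar computed from a state could be used instead.

```matlab
fileName = [tempname '.bin'];            % hypothetical file name

% Generate two chunks and append each one to the same file.
for chunk = 1:2
    state = struct('dim', {}, 'coords', {}, 'vals', {});
    for k = 1:500
        d = randi(4);                    % stand-in for generateNewState
        state(k) = struct('dim', d, 'coords', rand(d, 1), 'vals', rand(d, 1));
    end
    fid = fopen(fileName, 'a');          % 'a' appends to existing content
    if fid == -1, error('Cannot open file: %s', fileName); end
    for k = 1:numel(state)
        fwrite(fid, state(k).dim, 'uint64');
        fwrite(fid, state(k).coords, 'double');
        fwrite(fid, state(k).vals, 'double');
    end
    fclose(fid);
end

% Stream through the file one record at a time with running statistics.
n = 0; meanX = 0; m2 = 0; minX = Inf; maxX = -Inf;
fid = fopen(fileName, 'r');
if fid == -1, error('Cannot open file: %s', fileName); end
while true
    d = fread(fid, 1, 'uint64');
    if isempty(d), break; end            % end of file reached
    fseek(fid, d * 8, 'cof');            % skip coords (d doubles)
    vals = fread(fid, d, 'double');
    x = vals(1);                         % hypothetical quantity of interest
    n = n + 1;
    delta = x - meanX;                   % Welford's online update
    meanX = meanX + delta / n;
    m2 = m2 + delta * (x - meanX);
    minX = min(minX, x);
    maxX = max(maxX, x);
end
fclose(fid);
varX = m2 / (n - 1);                     % sample variance
delete(fileName);
```

Because nothing but the current record is held in memory, the same loop should work for files far larger than RAM. A histogram over an unknown domain would need either a first pass for min and max or bins that grow as data arrives.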


Asked: on 29 Sep 2015
Commented: Jan on 30 Sep 2015
