How to preallocate memory for storing data in same mat file?

Hi, I wrote the code below and I would like to preallocate memory so that it runs faster. I know that once I preallocate I can no longer append and need to index into the array to store the output. Can you suggest how to get the output for the code below?
Here f is a 1*5449 double. The final output is a 5449*5449 double.
clc;
n=1; %system order
m=1; %number of inputs
p=6;%number of outputs
Final = [];
for i = 1:7783
    for j = 1:50
        if exist(['ID_',num2str(i),'_file_',num2str(j),'_Variables','.mat'],'file')
            load(['ID_',num2str(i),'_file_',num2str(j),'_Variables','.mat']);
            A1 = A{1};
            A1 = A1 / max(abs(eig(A1)));
            B1 = B{1};
            C1 = C{1};
            index = 1;
            for k = 1:7783
                for l = 1:50
                    if exist(['ID_',num2str(k),'_file_',num2str(l),'_Variables','.mat'],'file')
                        load(['ID_',num2str(k),'_file_',num2str(l),'_Variables','.mat']);
                        A2 = A{1};
                        A2 = A2 / max(abs(eig(A2)));
                        B2 = B{1};
                        C2 = C{1};
                        f(index) = distance1_matlab(A1,A2,B1,B2,C1,C2);
                        index = index + 1;
                    end
                end
            end
            Final = [Final;f];
        end
    end
end
save('Distance','Final');

5 Comments

Hi,
This program accesses 5449 data files. The output 'Final' will be a 5449*5449 matrix.
As commented in your other question, preallocation is the least of your worries right now. Your problem is the 29,691,601 file reads (29 million, for only 5449 files!) and the associated eig calculations, nearly all of them unnecessary.
Also, since only 5449 files actually exist out of the 389,150 that you're testing, it would be simpler to just ask the OS for the list of files in the directory. Are there any other mat files in that directory and, if yes, do these extra files match the pattern ID_*file_*_Variables or not? Answering no to either question will simplify things.
Also, what is the size of the cell arrays A, B and C (useless names!) in each mat file? (And why is the data stored in a cell array if you only use one cell?)
Oh, and what is the function distance1_matlab?
Thanks. I changed the program to this. I think this is faster. A is 10*10 double, B is 1*10 and C is 6*10. Now the cell arrays f, o and g are 1*5449.
clc;
n=10; %system order
m=1; %number of inputs
p=6;%number of outputs
Final = [];
k = 1;
for i = 1:7783
    for j = 1:50
        if exist(['ID_',num2str(i),'_file_',num2str(j),'_Variables','.mat'],'file')
            load(['ID_',num2str(i),'_file_',num2str(j),'_Variables','.mat']);
            f{k} = A{1};
            o{k} = B{1};
            g{k} = C{1};
            k = k+1;
        end
    end
end
save('Rescaled_A_Values_All_States','f');
save('Rescaled_B_Values_All_States','o');
save('Rescaled_C_Values_All_States','g');
for c = 1:5449
    A1 = f{c};
    A1 = A1 / max(abs(eig(A1)));
    B1 = o{c};
    C1 = g{c};
    index = 1;
    for d = 1:5449
        A2 = f{d};
        A2 = A2 / max(abs(eig(A2)));
        B2 = o{d};
        C2 = g{d};
        q(index) = distance1_matlab(A1,A2,B1,B2,C1,C2);
        index = index + 1;
    end
    Final = [Final;q];
end
Well, yes, it's going to be much faster. You're reading each file only once. You're still doing N^2 eig calls and related calculations where N would be enough, since each A only needs to be normalised once. And nearly 99% of the files you test for existence don't exist, so it'd be faster to call dir and let the OS tell you which files are there.
Finally, depending on what distance1_matlab does, it may well be that your 2nd loop is not needed.
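The original question (preallocating and storing by index) together with the remark about the repeated eig calls can be sketched as follows. This is only an illustration under stated assumptions: distfun is a made-up placeholder for distance1_matlab, whose definition is never given in this thread, and the cell arrays f, o and g are filled with random data of the sizes quoted above instead of being loaded from the mat files.

```matlab
% Hypothetical placeholder for distance1_matlab (its real definition is not shown here).
distfun = @(A1, A2, B1, B2, C1, C2) norm(A1 - A2, 'fro');

% Synthetic stand-ins for the cell arrays f, o and g built by the loading loop.
N = 4;                          % 5449 in the real data set
f = cell(1, N); o = cell(1, N); g = cell(1, N);
for k = 1:N
    f{k} = rand(10, 10);        % A: 10x10
    o{k} = rand(1, 10);         % B: 1x10
    g{k} = rand(6, 10);         % C: 6x10
end

% Normalise each A once, up front: N calls to eig instead of N^2.
for k = 1:N
    f{k} = f{k} / max(abs(eig(f{k})));
end

% Preallocate the full result, then store by index instead of appending rows.
Final = zeros(N, N);
for c = 1:N
    for d = 1:N
        Final(c, d) = distfun(f{c}, f{d}, o{c}, o{d}, g{c}, g{d});
    end
end
```

With Final allocated once by zeros(N, N), the array never has to be regrown inside the loop, which is what Final = [Final;q] forces on every pass.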


 Accepted Answer

Depending on what distance1_matlab does, this code could be significantly improved.
I'm also assuming that all files that match the pattern 'ID_*_file_*_Variables.mat' need to be loaded.
filelist = dir('ID_*_file_*_Variables.mat'); %get list of files that exist
fileids = regexp({filelist.name}, 'ID_(\d+)_file_(\d+)_', 'tokens', 'once'); %extract numeric ids as text
fileids = str2double(vertcat(fileids{:})); %and convert to numeric
%you may want to sort fileids and filelist to match the order of your original loops
%it's trivial to do. For now I assume it does not matter.
filedata = struct('A', cell(numel(filelist), 1), 'B', [], 'C', []); %preallocate structure to receive file content and final result
%note that A, B and C are very poor field names.
for fileiter = 1:numel(filelist)
    filecontent = load(filelist(fileiter).name); %load into a structure instead of the workspace
    filedata(fileiter).A = filecontent.A{1} / max(abs(eig(filecontent.A{1}))); %normalise once per file
    filedata(fileiter).B = filecontent.B{1};
    filedata(fileiter).C = filecontent.C{1};
end
[idx1, idx2] = ndgrid(1:numel(filedata)); %cartesian product of all file indices with themselves (ndgrid takes numeric grid vectors, so the grid is built on indices rather than on the structure array)
distance = arrayfun(@(i1, i2) distance1_matlab(filedata(i1).A, filedata(i2).A, filedata(i1).B, filedata(i2).B, filedata(i1).C, filedata(i2).C), idx1, idx2); %assumes that the result of distance1_matlab is scalar
Note that the last line assumes distance1_matlab returns a scalar. If not, change it to:
distance = arrayfun(@(i1, i2) distance1_matlab(filedata(i1).A, filedata(i2).A, filedata(i1).B, filedata(i2).B, filedata(i1).C, filedata(i2).C), idx1, idx2, 'UniformOutput', false);
If you want the result in the same form as your original Final, then:
Final = distance; %if scalar result: distance already has one row per file, like Final (distance(:) would flatten it into a single column)
distance = vertcat(distance{:}); %otherwise
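The sorting step mentioned in the code comments could look like the sketch below. The file names are invented for illustration, and the struct array only mimics the name field of what dir returns; sortrows orders by the first column (ID) and breaks ties with the second (file number), which matches the nesting of the original i and j loops.

```matlab
% Hypothetical file names, in an order dir might return them (character order, not numeric).
names = {'ID_10_file_1_Variables.mat', 'ID_2_file_11_Variables.mat', 'ID_2_file_3_Variables.mat'};
filelist = struct('name', names(:));         % mimics the 'name' field of the dir output

fileids = regexp({filelist.name}, 'ID_(\d+)_file_(\d+)_', 'tokens', 'once');
fileids = str2double(vertcat(fileids{:}));   % one row per file: [ID, file number]

% Sort by ID, then by file number, and reorder the file list the same way.
[fileids, order] = sortrows(fileids);
filelist = filelist(order);
```

Sorting the numeric ids rather than the names avoids the usual trap where 'ID_10' sorts before 'ID_2' as text.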

2 Comments

@Guillaume
Can I use parfor instead of for to speed up execution with parallel processing? Do the loops synchronize?
I doubt that using parfor for the loading loop would help much. The slow part of that is not the processor but the disk access. If anything, it's possible that parfor will slow things down as parallel threads compete for disk access. You'll only know if you try.
I don't know if the parallel toolbox can parallelise arrayfun (I don't have the toolbox). arrayfun is a for loop in disguise. Parallelising that code could certainly result in a speed-up.
However, as I've said (twice now), depending on what distance1_matlab does, it's likely that this 2nd loop/arrayfun is not needed at all and that the function can be vectorised. This would probably be the most efficient way to improve your code. Hence why I asked for the details of this function.
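For reference, one way the arrayfun could be rewritten as an explicit loop that parfor accepts is sketched below. It is untested against the real data: distfun is again a made-up placeholder for distance1_matlab, filedata is filled with random matrices, and whether this is actually faster is unknown. Without the Parallel Computing Toolbox, parfor simply runs as an ordinary for loop.

```matlab
% Hypothetical placeholder for distance1_matlab (its real definition is not shown here).
distfun = @(A1, A2, B1, B2, C1, C2) norm(A1 - A2, 'fro');

% Synthetic stand-in for the filedata structure array filled by the loading loop.
N = 4;                                   % 5449 in the real data set
filedata = struct('A', cell(N, 1), 'B', [], 'C', []);
for k = 1:N
    A = rand(10, 10);
    filedata(k).A = A / max(abs(eig(A)));
    filedata(k).B = rand(1, 10);
    filedata(k).C = rand(6, 10);
end

% Each parfor iteration fills one preallocated row; iterations are independent of each other.
distance = zeros(N, N);
parfor r = 1:N
    s1 = filedata(r);
    row = zeros(1, N);
    for c = 1:N
        s2 = filedata(c);
        row(c) = distfun(s1.A, s2.A, s1.B, s2.B, s1.C, s2.C);
    end
    distance(r, :) = row;                % sliced assignment, one row per iteration
end
```

Building each row in a temporary vector and assigning it in one go keeps distance a sliced output variable, which is the form parfor expects; the iterations need no synchronisation because no row depends on another.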


Release

R2018b

Asked:

on 20 Oct 2018

Commented:

on 26 Oct 2018
