Parallel computation leaking memory and slowing a loop

25 views (last 30 days)
I am doing a parallel computation, specifically a row-by-row integration of a symbolic matrix (i.e., I am integrating 4 rows at a time on my quad-core machine). Two curious things occur:
(1) As the computation progresses, the free RAM steadily gets used up (see graph below). The initial surge in RAM usage is expected: it is the memory actually required by the computation. However, I am at a loss to explain why the remaining RAM then steadily continues to be consumed. Eventually, after maybe an hour of computation and many rows integrated, the RAM is exhausted. MATLAB then starts to use virtual memory on the HDD, which reduces computation speed to unacceptable levels (a roughly 5-fold slowdown, with further slowing over time). I can circumvent this problem by dividing my matrix into smaller matrices (with fewer rows) that can each be integrated before swapping to disk begins. However, in my estimation this should not be necessary: I have plenty of RAM (16 GB), and the actual computation only requires about 2900 MB judging by the kink in the graph. Is MATLAB saving output to RAM? What else could be disrupting the memory allocation? CPU utilization never gets above ~65% and is typically around 40%.
(2) Much more concerning: before the RAM is used up, an integration typically takes about 60 s per row. However, I just discovered that if I close the worker pool and effectively "reset" my processors (or simply restart MATLAB), and then manually run the exact same read/compute/write that is done in the loop, the computation takes only a fraction of a second (see command window dialogue below). I suspect this may be related to (1). This is completely unacceptable; why might this be happening?
MATLAB command window output for integration of the first 4 rows:
Starting matlabpool using the 'local' profile ... connected to 4 workers.
K_a_12
Elapsed time is 8.039642 seconds.
Commencing integration...
currently integrating row 1Elapsed time is 63.642951 seconds.
currently integrating row 3Elapsed time is 64.052436 seconds.
currently integrating row 4Elapsed time is 63.787564 seconds.
currently integrating row 2Elapsed time is 64.686729 seconds.
>> tic
K_a_12(2,:)=2*h_a*int(int(K_a_12_unintegrated(2,:),y,0,W),x);
toc
Elapsed time is 49.989623 seconds.
>> matlabpool close
Sending a stop signal to all the workers ... stopped.
>> %Assign 4 cores
if matlabpool('size') == 0
matlabpool open 4
end
Starting matlabpool using the 'local' profile ... connected to 4 workers.
>> tic
K_a_12(2,:)=2*h_a*int(int(K_a_12_unintegrated(2,:),y,0,W),x);
toc
Elapsed time is 0.360260 seconds.
>>
My source code is attached below (note that the Symbolic Math Toolbox is required). If you do not have a quad-core processor, change matlabpool open 4 to matlabpool open [# of processors on your machine]:
%%Initialization
%Clear variables
clear;
clc;
%Assign 4 cores
if matlabpool('size') == 0
    matlabpool open 4
end
%%Problem description
syms G_a E_a h_a L1 L2 La W h_1 h_2 A1 B1 D1 G1 a11 a12 a16 a22 a26 a66 b11 b12 b16 b22 b26 b66 d11 d12 d16 d22 d26 d66 a44 a45 a55
% Laminate 1
A1=[a11,a12,a16;a12,a22,a26;a16,a26,a66];
B1=zeros(3);
D1=[d11,d12,d16;d12,d22,d26;d16,d26,d66];
G1=[a44,a45;a45,a55];
%Laminate 2
A2=A1;
B2=B1;
D2=D1;
G2=G1;
%%Assumed Displacement Field
m=10;
n=5;
syms x y z
B11 = sym(zeros(m+1, 1));
B21 = sym(zeros(m+1, 1));
B12 = sym(zeros(n+1, 1));
B22 = sym(zeros(n+1, 1));
for i = 0:m
    A = factorial(m)/(factorial(i)*factorial(m-i))*((x-(L2-La))/L1)^i*(1-(x-(L2-La))/L1)^(m-i);
    B11(i+1) = A;
end
for i = 0:n
    B = factorial(n)/(factorial(i)*factorial(n-i))*(y/W)^i*(1-y/W)^(n-i);
    B12(i+1) = B;
end
for i = 0:m
    A = factorial(m)/(factorial(i)*factorial(m-i))*(x/L2)^i*(1-x/L2)^(m-i);
    B21(i+1) = A;
end
for i = 0:n
    B = factorial(n)/(factorial(i)*factorial(n-i))*(y/W)^i*(1-y/W)^(n-i);
    B22(i+1) = B;
end
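Incidentally, the basis functions built in these loops are Bernstein polynomials, which form a partition of unity. If you want a quick sanity check on the construction, something like the following should hold (this check is my addition, not part of the original script):

```matlab
% Sanity check (my addition): the entries of B11 are the degree-m
% Bernstein polynomials in t = (x-(L2-La))/L1, and the entries of B12
% are the degree-n Bernstein polynomials in y/W, so each set should
% sum symbolically to 1.
assert(isAlways(simplify(sum(B11)) == 1));
assert(isAlways(simplify(sum(B12)) == 1));
```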
zero_vector=zeros((m+1)*(n+1),1);
counter=0;
for i = 0:m
    for j = 0:n
        counter = counter+1;
        C = B11(i+1)*B12(j+1);
        Vx1(counter,1) = C;  %x-displacement
        Vy1(counter,1) = C;  %y-displacement
        Vz1(counter,1) = C;  %z-displacement
        Vrx1(counter,1) = C; %x-rotation
        Vry1(counter,1) = C; %y-rotation
    end
end
Vx1=[Vx1;zero_vector;zero_vector;zero_vector;zero_vector];
Vy1=[zero_vector;Vy1;zero_vector;zero_vector;zero_vector];
Vz1=[zero_vector;zero_vector;Vz1;zero_vector;zero_vector];
Vrx1=[zero_vector;zero_vector;zero_vector;Vrx1;zero_vector];
Vry1=[zero_vector;zero_vector;zero_vector;zero_vector;Vry1];
counter=0;
for i = 0:m
    for j = 0:n
        counter = counter+1;
        C = B21(i+1)*B22(j+1);
        Vx2(counter,1) = C;  %x-displacement
        Vy2(counter,1) = C;  %y-displacement
        Vz2(counter,1) = C;  %z-displacement
        Vrx2(counter,1) = C; %x-rotation
        Vry2(counter,1) = C; %y-rotation
    end
end
Vx2=[Vx2;zero_vector;zero_vector;zero_vector;zero_vector];
Vy2=[zero_vector;Vy2;zero_vector;zero_vector;zero_vector];
Vz2=[zero_vector;zero_vector;Vz2;zero_vector;zero_vector];
Vrx2=[zero_vector;zero_vector;zero_vector;Vrx2;zero_vector];
Vry2=[zero_vector;zero_vector;zero_vector;zero_vector;Vry2];
Vxx1=diff(Vx1,x);
Vyy1=diff(Vy1,y);
Vxy1=diff(Vx1,y);
Vyx1=diff(Vy1,x);
Vzx1=diff(Vz1,x);
Vzy1=diff(Vz1,y);
Vrxx1=diff(Vrx1,x);
Vryy1=diff(Vry1,y);
Vrxy1=diff(Vrx1,y);
Vryx1=diff(Vry1,x);
Vxx2=diff(Vx2,x);
Vyy2=diff(Vy2,y);
Vxy2=diff(Vx2,y);
Vyx2=diff(Vy2,x);
Vzx2=diff(Vz2,x);
Vzy2=diff(Vz2,y);
Vrxx2=diff(Vrx2,x);
Vryy2=diff(Vry2,y);
Vrxy2=diff(Vrx2,y);
Vryx2=diff(Vry2,x);
%Strain matrices
B_epsilon_1=[Vxx1.';Vyy1.';Vxy1.'+Vyx1.'];
B_kappa_1=[Vrxx1.';Vryy1.';Vrxy1.'+Vryx1.'];
B_gamma_1=[Vzy1.'-Vry1.';Vzx1.'-Vrx1.'];
B_epsilon_2=[Vxx2.';Vyy2.';Vxy2.'+Vyx2.'];
B_kappa_2=[Vrxx2.';Vryy2.';Vrxy2.'+Vryx2.'];
B_gamma_2=[Vzy2.'-Vry2.';Vzx2.'-Vrx2.'];
B_a_1=1/(2*h_a)*[(-Vz1).';(-Vx1).'+h_1*Vrx1.';(-Vy1).'+h_1*Vry1.'];
B_a_2=1/(2*h_a)*[Vz2.';Vx2.'+h_2*Vrx2.';Vy2.'+h_2*Vry2.'];
B_1=[B_epsilon_1;B_kappa_1;B_gamma_1];
B_2=[B_epsilon_2;B_kappa_2;B_gamma_2];
C1=[A1,B1,zeros(3,2);B1,D1,zeros(3,2);zeros(2,6),G1];
C2=[A2,B2,zeros(3,2);B2,D2,zeros(3,2);zeros(2,6),G2];
Ca=[E_a,0,0;0,G_a,0;0,0,G_a];
%------------------------------------
%!!! THIS IS WHERE THE PROBLEMS OCCUR
%------------------------------------
%K_a_12
disp('K_a_12')
tic
K_a_12_unintegrated=B_a_1.'*Ca*B_a_2;
K_a_12_unintegrated = reshape(K_a_12_unintegrated,1089,100);
K_a_12=sym(zeros(1089,100));
toc
fprintf('Commencing integration...\n');
parfor i = 1:1089
    tic
    fprintf('currently integrating row %d',i);
    K_a_12(i,:) = 2*h_a*int(int(K_a_12_unintegrated(i,:),y,0,W),x);
    toc
end
return;
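As a workaround sketch (not a fix for the underlying leak): the chunking strategy mentioned in the question can be combined with a pool restart between chunks, so that memory accumulated on the workers is released before swapping to disk begins. The chunk size of 100 below is an arbitrary choice; tune it so one chunk fits comfortably in RAM:

```matlab
% Sketch: integrate the 1089 rows in chunks, tearing down and rebuilding
% the worker pool between chunks so that leaked worker memory is
% returned to the OS. chunkSize is an arbitrary tuning parameter.
chunkSize = 100;
nRows = 1089;
for startRow = 1:chunkSize:nRows
    stopRow = min(startRow + chunkSize - 1, nRows);
    parfor i = startRow:stopRow
        K_a_12(i,:) = 2*h_a*int(int(K_a_12_unintegrated(i,:),y,0,W),x);
    end
    % Restart the pool to reclaim worker memory before the next chunk
    matlabpool close
    matlabpool open 4
end
```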

Answers (4)

Ben
Ben on 29 Oct 2013
Edited: Ben on 29 Oct 2013
I can confirm a memory leak using parfor on R2013b Linux. It is avoidable by using a regular for loop, and fixable, albeit slowly, by closing matlabpool and re-opening it.

Robert Niemeyer
Robert Niemeyer on 28 Dec 2015
My own research uses the Symbolic Math Toolbox, and here is what I've encountered. Similar to your problems, I believe.
I'm doing work in mathematical billiards (specifically, fractal billiards), and I use a parallel for loop to determine where the billiard ball (a point mass) will collide with the boundary next. Basically, the parfor loop solves the problem of determining where a line intersects a convoluted planar billiard table (a pre-fractal approximation). This is not really relevant to your problem, but it provides some background. The reason for the parfor loop is to speed up computation on a multi-core processor; I usually get about a 5-6x improvement in speed.
But the memory blows up considerably. I use all 18 GB of RAM available on my system in about 24 hours of running a simulation. It makes no sense. However (and I can't remember where I read this), the problem is that MuPAD (the underlying computer algebra system used by the Symbolic Math Toolbox) stores EVERY SINGLE MINOR SYMBOLIC COMPUTATION your code has done, and then some. The result is a massive list of symbolic computations that can be cleared with reset(symengine) and clear all.
The catch is that when you reset the symbolic engine and clear symbolic variables, you need to re-declare them before they are used again. In my case, I reset the symengine and clear all variables after each collision in a boundary with (let's say) 12*2^n segments, where n is a positive integer. The result of clearing the variables and resetting the symbolic engine is a dramatic saving in memory (now only about 200+ MB instead of 18 GB) over the course of what should be a 24-hour run.
But two things happen:
1) Clearing and resetting costs time, and with 10,000+ bounces in a convoluted shape, 1 additional second per bounce adds nearly 3 hours of extra computing time; and
2) The simulation doesn't actually run, because the parfor loop does not seem to behave well with 10,000+ resets of the symbolic engine.
My guess is that when MATLAB parallelizes, it doesn't keep track of what it sends where (i.e., how each worker is accessing the memory and regrouping after each computation). MATLAB sends something to MuPAD; there are 12 instances of MuPAD open on a 6-core system (with two threads per core), each trying to access a list of computations, and some error results.
This may or may not be your problem, but I was thoroughly frustrated with the memory issue until I realized (I'm testing it right now as we speak, er, write) that the parfor loop was causing the problem.
So, dropping the parfor loop allows the symbolic engine to be reset and variables cleared without trouble. Since your code does use a parfor loop, avoiding it may put you in better shape. Just a suggestion. Please let me know if I can explain anything better; this is my first time commenting on something like this.
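For concreteness, the reset pattern described above would look roughly like this in a serial loop (the loop counts and symbol names are placeholders; the key point is that after reset(symengine), previously created sym objects are invalid and must be re-declared):

```matlab
% Sketch of the periodic-reset pattern: reset the symbolic engine in a
% *serial* loop to keep MuPAD's stored computation list bounded. After
% reset(symengine), existing sym objects are invalid, so the symbols
% are cleared and re-declared.
nBounces   = 10000;   % placeholder for the real iteration count
resetEvery = 100;     % arbitrary interval; trades memory against reset time
syms x y
for k = 1:nBounces
    % ... one symbolic computation step would go here ...
    if mod(k, resetEvery) == 0
        reset(symengine);
        clear x y
        syms x y      % re-declare whichever symbols are still needed
    end
end
```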

Arvid Terzibaschian
Arvid Terzibaschian on 16 Oct 2014
I must confirm the memory leak as well. Running on a 64-bit Linux cluster, version 8.1.0.604 (R2013a). Closing and reopening the pool fixes the problem.
It becomes really nasty if you work on a shared cluster and have to wait several hours for a matlabpool to open because of task queues. That renders the MATLAB Parallel Computing Toolbox practically unusable for big-memory computations under those circumstances.
A fix or improved workaround would really be appreciated.

ervinshiznit
ervinshiznit on 18 Jun 2018
I still experience this in MATLAB R2018a on Linux.
I agree with Robert's conclusion. Unfortunately, if I run my program without a parfor, it simply takes too long to finish. I had not considered resetting the symbolic engine, though considering I do about 60,000 iterations of the same computation (with random input), adding one second to each iteration would probably be brutal.
Fortunately, because this is a Monte Carlo simulation, I can just run the program multiple times with fewer iterations each time, and have it finish before it starts running out of memory. I will say, though, that it does eventually run out of memory if I tell it to run too many iterations, despite my having 32 GB of RAM!
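The batch-and-restart approach described above can be scripted. A rough sketch using the post-R2013b pool API (parpool/gcp), where runOneBatch, the batch size, and the file names are all hypothetical:

```matlab
% Sketch: split a long Monte Carlo run into batches, saving results and
% rebuilding the pool between batches so worker memory never runs out.
% runOneBatch is a hypothetical stand-in for the actual simulation code.
totalIters = 60000;
batchSize  = 5000;    % arbitrary; choose so one batch fits in RAM
for b = 1:ceil(totalIters/batchSize)
    results = runOneBatch(batchSize);   % hypothetical simulation call
    save(sprintf('mc_batch_%03d.mat', b), 'results');
    delete(gcp('nocreate'));            % shut down the current pool
    parpool(4);                         % start a fresh one
end
```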
