parfor error - variable cannot be classified

I have been using MATLAB for a while, but I am new to parallel computing. I have a set of .m files that take a long time to run; after profiling the code and seeing which sections take up most of the time, I thought perhaps I could take advantage of parallel computing.
Here is one of those sections (sxyt is ~1 million, numFrames = 10, numParams is ~18).
numParams = length(a);
sxyt = size(xytIn, 1);
DxDyOut = zeros(sxyt, numFrames, 2, 'single');
osc = zeros(sxyt, numFrames, numParams, 'single');
mag = zeros(sxyt, numFrames, numParams, 'single');
t = zeros(sxyt, numFrames);
randomPhaseOffset = ones(1, numParams, 'single');
randomPhase = randomPhaseOffset .* rand(1, numParams) .* 2 .* pi;
for k = 1:numParams
    osc(:,:,k) = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
    mag(:,:,k) = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
    DxDyOut(:,:,1) = DxDyOut(:,:,1) + real( jonesx(k) .* osc(:,:,k) .* mag(:,:,k) );
    DxDyOut(:,:,2) = DxDyOut(:,:,2) + real( jonesy(k) .* osc(:,:,k) .* mag(:,:,k) );
end
When I turn the for loop into a parfor loop, I get the error "The variable DxDyOut in a parfor cannot be classified". I read the MATLAB help on parfor and a number of other posts, but still can't figure it out. Some earlier posts involved double for loops, which were solved by making the outer loop a parfor and leaving the inner one a for loop so the data is sliced. I don't get any errors on the osc and mag lines, which made me think the error has to do with the fact that I am accumulating the results in the variable DxDyOut. So, I changed the loop as follows:
A = zeros(sxyt, numFrames, numParams, 'single');
B = zeros(sxyt, numFrames, numParams, 'single');
DxDyOut2 = zeros(sxyt, numFrames, 2, 'single');
for k = 1:numParams
    osc(:,:,k) = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
    mag(:,:,k) = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
    A(:,:,k) = real( jonesx(k) .* osc(:,:,k) .* mag(:,:,k) );
    B(:,:,k) = real( jonesy(k) .* osc(:,:,k) .* mag(:,:,k) );
end
DxDyOut2(:,:,1) = sum(A, 3);
DxDyOut2(:,:,2) = sum(B, 3);
I no longer get the error, but this time the loop takes 10x longer to complete with parfor, and I noticed that the MATLAB workers used all of my 16 GB of RAM. The A, B, and mag arrays are about 750 MB each and osc is 1.5 GB, with all the variables in the workspace adding up to ~4.2 GB.
So I should have enough memory for two workers, and I don't understand why I am running out. Either way, it seems that when large arrays are manipulated, parallel computing won't help, because one runs into memory limits. I would appreciate any help.
A quick update: when both forms of the for loop are run, the results are identical, i.e., isequal(DxDyOut, DxDyOut2) = 1. However, when I convert the second form to parfor, isequal(DxDyOut, DxDyOut2) = 0. The first form won't work with parfor at all.

Answers (1)

Try
A = zeros(sxyt, numFrames, 'single');
B = zeros(sxyt, numFrames, 'single');
DxDyOut2 = zeros(sxyt, numFrames, 2, 'single');
for k = 1:numParams
    osc(:,:,k) = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
    mag(:,:,k) = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
    A = A + real( jonesx(k) .* osc(:,:,k) .* mag(:,:,k) );
    B = B + real( jonesy(k) .* osc(:,:,k) .* mag(:,:,k) );
end
DxDyOut2(:,:,1) = A;
DxDyOut2(:,:,2) = B;
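This form works with parfor because A and B are only ever updated as A = A + (expression), which parfor can classify as reduction variables; indexed accumulation such as DxDyOut2(:,:,1) = DxDyOut2(:,:,1) + ... fits neither the sliced nor the reduction pattern. A toy sketch (illustrative names and sizes, not taken from the code above):

```matlab
n = 8;
C = zeros(2, 2);
parfor k = 1:n
    C = C + k * ones(2, 2);      % OK: C is a reduction variable
    % D(:,:,1) = D(:,:,1) + k;   % NOT OK: neither sliced (no k in the index)
    %                              nor a reduction -> "cannot be classified"
end
```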

5 Comments

Thanks - that solved the memory problem, but parfor is still not behaving right: the parfor loop actually takes longer than the regular for loop.
Both loops with for:end give:
Elapsed time is 16.866468 seconds.
Elapsed time is 13.860030 seconds.
ans =
DxDyOut = DxDyOut2 : 1
When I convert the second loop to parfor:end, I get:
Elapsed time is 16.195255 seconds.
Elapsed time is 25.619255 seconds.
ans =
DxDyOut = DxDyOut2 : 0
As you see, not only does it take longer, it also gives a different result. I already verified that osc/osc2 and mag/mag2 are identical. Below is the code I used to get the above:
tic
DxDyOut2 = zeros(sxyt, numFrames, 2, 'single');
osc2 = zeros(sxyt, numFrames, numParams, 'single');
mag2 = zeros(sxyt, numFrames, numParams, 'single');
for k = 1:numParams
    osc2(:,:,k) = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
    mag2(:,:,k) = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
    DxDyOut2(:,:,1) = DxDyOut2(:,:,1) + real( jonesx(k) .* osc2(:,:,k) .* mag2(:,:,k) );
    DxDyOut2(:,:,2) = DxDyOut2(:,:,2) + real( jonesy(k) .* osc2(:,:,k) .* mag2(:,:,k) );
end
toc
d2 = DxDyOut2(:,1,:);
clear osc2 mag2 DxDyOut2
% Slightly faster method - also parallel-computing friendly
A = zeros(sxyt, numFrames, 'single');
B = zeros(sxyt, numFrames, 'single');
osc = zeros(sxyt, numFrames, numParams, 'single');
mag = zeros(sxyt, numFrames, numParams, 'single');
DxDyOut = zeros(sxyt, numFrames, 2, 'single');
tic
for k = 1:numParams
%parfor k = 1:numParams
    osc(:,:,k) = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
    mag(:,:,k) = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
    A = A + real( jonesx(k) .* osc(:,:,k) .* mag(:,:,k) );
    B = B + real( jonesy(k) .* osc(:,:,k) .* mag(:,:,k) );
end
DxDyOut(:,:,1) = A;
DxDyOut(:,:,2) = B;
toc
d = DxDyOut(:,1,:);
['DxDyOut = DxDyOut2 : ' num2str(isequal(d, d2))]
You have neglected floating-point roundoff. Remember that in floating point, P+Q+R might not equal P+R+Q. parfor performs the additions of a reduction variable in an unspecified order that can vary dynamically (e.g., whichever worker's result is ready first). If you require bit-for-bit reproduction of the for-loop results, then parfor reduction variables are not an appropriate tool.
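The order sensitivity is easy to demonstrate without parfor; a minimal sketch in single precision (illustrative data, not from the code above):

```matlab
rng(0);                        % fixed seed so the run is repeatable
x = single(rand(1, 1e6));
s1 = sum(x);                   % one summation order
s2 = sum(x(end:-1:1));         % same values, reversed order
abs(s1 - s2)                   % may be nonzero: the final bits can differ
```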
parfor can take longer if there is not enough work per iteration, due to the overhead of creating and coordinating the tasks. One technique to reduce that overhead is to "unroll" the loop. For example, presuming numParams is even:
parfor j = 1:numParams/2   % parfor requires a unit-step range, so loop over pairs
    k = 2*j - 1;
    % (osc and mag are not stored here: a sliced write such as osc(:,:,k)
    % would need the loop variable itself as the index)
    osck = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
    magk = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
    om = osck .* magk;
    A1 = real( jonesx(k) .* om );
    B1 = real( jonesy(k) .* om );
    osck = exp((2.*pi.*f(k+1).*t + randomPhase(k+1)).*1i);
    magk = a(k+1) .* (k1(k+1) + k2(k+1).*exp(-t/(tau(k+1) + 0.0001)));
    om = osck .* magk;
    A = A + A1 + real( jonesx(k+1) .* om );
    B = B + B1 + real( jonesy(k+1) .* om );
end
drange() might also be a useful mechanism when each worker does not otherwise get enough work.
for k = drange(1:numParams)
    osc(:,:,k) = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
    mag(:,:,k) = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
    A = A + real( jonesx(k) .* osc(:,:,k) .* mag(:,:,k) );
    B = B + real( jonesy(k) .* osc(:,:,k) .* mag(:,:,k) );
end
This allocates chunks of k to workers, with each chunk done entirely by one worker. The order in which the chunks of individual k values are processed is unspecified, so this has the same limitation regarding round-off error.
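For completeness, drange is typically used inside an spmd block, where each worker loops over its own portion of the range and the per-worker partial sums are combined at the end. A hedged sketch for just the A accumulation (assumes the same variables as above; gplus combines the partial sums across workers):

```matlab
spmd
    Apart = zeros(sxyt, numFrames, 'single');   % per-worker partial sum
    for k = drange(1:numParams)
        osck = exp((2.*pi.*f(k).*t + randomPhase(k)).*1i);
        magk = a(k) .* (k1(k) + k2(k).*exp(-t/(tau(k) + 0.0001)));
        Apart = Apart + real( jonesx(k) .* osck .* magk );
    end
    Atotal = gplus(Apart, 1);                   % combine partial sums on worker 1
end
DxDyOut(:,:,1) = Atotal{1};                     % gather the Composite from worker 1
```

As with parfor reductions, the combination order across workers is unspecified, so the same round-off caveat applies.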
Thank you! I tried the "drange" suggestion. It helped, but not enough. On my laptop I have a dual-core i7 (4 logical processors with hyper-threading).
Below are the timings I get. Note that I have two forms of the for loop in my example above - your "A = A + ..." version is faster, and the other form doesn't work with parfor. So, in each case I list the timing for both forms.
for loop without parallel processing:
Elapsed time is 17.160106 seconds.
Elapsed time is 13.927942 seconds.
parfor loop w/ two workers:
Elapsed time is 16.593184 seconds.
Elapsed time is 29.215239 seconds.
for loop with "drange":
Elapsed time is 15.835708 seconds.
Elapsed time is 13.865574 seconds.
So, while "drange" helps, it does no better than the for loop without parallel processing.
Could you try it with double precision and see how the speed changes?

Asked on 12 May 2015
Commented on 15 May 2015
