Why is the performance of log() wildly different in two different contexts and on two machines for the same data?

I am writing code to fit 2D Gaussian functions to white spots on a black image background for a research project. I recently got a new 16 GB RAM 2020 MacBook Pro and was comparing its performance to my old 8 GB RAM 2020 MacBook Pro. When profiling my code on the 16 GB machine, I saw that calling the log() function 10,000 times as part of fminsearch took approximately 1.5 seconds per data point. This came as a surprise, because running the same code on the 8 GB machine takes approximately 0.01 seconds per data point. The net result is that the overall code takes approximately 6 times longer to run on the 16 GB machine, which is unacceptable for my purposes.
Next I checked the CPU load on both machines, reasoning that log() might not be parallelized for some reason on the 16 GB machine, but both machines have very similar CPU load while running the code. I also reinstalled MATLAB on the 16 GB machine with the same toolboxes to see if there was some issue in how I had transferred data from the 8 GB machine to the 16 GB machine, but that did not help.
Each data point is a 9 x 9 (edit: 7 x 7) array of doubles. To investigate further, I moved the offending code into a new script on the 16 GB machine and saved a single data point in the workspace. I also added a for loop beneath it that takes the log of the data in the matrix 10,000 times. This reproduced the results seen above. (Edit for clarity: I mean that taking the log of the data in the matrix in the for loop takes the expected 0.01 seconds. It is only in the optimization routine on the 16 GB machine that I'm seeing 1.5-second log() call runtimes.) So the performance of log() varies reproducibly even on the same machine.
The code in question is below. The installed toolboxes on both machines are identical, except that on the 8 GB machine I have installed the Symbolic Math and Computer Vision toolboxes, while on the 16 GB machine I have installed the Wavelet and Optimization toolboxes. Otherwise both machines have the Bioinformatics, Curve Fitting, Image Processing, and Stats/Machine Learning toolboxes. Both machines are running R2019a (I needed my code to be compatible with older versions of MATLAB and I haven't experienced any other issues so far). Might anyone know what the issue is, and how I could fix it?
Offending code:
x = 491;
y = 16;
circle_rad = 3;
peakdiam = 7;
mask = [15 12 13 13 14 13 11;
12 13 12 12 13 16 15;
15 16 15 14 16 17 13;
21 19 27 40 30 19 11;
15 16 26 31 19 19 14;
12 11 18 30 19 12 14;
13 14 15 14 13 13 13;]
mask = 7×7
    15    12    13    13    14    13    11
    12    13    12    12    13    16    15
    15    16    15    14    16    17    13
    21    19    27    40    30    19    11
    15    16    26    31    19    19    14
    12    11    18    30    19    12    14
    13    14    15    14    13    13    13
MLE_initial_guess = [circle_rad + 1, circle_rad + 1, peakdiam, sum(mask(:)), min(mask(:))];
neg_log_likelihood_fxn = @(params) NegLogFxn(params, mask, peakdiam) ;
options = optimset ('MaxFunEvals', 10000, 'MaxIter', 10000, 'TolFun', 1e-5);
MLE_output = fminsearch(neg_log_likelihood_fxn, MLE_initial_guess, options);
pixel_offset = [y - circle_rad - 1, x - circle_rad - 1]; % -1 corrects for the +1 offset in the initial guess
MLE_output(1:2) = MLE_output(1:2) + pixel_offset;
fprintf('Y-COORDINATE:\nMLE: %f \nX-COORDINATE:\nMLE: %f \n\n', MLE_output(1), MLE_output(2))
Y-COORDINATE:
MLE: 16.610009
X-COORDINATE:
MLE: 490.979523
NegLogFxn:
function neglog = NegLogFxn(params, data, array_size)
% Poisson negative log-likelihood of the data given the 2D Gaussian model,
% up to an additive constant: -sum(data .* log(model) - model)
tdg = twoD_Gaussian(params, array_size);
% disp(tdg)
% neglog = -tdg;
field_of_view = data;
tdg_log = log(tdg);
field_of_view = field_of_view .* tdg_log;
field_of_view = field_of_view - tdg;
neglog = -1 * sum(field_of_view(:));
end
twoD_Gaussian:
function psf = twoD_Gaussian(params, array_size)
% 2D Gaussian on an array_size-by-array_size grid:
% params = [y center, x center, width, total intensity, constant offset]
[x, y] = meshgrid(1:array_size, 1:array_size);
psf = params(4)/(2 * pi * params(3)^2);
psf = psf * exp(-((x - params(2)).^2 + (y - params(1)).^2)/(2 * params(3)^2));
psf = psf + params(5);
end
Thanks in advance!

17 Comments

"So the performance of log() varies reproducibly even on the same machine." - Then the description of the different machine is not useful.
Please explain, how you observe the difference in the processing speed.
"I also added a for loop beneath where I took the log of the data in the matrix 10,000 times. This reproduced the results seen above." - please post the corresponding code.
"Please explain, how you observe the difference in the processing speed."
I'm not entirely certain what you mean by this statement, but I measured the difference using MATLAB's profiling tool in all contexts.
"please post the corresponding code."
mask = [13 10 12 14 12 11 14 9 15;
11 15 12 13 13 14 13 11 14;
11 12 13 12 12 13 16 15 13;
13 15 16 15 14 16 17 13 14;
13 21 19 27 40 30 19 11 13;
14 15 16 26 31 19 19 14 12;
11 12 11 18 30 19 12 14 13;
12 13 14 15 14 13 13 13 12;
12 15 12 16 12 11 11 11 11]
for i = 1:10000
log(mask);
end
What timing results do you get from the following on the different machines?
tic
for i = 1:10000
log(mask);
end
toc
T=randi(1e6,1);
tic
log(T);
toc
On the 16 GB machine the first task took 0.011625 seconds and the second task took 0.001089 seconds. On the 8 GB machine the first task took 0.010853 seconds and the second task took 0.001135 seconds.
This seems to contradict what you said earlier. The loop timings look very similar on both machines. At the very least, there seems to be an inconsistency between what the profiler reports and what tic/toc reports.
Perhaps I wasn't specific enough in my original post. The only place that I'm seeing significantly increased time for the log() function is where it's being called in the optimization routine on the 16 GB machine. If I'm calling it by itself on the 16 GB machine, or if I'm calling it at all on the 8 GB machine, I am getting the expected performance.
My code is identical on both machines too so I have no idea what could be wrong.
Then I come back to my initial question: if you only see the difference in the processing speed of log() when it is called inside the function to be optimized, how do you know this? Why are you sure that it is the log() function that makes the difference?
It is possible that your observation has another cause. The output of the profiler is not as reliable as some tic/toc measurements: the profiler disables some optimizations of Matlab's JIT accelerator, e.g. the reordering of commands. In an expensive for loop, the profiler can claim that the closing end consumes a significant amount of time.
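A simple cross-check (a minimal sketch using the variables from your posted code) is to compare the time the profiler attributes to log() against the plain wall-clock time of the identical call:
profile on
fminsearch(neg_log_likelihood_fxn, MLE_initial_guess, options);
profile viewer   % note the time the profiler attributes to log() here
t = tic;
fminsearch(neg_log_likelihood_fxn, MLE_initial_guess, options);
toc(t)           % total wall-clock time of the same call without the profiler
If the total wall-clock time is much smaller than the time the profiler attributes to log() alone, the profiler's attribution is misleading.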
"reasoning that log() might not be parallelized" - please explain, why you assume this.
Currently I still assume, that the difference of the speed has another reason.
@Lakshay Sood Your fitting routine seems to be based on a Poisson image noise model. Since you have the Optimization Toolbox on your 16 GB machine, and if you are happy to accept a Gaussian noise model, at least temporarily, you could try gaussfitn.
If gaussfitn proves to be faster, this might at least let you progress better until the issue with log() is better understood.
Apologies for the delay in responding.
@Jan: The reason I am sure that the difference is in the processing speed of log() is because I saw that the profiler gives you the time it takes to run each line of code. The line
tdg_log = log(tdg);
took 1.5 seconds to run, according to the profiler, when I profiled the "Offending code" snippet posted above.
Acting on your suggestion regarding the profiler I decided to time the performance of log() using tic/toc by modifying my NegLogFxn:
function neglog = NegLogFxn(params, data, array_size)
tdg = twoD_Gaussian(params, array_size);
% disp(tdg)
% neglog = -tdg;
field_of_view = data;
tic
tdg_log = log(tdg);
toc
field_of_view = field_of_view .* tdg_log;
field_of_view = field_of_view - tdg;
neglog = -1 * sum(field_of_view(:));
end
Then I ran the "offending code" snippet again. Again I got 1.5 seconds for that line, using tic/toc.
Interestingly enough I tested the performance of log() on the same data point within the for loop that you requested the code for above. This was done with tic/toc as shown below:
tic
for i = 1:1:10000
log(mask);
end
toc
In this context, running log() 10,000 times takes only 0.008 seconds.
""reasoning that log() might not be parallelized" - please explain, why you assume this."
I saw a link to another MATLAB question (https://www.mathworks.com/matlabcentral/answers/53794-log-much-slower-in-parallel) talking about the parallelization of log() and thought that that might have something to do with it.
@Matt J: You are correct in assuming that my fitting routine is based on a Poisson noise model. Unfortunately I can't use a Gaussian model, as I need the theoretical minimum variance in my estimate. If we are unable to find a solution to this problem I may switch to the Gaussian model, but I would definitely prefer to use the Poisson model if possible.
It still might be worth using the solution from a Gaussian model to initialize the Poisson model optimization, in the interest of reducing the number of fminsearch iterations required. fminsearch is not built for speed. For that matter, it would be worth reimplementing the Poisson optimization with fmincon, with SpecifyObjectiveGradient set to true; a sketch is below.
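A rough sketch of what the fmincon version could look like (untested; NegLogFxnWithGrad is a hypothetical variant of NegLogFxn that also returns the gradient as a second output, and the bounds shown are one possible choice):
lb = [-Inf, -Inf, eps, eps, 0];   % keep width and amplitude positive, offset non-negative
opts = optimoptions('fmincon', 'SpecifyObjectiveGradient', true, ...
    'MaxFunctionEvaluations', 10000, 'MaxIterations', 10000);
MLE_output = fmincon(@(p) NegLogFxnWithGrad(p, mask, peakdiam), ...
    MLE_initial_guess, [], [], [], [], lb, [], [], opts);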
Hi Matt, I tried using fminunc for my purposes. I noticed a 2x speedup on the 16 GB machine relative to fminsearch, but the log() function is still taking far too long within the optimization routine relative to when I call it by itself outside of the routine.
The efficiency of
tic
for i = 1:1:10000
log(mask);
end
toc
can be caused by dead-code elimination in Matlab's JIT acceleration, because the result of log(mask) is never used. Try this:
function cuteTest
mask = rand(16, 16);
tic
for i = 1:10000
log(mask);
end
toc
mask = rand(16, 16);
tic
for i = 1:10000
mask(1) = mask(1) + 1;
q = log(mask);
end
toc
end
In Matlab R2018b the speed is equal, but modern Matlab versions might optimize the code more aggressively.
Have you checked how many iterations fminsearch is executing on each machine? More generally, have you compared the "output" argument returned on each machine,
[x,fval,exitflag,output] = fminsearch(___)
to see whether both give the same termination conditions? Maybe, due to some numerical subtlety, the optimization is simply running for more iterations, or doing more function evaluations, on the 16 GB machine.
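For example, on each machine (a sketch using the variables from the original post):
[p, fval, exitflag, output] = fminsearch(neg_log_likelihood_fxn, MLE_initial_guess, options);
fprintf('iterations: %d, funcCount: %d, exitflag: %d\n', ...
    output.iterations, output.funcCount, exitflag)
disp(output.message)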
In any case, could you post screenshots of your profiling results?
@Jan: The speed was equal for me in R2019a as well.
@Matt J: For the data point described in "Offending Code", both machines perform 10000 or 10001 iterations. I am including screenshots of my profiling results in this comment. Please ignore the differences in line numbers; this is just due to differences in commenting on both machines, but the code being run is identical.
@Lakshay Sood If you look again at your original post, you will see that I have edited it and applied the Run button to your code. The code throws an error message and does not run to completion. So, whatever test results you have been showing us, they are for different code. It would be helpful to have the actual code.
@Matt J Apologies for the issue. I accidentally defined the mask variable as a 9 x 9 array in my post rather than as a 7 x 7 array (another part of my code uses a 9 x 9 array and I got a bit mixed up). The code runs on my end now. I would provide the actual code, but as it is currently configured, it accepts files as input, and the files are over 60 megabytes in size so I don't think I'd be able to upload it. Regardless, it does run now.
After running my code on my datasets, I'm noticing that mask arrays with lots of 0s or negative numbers tend to give the 16 GB machine trouble. As an example, replacing the entire top row of the mask array in this question with 0 reproduced the problem I've been describing, but if I use another data point with all positive numbers, or even with some negative numbers, the problem disappears on the 16 GB machine. @Jan, I changed the second for loop of cuteTest to
mask = ones(16) * 16;
for i = 1:10000
    mask(1, :) = mask(1, :) - 1;   % the first row reaches 0 at i = 16 and goes negative at i = 17
    tic
    q = log(mask);
    elapsedTime = toc;
    if elapsedTime > 0.0001
        disp(i)
    end
end
As expected, the first number I reliably saw was i = 17, and I saw every i from 17 to 10,000. I hope this helps.
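A minimal standalone sketch that isolates the effect (timings will vary by machine; the key point is that a single negative entry makes log() return a complex result):
A = rand(7) + 1;             % strictly positive entries
B = A; B(1,1) = -B(1,1);     % one negative entry, so log(B) is complex
tic; for i = 1:10000, q = log(A); end; toc   % real-result path
tic; for i = 1:10000, q = log(B); end; toc   % complex-result path
% the result is assigned to q so the JIT cannot eliminate the call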


 Accepted Answer

I'm noticing that mask arrays with lots of 0s or negative numbers tend to be giving the 16 GB machine trouble.
Because your model is Poisson, you should require that your psf model take positive values only, which means that the additive term params(5) must be non-negative. If you were using fmincon, you could specify this constraint explicitly. With fminsearch you cannot apply constraints directly, but a workaround is to use a change of variables, such that params(5) becomes params(5)^2:
psf = psf + params(5)^2;
Naturally, you must modify MLE_output accordingly:
MLE_output = fminsearch(neg_log_likelihood_fxn, MLE_initial_guess, options);
MLE_output(5)=MLE_output(5)^2;
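Note also that, with this change of variables, the fifth entry of the initial guess should arguably be the square root of the old value, so that the starting offset is unchanged, e.g.:
MLE_initial_guess = [circle_rad + 1, circle_rad + 1, peakdiam, sum(mask(:)), sqrt(min(mask(:)))];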

3 Comments

Thank you so much! After making this modification on both the 8 GB and 16 GB machines, I saw drastic performance improvements on both, and the 16 GB machine now works even better than the 8 GB machine.
I'm still curious about why taking the log of a negative number takes so much longer on the 16 GB machine given that both laptops have M1 processors, but the immediate problem seems to be solved.
Glad it's working. Come to think of it, though, you have a similar problem with the multiplier parameter params(4). That can flip the sign of the psf as well. Not sure why fixing only params(5) made the difference...
params(4) is the sum of all of the values in the array, and those are all greater than or equal to 0, so I've never had to deal with an issue like that, to be honest. But I have noticed that for very poor fits params(4) comes out negative. Usually if the fit is that bad I just choose another start point and try to fit again.


More Answers (1)

Summary:
On the 16 GB M1 MacBook, the lines
tic
tdg_log = log(tdg);
toc
display that the processing needs 1.5 seconds. Using the same data, a loop with 10,000 iterations takes 0.008 seconds:
tic
for i = 1:1:10000
log(mask);
end
toc
This means that the 8 GB Intel MacBook is 1,875,000 times faster for this operation.
There are some reasonable ideas to explain this:
  • The Matlab version you use has a massive problem with the M1 CPU.
  • The profiler is completely confused. (But the total run time seems to exclude this option.)
  • You do not run the same code. Matt J found that the code posted here in the forum does not run at all but stops with an error.
A speed difference by a factor of almost 2 million is extremely unlikely and cannot be caused by a problem with multithreading. For tiny 9x9 matrices, starting multiple threads would be a waste of time in all cases, so perhaps the log() function does apply multithreading on the M1 chip, but this would be strange. The Matlab functions of 2019 do not include special adjustments for the M1 chip, which was introduced later.
By the way, the function twoD_Gaussian can be improved:
function psf = twoD_Gaussian2(params, array_size)
x = 1:array_size;
k = 2 * params(3)^2;
psf = params(4) / (k * pi) * ...
exp(-(x - params(2)).^2 / k) .* exp(-(x.' - params(1)).^2 / k) + ...
params(5);
end
Using meshgrid to create an NxN matrix requires calculating N*N expensive exp() values. Using the vectors instead, together with the relation exp(a+b) = exp(a)*exp(b), needs only 2*N exp() evaluations. For a 9x9 matrix this is 4 times faster, and the benefit grows quadratically with the array size.
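To verify the speedup on a specific machine, a quick sketch using timeit (the parameter values are arbitrary test values):
params = [4, 4, 7, 1000, 12];   % [y center, x center, width, total intensity, offset]
t1 = timeit(@() twoD_Gaussian(params, 9));
t2 = timeit(@() twoD_Gaussian2(params, 9));
fprintf('meshgrid version: %g s, vector version: %g s\n', t1, t2)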

2 Comments

Hi Jan,
Thanks for the advice regarding twoD_Gaussian. Unfortunately, both the 8 GB and 16 GB laptops have M1 processors, so I don't think the issue has to do with Intel vs M1.
I was confused by the names "8GB MacBook Pro 2020" and "16GB MacBook Pro M1".

