Testing SVD performance on M1
Hello Community,
Here is a script for testing M1 performance on an SVD problem (Parallel Computing Toolbox is required).
% This script evaluates the singular value decomposition (SVD) of random
% matrices of size 1 to N. It also detects the maximum number of threads and
% determines the parallel efficiency.
%% clear and define environment
clear all;delete(gcp('nocreate'));clc;
%% detect max threads
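core_info = evalc('feature(''numcores'')');
% NOTE: this reads one character at a fixed position of the printed text, so
% it assumes that output layout and a single-digit core count.
maxThreads = str2double(core_info(53));
disp(['maximum number of simultaneous threads: ',num2str(maxThreads)])
%% define lowest possible problem size N without a remainder for every thread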
N = smallest_multiple(max(maxThreads,8));
% increase N to around 1000 for differen
%if N < 1000
% N = ceil(1000/N)*N;
%end
disp(['problem size N = ',num2str(N)])
%% benchmark
result = kron(1:maxThreads,[1 0 0]')';
for k = 1:maxThreads
y = zeros(N,1);
myCluster = parcluster('local');
myCluster.NumWorkers = k; % 'Modified' property now TRUE
saveProfile(myCluster);
% evalc('parpool(''local'',k)');
tic
parfor n = 1:N
y(n) = max(svd(randn(n)));
end
result(k,2)=toc;
disp([' Problem solved with ',num2str(k),' of '...
,num2str(maxThreads),' threads has finished in ',num2str(result(k,2)),'s'])
evalc('delete(gcp(''nocreate''))');
end
%% present the results
clc;
result(:,3) = result(1,2)./(result(:,2).*result(:,1));
disp(' #threads |time in s|Efficiency')
disp(result)
%% alternative solution to smallest multiple
% https://de.mathworks.com/matlabcentral/answers/386271-write-a-function-called-smallest_multiple#answer_319994
function r = smallest_multiple(k)
r = 1;
for n = 1:k
r = r * (n / gcd(r,n));
end
end
Could anybody run the script and post the output?
I can't understand the bad performance:
#threads |time in s|Efficiency
1.0000 93.1989 1.0000
2.0000 80.0887 0.5818
3.0000 76.6610 0.4052
4.0000 87.3170 0.2668
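For readers who want to check the table by hand, here is a minimal sketch that recomputes the efficiency column from the posted times. It uses the same formula as the script, result(1,2)./(result(:,2).*result(:,1)), i.e. T(1)/(k*T(k)). The variable names t, k, speedup and efficiency are chosen for this illustration only.

```matlab
% Times in seconds for 1..4 workers, copied from the table above.
t = [93.1989 80.0887 76.6610 87.3170];
k = 1:numel(t);

speedup    = t(1) ./ t;      % T(1)/T(k)
efficiency = speedup ./ k;   % T(1)/(k*T(k)), same formula as result(:,3)

disp('  #threads |time in s|speedup|efficiency')
disp([k' t' speedup' efficiency'])
```

The speedup column makes the problem easier to see: with these times, no run is more than about 1.22 times faster than the single-worker run.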
5 Comments
Benny Hartwig
on 31 Jul 2021
Dear Marko,
I've run your script on a Mac mini M1 with 8 GB and observed similar numbers to yours. However, after altering a few lines of code I observed increased efficiency for this particular exercise. Specifically, I changed:
maxThreads = str2num(core_info(53))-4; % deduct number of low performance cores
N = smallest_multiple(max(maxThreads,8))*2; % increase the number of iterations
y(n) = max(svd(randn(500))); % fix the size of the random matrix
The most important change is the third one, fixing the size of the randomly generated matrix. I think the reason is that in the original loop the cost of an iteration grows with n, so the batches handed to the workers are quite unequal in terms of work.
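To make the imbalance argument concrete, here is a rough sketch under two assumptions that are not from the thread: the cost of svd(randn(n)) grows like n^3 (the usual order for a dense SVD), and the range 1:N is cut into k contiguous blocks of equal length. parfor does not necessarily divide the iterations this way, so read this as a worst-case illustration and not as a description of MATLAB's scheduler. N = 840 corresponds to smallest_multiple(8).

```matlab
N = 840;            % smallest_multiple(8), the problem size assumed here
k = 4;              % number of workers

cost   = (1:N).^3;                   % assumed relative cost of svd(randn(n))
blocks = reshape(cost, N/k, k);      % column j holds the j-th contiguous block
share  = sum(blocks, 1) / sum(cost); % fraction of the total work per block

disp(share)
bound = 1 / max(share);              % best possible speedup for this split
disp(bound)
```

Under these assumptions the last block carries about 68% of the total work, so the speedup could not exceed roughly 1.46 with four workers. With a fixed matrix size every iteration costs the same and this effect disappears.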
based on N = 1680
#threads |time in s|Efficiency
1.0000 61.9659 1.0000
2.0000 34.3137 0.9029
3.0000 26.9083 0.7676
4.0000 23.4078 0.6618
based on N = 3360
#threads |time in s|Efficiency
1.0000 111.3407 1.0000
2.0000 60.9948 0.9127
3.0000 45.1030 0.8229
4.0000 37.0086 0.7521
based on N = 8400
#threads |time in s|Efficiency
1.0000 263.5285 1.0000
2.0000 140.6530 0.9368
3.0000 105.1190 0.8357
4.0000 82.8724 0.7950
Best,
Benny
Marko
on 31 Jul 2021
Benny Hartwig
on 2 Aug 2021
Hi Marko,
Interesting, so there is quite a performance boost with the M1 machine. Did you manage to connect to all 8 cores in MATLAB? When I tried, parpool() did not start with more than four cores.
Best,
Benny
Benny Hartwig
on 2 Aug 2021
Hi Marko,
Thank you very much for the hint. I managed to connect the M1 Mini (8 GB) to all eight cores and ran the benchmark again. However, I needed to make some changes to the exercise to get a better understanding of the performance:
N = smallest_multiple(max(maxThreads,8))*1000*5 % increase the size of the loop
evalc('parpool(''local'',k)'); % start the pool before tic (otherwise the startup time dilutes the timing)
parfor n = 1:N*k % scale the loop by the number of threads so that every worker has to finish N jobs
y(n) = max(svd(randn(10))); % reduce the size of the random matrix to contain memory pressure
end
result(k,2)=toc/k; % divide total time by the number of workers
#threads |time in s|Efficiency
1.0000 33.1795 1.0000
2.0000 17.1001 0.9702
3.0000 11.8776 0.9312
4.0000 9.2032 0.9013
5.0000 8.4971 0.7810
6.0000 7.7836 0.7105
7.0000 7.2541 0.6534
8.0000 7.1321 0.5815

So it seems that the efficiency cores also improve the performance but are slower than the performance cores. Moreover, the memory pressure chart indicates that these efficiency numbers might be biased downward, because memory pressure turned yellow during the runs with 5 to 8 threads. With 1 to 4 threads, it was always green.
So it probably pays off to get the model with 16 GB of RAM, as the RAM consumption of the parfor loop increases quite strongly when you add more workers. Maybe this problem could be solved when MATLAB runs natively on the M1.
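As a rough way to read the table above, the following sketch turns the posted times into relative throughput and the gain contributed by each additional worker. It assumes the time column is toc/k for a loop of N*k iterations, as in the modified code, so t(1)./t is the number of jobs per second relative to one worker. The names t, throughput and gain are only for this illustration.

```matlab
% Time column (toc/k in seconds) for 1..8 workers, copied from the table above.
t = [33.1795 17.1001 11.8776 9.2032 8.4971 7.7836 7.2541 7.1321];

throughput = t(1) ./ t;            % jobs per second relative to one worker
gain       = diff([0 throughput]); % throughput added by the k-th worker

disp('  #threads |rel. throughput|gain of k-th worker')
disp([(1:numel(t))' throughput' gain'])
```

With these numbers, each of the first four workers adds roughly 0.8 to 1.0 of a single worker's throughput, while workers 5 to 8 each add well under 0.5. That fits the picture of slower efficiency cores, although the memory pressure mentioned above may account for part of the drop.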
Best,
Benny
Answers (0)