vectorization of a for loop with an if else statement and running over 4 indices

11 views (last 30 days)
I have a function that contains the for loop below, and that loop makes the function roughly 300 times slower to evaluate (200 times if I use parfor) than another implementation I have. I considered the following options so far: vectorization, GPU computation, or conversion to C++. I have an Intel Iris Xe Graphics GPU, so I guess GPU-based computation is out of question. This is unfortunate, because I think that could really help (perhaps in combination with vectorization) with those long algebraic operations (they are considerably longer than those I posted below). Conversion to C++ (perhaps via Matlab Coder, which I have installed) is perhaps too complicated and time-consuming at this point. That leaves me with vectorization.
A = zeros(size(B));
for u = 1:3
for v = 1:3
for i = 1:3
for j = 1:3
if u == 3 || v == 3
A(i,u,j,v) = 1000;
else
A(i,u,j,v) = (2*pi)*...
((((F(i,j)*r(1)^(u + v)/((x + y(1)^2/y)*z1(y(1)))) ...
+ (F(i,j)*r(2)^(u + v)/((x + y(2)^2/y)*z2(y(2))))) ...
+ (F(i,j)*(y*sqrt(2))^(u + v)/(2*sqrt(2)*f(y*sqrt(2)))))
end
end
end
end
end
But before coming to that, I think one of the starting points, should be to evaluate all constants currently in the loop before it starts. I just did that (updated code below) and could quite dramaticaly improve the performance to about 75 times the fastest implementation I have. Now, hopefully vectorization allows me to further decrease this computing time by roughly one order of magnitude, which would be more than enough for my current needs. However, I have very limited experience with code vectorization. Could someone in this forum help me vectorize what remains of the for loop initially in question?
C0 = 2*pi;
C1 = (x + y(1)^2/y)*z1(y(1));
C2 = (x + y(2)^2/y)*z2(y(2));
C3 = 2*sqrt(2)*f(y*sqrt(2));
C4 = y*sqrt(2);
A = zeros(size(B));
for u = 1:3
for v = 1:3
for i = 1:3
for j = 1:3
if u == 3 || v == 3
A(i,u,j,v) = 1000;
else
A(i,u,j,v) = (C0)*...
((((F(i,j)*r(1)^(u + v)/(C1)) ...
+ (F(i,j)*r(2)^(u + v)/(C2))) ...
+ (F(i,j)*(C4)^(u + v)/(C3)))
end
end
end
end
end
  6 Comments
Euclides
Euclides on 7 Feb 2024
Edited: Euclides on 8 Feb 2024
@Walter Roberson: thanks a lot! I assume the code cannot be further vectorized; so, I will immediately start implementing your proposed upgrades. That will take some time however, because "you put F(:,j) outside the brackets" and I have extensive algebraic operations, as I mentioned above, which involve several other F(:j)-like ocurrences. I'll get back to you with results as soon as possible.
Euclides
Euclides on 8 Feb 2024
Edited: Euclides on 8 Feb 2024
moving F(i,j) outside the brackets cuts the computing time from 75 to roughly 50 times the fastest implementation I have. However, I get an indexing error if I remove the i index loop and use A(:,u,j,v) and F(:,j). Not sure why, yet. Removing that loop could be critical for a more pronounced reduction in computing time.

Sign in to comment.

Accepted Answer

Matt J
Matt J on 8 Feb 2024
Edited: Matt J on 8 Feb 2024
Here's a vectorized solution. Note that the only reason the permute() operation (which is expensive) is needed is because your coordinates are A(i,u,j,v) instead of A(i,j,u,v). If you can reorganize the rest of your code to support the latter data ordering, you will save computation.
[I,J,U,V]=deal(1:3,1:3,1:2,1:2);
UplusV=reshape(U(:)+V(:).',1,[]);
A=C0.*F(:) .* (r(1).^UplusV./C1 + r(2).^UplusV./C2 + C4.^UplusV./C3);
A=reshape(A, numel(I), numel(J), numel(U), numel(V));
A=permute(A,[1,3,2,4]); %Try to eliminate this
A(:,3,:,:) = 1000;
A(:,:,:,3) = 1000;
  21 Comments
Euclides
Euclides on 12 Feb 2024
Edited: Euclides on 12 Feb 2024
@Matt J: so, after restarting and upgrading to Matlab 2023b the problem persisted. I checked line by line the output classes and finally understood what the problem was. I had mixed double and symbolic outputs. After converting the latter to double, the code finally ran smoothly. Even though the indices run from 1 to 2/3, I again considerably decreased the runtime by about 7 times in line with your results above, so overall this final version of the vectorized code yielded a 35 times reduction in runtime vis-a-vis the runtime of the initial for loop code. Fantastic! I will still prepare a GPU-based code and test it on a machine with a NVIDIA GeForce GPU, just to see how much that runtime improves. Then, at a some later point, I might as well test a C++ version of the code. I accepted your answer, but the actual answer (version 2 of the vectorized code) is given in one of your comments. Not sure, but perhaps you should update the code in the the original answer with that in the comments. Thank you so much for your help Matt!

Sign in to comment.

More Answers (0)

Categories

Find more on Creating, Deleting, and Querying Graphics Objects in Help Center and File Exchange

Products


Release

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!