You are now following this question
- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.
vectorization of a for loop with an if else statement and running over 4 indices
11 views (last 30 days)
Show older comments
Euclides
on 7 Feb 2024
I have a function that contains the for loop below, and that loop makes the function roughly 300 times slower to evaluate (200 times if I use parfor) than another implementation I have. I considered the following options so far: vectorization, GPU computation, or conversion to C++. I have an Intel Iris Xe Graphics GPU, so I guess GPU-based computation is out of question. This is unfortunate, because I think that could really help (perhaps in combination with vectorization) with those long algebraic operations (they are considerably longer than those I posted below). Conversion to C++ (perhaps via Matlab Coder, which I have installed) is perhaps too complicated and time-consuming at this point. That leaves me with vectorization.
A = zeros(size(B));
for u = 1:3
for v = 1:3
for i = 1:3
for j = 1:3
if u == 3 || v == 3
A(i,u,j,v) = 1000;
else
A(i,u,j,v) = (2*pi)*...
((((F(i,j)*r(1)^(u + v)/((x + y(1)^2/y)*z1(y(1)))) ...
+ (F(i,j)*r(2)^(u + v)/((x + y(2)^2/y)*z2(y(2))))) ...
+ (F(i,j)*(y*sqrt(2))^(u + v)/(2*sqrt(2)*f(y*sqrt(2)))))
end
end
end
end
end
But before coming to that, I think one of the starting points, should be to evaluate all constants currently in the loop before it starts. I just did that (updated code below) and could quite dramaticaly improve the performance to about 75 times the fastest implementation I have. Now, hopefully vectorization allows me to further decrease this computing time by roughly one order of magnitude, which would be more than enough for my current needs. However, I have very limited experience with code vectorization. Could someone in this forum help me vectorize what remains of the for loop initially in question?
C0 = 2*pi;
C1 = (x + y(1)^2/y)*z1(y(1));
C2 = (x + y(2)^2/y)*z2(y(2));
C3 = 2*sqrt(2)*f(y*sqrt(2));
C4 = y*sqrt(2);
A = zeros(size(B));
for u = 1:3
for v = 1:3
for i = 1:3
for j = 1:3
if u == 3 || v == 3
A(i,u,j,v) = 1000;
else
A(i,u,j,v) = (C0)*...
((((F(i,j)*r(1)^(u + v)/(C1)) ...
+ (F(i,j)*r(2)^(u + v)/(C2))) ...
+ (F(i,j)*(C4)^(u + v)/(C3)))
end
end
end
end
end
6 Comments
Walter Roberson
on 7 Feb 2024
Edited: Walter Roberson
on 7 Feb 2024
Reorder the loop indices so that the first index varies most quickly.
Move the test to outside of the loop.
C0 = 2*pi;
C1 = (x + y(1)^2/y)*z1(y(1));
C2 = (x + y(2)^2/y)*z2(y(2));
C3 = 2*sqrt(2)*f(y*sqrt(2));
C4 = y*sqrt(2);
A = zeros(size(B));
for v = 1:2
for j = 1:3
for u = 1:2
for i = 1:3
A(i,u,j,v) = (C0)*...
((((F(i,j)*r(1)^(u + v)/(C1)) ...
+ (F(i,j)*r(2)^(u + v)/(C2))) ...
+ (F(i,j)*(C4)^(u + v)/(C3)))
end
end
end
end
A(:,3,:,:) = 1000;
A(:,:,:,3) = 1000;
Euclides
on 7 Feb 2024
Edited: Euclides
on 7 Feb 2024
thank you @Walter Roberson. Skipping that if else statement was pretty obvious :-). I just tested that: no speed gains at all. What about the for loops? Any chance of vectorizing those?
Walter Roberson
on 7 Feb 2024
C0 = 2*pi;
C1 = (x + y(1)^2/y)*z1(y(1));
C2 = (x + y(2)^2/y)*z2(y(2));
C3 = 2*sqrt(2)*f(y*sqrt(2));
C4 = y*sqrt(2);
A = zeros(size(B));
for v = 1:2
for j = 1:3
for u = 1:2
A(:,u,j,v) = (C0) .* ...
F(:,j) .* ( r(1)^(u + v)/(C1) ...
+ r(2)^(u + v)/(C2) ...
+ (C4)^(u + v)/(C3) );
end
end
end
A(:,3,:,:) = 1000;
A(:,:,:,3) = 1000;
Walter Roberson
on 7 Feb 2024
C0 = 2*pi;
C1 = (x + y(1)^2/y)*z1(y(1));
C2 = (x + y(2)^2/y)*z2(y(2));
C3 = 2*sqrt(2)*f(y*sqrt(2));
C4 = y*sqrt(2);
A = zeros(size(B));
u = 1:2;
for v = 1:2
for j = 1:3
A(:,u,j,v) = (C0) .* ...
F(:,j) .* ( r(1).^(u + v)/(C1) ...
+ r(2).^(u + v)/(C2) ...
+ (C4).^(u + v)/(C3) );
end
end
A(:,3,:,:) = 1000;
A(:,:,:,3) = 1000;
Euclides
on 7 Feb 2024
Edited: Euclides
on 8 Feb 2024
@Walter Roberson: thanks a lot! I assume the code cannot be further vectorized; so, I will immediately start implementing your proposed upgrades. That will take some time however, because "you put F(:,j) outside the brackets" and I have extensive algebraic operations, as I mentioned above, which involve several other F(:j)-like ocurrences. I'll get back to you with results as soon as possible.
Euclides
on 8 Feb 2024
Edited: Euclides
on 8 Feb 2024
moving F(i,j) outside the brackets cuts the computing time from 75 to roughly 50 times the fastest implementation I have. However, I get an indexing error if I remove the i index loop and use A(:,u,j,v) and F(:,j). Not sure why, yet. Removing that loop could be critical for a more pronounced reduction in computing time.
Accepted Answer
Matt J
on 8 Feb 2024
Edited: Matt J
on 8 Feb 2024
Here's a vectorized solution. Note that the only reason the permute() operation (which is expensive) is needed is because your coordinates are A(i,u,j,v) instead of A(i,j,u,v). If you can reorganize the rest of your code to support the latter data ordering, you will save computation.
[I,J,U,V]=deal(1:3,1:3,1:2,1:2);
UplusV=reshape(U(:)+V(:).',1,[]);
A=C0.*F(:) .* (r(1).^UplusV./C1 + r(2).^UplusV./C2 + C4.^UplusV./C3);
A=reshape(A, numel(I), numel(J), numel(U), numel(V));
A=permute(A,[1,3,2,4]); %Try to eliminate this
A(:,3,:,:) = 1000;
A(:,:,:,3) = 1000;
21 Comments
Matt J
on 8 Feb 2024
Edited: Matt J
on 8 Feb 2024
Here's a slightly modified implementation using ndgridVecs, downloadable from
It avoids the permute() operation.
[I,U,J,V]=ndgridVecs(1:3,1:2,1:3,1:2);
UplusV=U+V;
F=C0*reshape(F,numel(I), 1, numel(J), 1);
A=F .* (r(1).^UplusV./C1 + r(2).^UplusV./C2 + C4.^UplusV./C3);
A(:,3,:,:) = 1000;
A(:,:,:,3) = 1000;
Euclides
on 8 Feb 2024
Edited: Euclides
on 8 Feb 2024
@Matt J: what I meant with vectorizing in terms of (i,u,j,v) was that in the first line of your first solution you have [I,J,U,V]=deal(1:3,1:3,1:2,1:2); so, I thought perhaps doing [i,u,j,v] = deal (1:3,1:2,1:3,1:2) and rearranging the subsequent code accordingly, would eventually eliminate the need for that permutation line. But this was speculative, as I am neither familiar with the deal function, nor with the whole vectorization logic. As I said above: I will need some time before I am able to discuss various vectorization approaches in detail.
Euclides
on 8 Feb 2024
Edited: Euclides
on 8 Feb 2024
@Matt J: regarding the NVIDIA GPU, I will need to be very careful with that component when I invest in my next piece of hardware, since Matlab is so instrumental in my line of work. I can of course also code in languages like Python, C++, or Julia, which may perform faster for some types of problems and use the various available interfaces to link source codes written in different languages, but that is not my highest priority in the short-term; and, perhaps as importantly, it is certainly good pratice to have complete flexibility in terms of a wide range of computational resources at disposal and GPU computation, in the framework of Matlab, is an approach to performance improvement that should definitely be always available.
Matt J
on 8 Feb 2024
Edited: Matt J
on 9 Feb 2024
The second version shows about a 7x speedup here online:
test()
Elapsed time is 0.278296 seconds.
Elapsed time is 0.180850 seconds.
Elapsed time is 0.284189 seconds.
function test
M=4000;
[C0,C1,C2,C3,C4,r, F]=deal(1,1,1,1,1,[1,1], rand(M));
tic; vecVersion1; toc
tic; vecVersion2; toc
tic; loopVersion; toc
function loopVersion
A = zeros(M,2,M,2);
for v = 1:2
for j = 1:M
for u = 1:2
A(:,u,j,v) = (C0) .* ...
F(:,j) .* ( r(1)^(u + v)/(C1) ...
+ r(2)^(u + v)/(C2) ...
+ (C4)^(u + v)/(C3) );
end
end
end
end
function vecVersion2
[I,U,J,V]=ndgridVecs(1:M,1:2,1:M,1:2);
UplusV=U+V;
F=C0*reshape(F,numel(I), 1, numel(J), 1);
A=F .* (r(1).^UplusV./C1 + r(2).^UplusV./C2 + C4.^UplusV./C3);
end
function vecVersion1
[I,J,U,V]=deal(1:M,1:M,1:2,1:2);
UplusV=reshape(U(:)+V(:).',1,[]);
A=C0.*F(:) .* (r(1).^UplusV./C1 + r(2).^UplusV./C2 + C4.^UplusV./C3);
A=reshape(A, numel(I), numel(J), numel(U), numel(V));
A=permute(A,[1,3,2,4]);
end
end
Euclides
on 9 Feb 2024
Edited: Euclides
on 9 Feb 2024
@Matt J: thank you once again for the help! I just implemented the first version of the vectorized code (VC). On average, the runtime is slightly better, but not a lot, possibly because of that permutation operation. I wanted to implement the second version of the VC, but your latest VC as a fixed M for all indices and u and v should only range from 1 to 2. Is it possible to add that restriction to that second version of the VC? Or is it too complicated? Just to have a rough idea of how much time I might need to solve the problem by myself.
Matt J
on 9 Feb 2024
Edited: Matt J
on 9 Feb 2024
No, you can call ndgridVecs with separate ranges for all 4 variables, as my comment here illustrates. However, if the ranges of u and v are bounded by 2 then the ranges for i and j would have to be pretty large to make the optimization effort worthwhile. Certainly if the ranges of all the variables are on the order of 2 or 3 like in your original example, vectorization is not going to help you.
Euclides
on 9 Feb 2024
@Matt J: I just read an older comment of yours containing a VC version without permutation and in which you start with [I,U,J,V]=ndgridVecs(1:3,1:2,1:3,1:2). That should be sufficient for a first test of the real impact of that permutation operation on the average runtime, which I will do immediately. But first I need to DL your (very recent!) ndgridVecs function (and cite it of course). Thanks!
Matt J
on 9 Feb 2024
Not me. If you look back to my timing tests (now modified with U=V=1:2) you will see that I get results for all 3 versions,
Matt J
on 9 Feb 2024
F is not the output of the code. A is. But I imagine what you are seeing is the result of the reshape() operation in version 2.
Euclides
on 9 Feb 2024
Edited: Euclides
on 9 Feb 2024
´@Matt J: sure A is the output, but I get an error with the array sizes. So, I was checking the structure of the F array (after reshaping). I literally copy pasted the version 2 code and it gives the error I mentioned above. No idea why. Here the output, line by line, with H = r(1).^UplusV (I shortened the expression after F.*- in A, just to see more easily why there is an array mismatch):
I =
1
2
3
U =
1 2
J(:,:,1) =
1
J(:,:,2) =
2
J(:,:,3) =
3
V(:,:,1,1) =
1
V(:,:,1,2) =
2
UplusV(:,:,1,1) =
2 3
UplusV(:,:,1,2) =
3 4
F(:,:,1) =
component 1 (this an all other components henceforth are scalars)
component 2
component 3
F(:,:,2) =
component 1
component 2
component 3
F(:,:,3) =
component 1
component 2
component 3
H(:,:,1,1) =
[component 1, component 2]
H(:,:,1,2) =
[component 1, component 2]
Matt J
on 9 Feb 2024
Edited: Matt J
on 9 Feb 2024
Notice above that I ran my proposed code for you so you could see it work directly. I ran it here online, in the forum, rather than on my local computer. I suggest you do the same for us to demonstrate the problem you're seeing.
But it seems pretty obvious to me why you are seeing what you describe. Because of the way you reshaped F in this line,
F=C0*reshape(F,numel(I), 1, numel(J), 1);
it means that F(:,:,i) will be an Nx1x1x1 matrix for any fixed i. Equivalently, it will be an Nx1 vector. This is exactly consistent with what you describe in your last comment.
Euclides
on 10 Feb 2024
Edited: Euclides
on 10 Feb 2024
just ran version 2 on my Matlab Mobile and had no problems. Need to check more carefully, but looks like H, as I defined above, yields a row cell with rectangular brackets on my notebook, while on my mobile phone I get a regular row vector without brackets. That could be the problem. Perhaps, I should upgrade to Matlab 2023 a or b, not sure what the latest release is..............
Matt J
on 10 Feb 2024
There shouldn't be any difference in H on any version higher than R2016b (lower versions should give you an error). The size of U+V, and therefore of H, can be verified right here,
M=5;
[I,U,J,V]=ndgridVecs(1:M,1:2,1:M,1:2);
size(U+V)
ans = 1×4
1 2 1 2
Euclides
on 10 Feb 2024
Edited: Euclides
on 10 Feb 2024
it is for sure a problem with those rectangular brackets. you have r(1) = 1 in your code. if I do 1.^UplusV instead of r(1).^UplusV, on my local machine, I get the same results for H but without rectangular brackets and the code runs smoothly. I will try to first restart the notebook (I experienced a few times erratic Matlab behavior [incorrect calculations, etc.] that completely vanished after restarting) and if the problem persists, upgrade to the latest Matlab version. If, after these actions, the problem still persists I need to understand how to get rid of those brackets in the H output
Euclides
on 12 Feb 2024
Edited: Euclides
on 12 Feb 2024
@Matt J: so, after restarting and upgrading to Matlab 2023b the problem persisted. I checked line by line the output classes and finally understood what the problem was. I had mixed double and symbolic outputs. After converting the latter to double, the code finally ran smoothly. Even though the indices run from 1 to 2/3, I again considerably decreased the runtime by about 7 times in line with your results above, so overall this final version of the vectorized code yielded a 35 times reduction in runtime vis-a-vis the runtime of the initial for loop code. Fantastic! I will still prepare a GPU-based code and test it on a machine with a NVIDIA GeForce GPU, just to see how much that runtime improves. Then, at a some later point, I might as well test a C++ version of the code. I accepted your answer, but the actual answer (version 2 of the vectorized code) is given in one of your comments. Not sure, but perhaps you should update the code in the the original answer with that in the comments. Thank you so much for your help Matt!
More Answers (0)
See Also
Categories
Find more on Creating, Deleting, and Querying Graphics Objects in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!An Error Occurred
Unable to complete the action because of changes made to the page. Reload the page to see its updated state.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom(English)
Asia Pacific
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)