Sudden drop in speed for large matrices. RAM full -> Prevention?

Dear all,
I am doing image analysis with a toolbox called "PIVlab". At work, we recently bought a new camera with 25 MP resolution (previously only 2 MP). That is very nice, but it also means that the sizes of several matrices increase substantially, and suddenly memory management becomes very important for keeping the processing speed high. I identified the operation that seems to slow the code down with very large matrices, and wrote a minimal code example to reproduce the behaviour:
clear all %#ok<CLALL>
close all
clc
% Usually, in PIV, we have input image A and input image B which are
% captured with a short pause in between. These images are then cut into
% small pieces of e.g. 64x64 pixels. Each of these "interrogation areas"
% in image A is then cross-correlated with the same part from image B.
% With the cross-correlation code in PIVlab, it is possible to do this for
% a 3D matrix (saves a lot of time, but apparently runs into RAM problems
% when matrices are large).
counter=0;
stack_sizes=1000:10000:200000 ;%these numbers are fine to demonstrate the effect on a 16 GB RAM laptop
for size_of_the_stack=stack_sizes
%% generate some quick + dirty "particle" image pairs, arranged in a 3D matrix
A=rand(64,64,size_of_the_stack);
A(A<0.98)=0;
A = imgaussfilt(A,0.9);
B = circshift(A,round(rand([1,1])*10)); %displace the second image
%% Quick fix: converting to single already saves 50% of memory, apparently without negative effects.
A=single(A);
B=single(B);
%% do the cross-correlation with FFT
result_conv = zeros(size(A),'single'); %same starting conditions for all repetitions of the loop %#ok<PREALL> %
tic
% The following line of code eats up all the RAM
result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2); %cross-correlation code in PIVlab
counter=counter+1
calc_time(counter)=toc;
time_per_subimage(counter)=calc_time(counter)/size_of_the_stack;
end
%% plot results
bar(stack_sizes,time_per_subimage*1000)
grid on;
xlabel('stack size')
ylabel('time per correlation in ms')
These are my results. The speed suddenly drops.
A quick look at the task manager shows that the speed drops at the moment when the cross-correlation consumes all the RAM and virtual memory has to be used (indicated by a sudden increase in SSD activity).
It is obvious that this line of code is the cause for the drop in speed with large matrices:
result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2);
How could I prevent the RAM from filling up and the speed from dropping? Is there a way to anticipate the problem, and then maybe split the arrays into smaller chunks, saving them temporarily to the hard disk...?
The problem is slightly more complicated in the real application, because there is a lot of other processing happening, and everything runs in parallel with the Parallel Computing Toolbox... But finding a solution for the above minimal example might already put me on the right track... Thanks for your input!

13 Comments

Is it worth mentioning that installing more RAM would be the most direct way to handle this problem?
Installing more RAM is not possible on the laptop that we bought for processing PIV data; the RAM is soldered to the mainboard. I chose a laptop with a fast CPU based on my past experience with PIVlab, but the CPU is not really stressed when running the code in parallel, because RAM is the bottleneck :( So the only option to improve the performance (other than spending 2000 Euros on a new laptop with more RAM) is to improve the code...
Does it help to try to reuse the memory?
% result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2);
B(:) = fft2(B);
B(:) = conj(fft2(A)) .* B;
B(:) = ifft2(B);
result_conv = fftshift(real(B), 1);
result_conv(:) = fftshift(result_conv, 2);
@Jan: Your suggestion increases the time for smaller matrices, but has little effect on larger matrices...
When I quickly split the cross-correlation operation into two pieces (hopefully done correctly...), everything seems to run a bit faster...? So the speed improvement from vectorization seems to have a maximum matrix size? Or am I just doing something wrong...
clear all %#ok<CLALL>
close all
clc
% Usually, in PIV, we have input image A and input image B which are
% captured with a short pause in between. These images are then cut into
% small pieces of e.g. 64x64 pixels. Each of these "interrogation areas"
% in image A is then cross-correlated with the same part from image B.
% With the cross-correlation code in PIVlab, it is possible to do this for
% a 3D matrix (saves a lot of time, but apparently runs into RAM problems
% when matrices are large).
counter=0;
stack_sizes=1000:10000:200000 ;%these numbers are fine to demonstrate the effect on a 16 GB RAM laptop
for size_of_the_stack=stack_sizes
%% generate some quick + dirty "particle" image pairs, arranged in a 3D matrix
A=rand(64,64,size_of_the_stack);
A(A<0.98)=0;
A = imgaussfilt(A,0.9);
B = circshift(A,round(rand([1,1])*10)); %displace the second image
%% converting to single already saves 50% of memory, apparently without negative effects.
A=single(A);
B=single(B);
bla=memory;
%% do the cross-correlation with FFT
result_conv_1_of_2 = zeros([size(A,1:2) size_of_the_stack/2],'single'); %same starting conditions for all repetitions of the loop %#ok<PREALL> %
result_conv_2_of_2 = result_conv_1_of_2; %same starting conditions for all repetitions of the loop %#ok<PREALL> %
tic
result_conv_1_of_2 = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,1:size_of_the_stack/2))).*fft2(B(:,:,1:size_of_the_stack/2)))), 1), 2); %cross-correlation code in PIVlab
result_conv_2_of_2 = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,size_of_the_stack/2+1:end))).*fft2(B(:,:,size_of_the_stack/2+1:end)))), 1), 2); %cross-correlation code in PIVlab
%result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2); %cross-correlation code in PIVlab
counter=counter+1
calc_time(counter)=toc;
time_per_subimage(counter)=calc_time(counter)/size_of_the_stack;
avail_mem(counter)= bla.MemAvailableAllArrays;
end
%% plot results
bar(stack_sizes,time_per_subimage*1000)
grid on;
xlabel('stack size')
ylabel('time per correlation in ms')
... and if I split it into 4 parts, then it becomes even better?
%% do the cross-correlation with FFT
result_conv_1_of_4 = zeros([size(A,1:2) size_of_the_stack/4],'single'); %same starting conditions for all repetitions of the loop %#ok<PREALL> %
result_conv_2_of_4 = result_conv_1_of_4; %same starting conditions for all repetitions of the loop %#ok<PREALL> %
result_conv_3_of_4 = result_conv_1_of_4; %same starting conditions for all repetitions of the loop %#ok<PREALL> %
result_conv_4_of_4 = result_conv_1_of_4; %same starting conditions for all repetitions of the loop %#ok<PREALL> %
tic
result_conv_1_of_4 = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,1:size_of_the_stack/4))).*fft2(B(:,:,1:size_of_the_stack/4)))), 1), 2); %cross-correlation code in PIVlab
result_conv_2_of_4 = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,size_of_the_stack/4+1:size_of_the_stack/4*2))).*fft2(B(:,:,size_of_the_stack/4+1:size_of_the_stack/4*2)))), 1), 2); %cross-correlation code in PIVlab
result_conv_3_of_4 = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,size_of_the_stack/4*2+1:size_of_the_stack/4*3))).*fft2(B(:,:,size_of_the_stack/4*2+1:size_of_the_stack/4*3)))), 1), 2); %cross-correlation code in PIVlab
result_conv_4_of_4 = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,size_of_the_stack/4*3+1:end))).*fft2(B(:,:,size_of_the_stack/4*3+1:end)))), 1), 2); %cross-correlation code in PIVlab
%result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2); %cross-correlation code in PIVlab
counter=counter+1
calc_time(counter)=toc;
Another test with the modified code and larger stack sizes shows that the problem has only been moved to larger matrices. But I think this approach might have some potential... The larger the input matrices (A & B) are, the more they should be split and processed in series to avoid problems. But how to decide when to split the matrices and when not to?
Is it true that the speed benefit from vectorization is limited by RAM and array size? I still think I am overlooking something X-)
Since you only do the cross-correlation with fft2, you shouldn't "obviously" gain much "except calling-overhead" by processing the entire stack?
@Bjorn Gustavsson Thanks for the input. I tested it and there is some speed difference (probably explained by the "calling-overhead"). Everything is slightly slower; I am running tests for larger matrices now.
%vectorized
result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2); %cross-correlation code in PIVlab
% for loop
for i=1:size(A,3)
result_conv(:,:,i) = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,i))).*fft2(B(:,:,i)))), 1), 2); %cross-correlation code in PIVlab
end
(Now I'm outside my competence-zone - typically I'm happy enough when my code slogs through, and patient enough to wait for my results; I don't work with production-line applications where good/optimal efficiency is imperative.)
From here you might be happy enough to make these types of timing investigations and build some kind of look-up table for the best chunking of the image stacks, so that your function can adapt in some clever manner.
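The chunking idea could also be sketched generically: instead of hard-coding a two-way or four-way split, process the stack in fixed-size slices along the third dimension. This is only a sketch; "chunk_size" is an assumed tuning parameter (to be chosen per machine, e.g. from such timing investigations), not something PIVlab defines.

```matlab
% Sketch: chunked cross-correlation to bound peak memory use.
% chunk_size is an assumed, machine-dependent tuning parameter.
chunk_size = 20000;                     % images per chunk (tune to RAM)
n = size(A,3);
result_conv = zeros(size(A),'single');  % preallocate the full output once
for start_idx = 1:chunk_size:n
    idx = start_idx : min(start_idx+chunk_size-1, n);
    % same PIVlab correlation line, applied to one slice of the stack
    result_conv(:,:,idx) = fftshift(fftshift(real(ifft2( ...
        conj(fft2(A(:,:,idx))).*fft2(B(:,:,idx)))), 1), 2);
end
```

With this structure, the temporary complex arrays created by fft2/ifft2 only ever have chunk_size images, regardless of how large the full stack is.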
Yes, I am currently evaluating this code portion (still taking some time to complete...):
if numel(A)>577536000 %large matrices become slow due to limited RAM, serial processing is faster
disp('serial')
for i=1:size(A,3)
result_conv(:,:,i) = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,i))).*fft2(B(:,:,i)))), 1), 2); %cross-correlation code in PIVlab
end
else %smaller matrices benefit from parallelization
disp('vectorized')
result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2); %cross-correlation code in PIVlab
end
The number "577536000" was found by trial and error. It most likely depends on the available RAM, and maybe also on the MATLAB version (where the functions that I use might use RAM differently). It would be cool to find this number automatically somehow, but I currently have no clue how to do this efficiently.
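One way to avoid the hard-coded number might be to derive the threshold from the currently available RAM, as already queried via memory (Windows-only) earlier in the test script. This is a hypothetical sketch: "overhead_factor" is an assumed rough estimate of how many same-sized temporaries the correlation line needs at its peak (fft2 outputs are complex, i.e. twice the size of the single input, and several intermediates coexist), not a measured value.

```matlab
% Sketch: choose serial vs. vectorized path from available RAM.
% overhead_factor is an assumed estimate of peak temporary usage.
mem = memory;                       % Windows-only MATLAB call
bytes_per_element = 4;              % single precision
overhead_factor = 8;                % complex FFT temporaries (assumed)
max_elements = mem.MemAvailableAllArrays / (bytes_per_element*overhead_factor);
if numel(A) > max_elements
    disp('serial')                  % fall back to the per-image loop
else
    disp('vectorized')              % whole-stack correlation fits in RAM
end
```

Calibrating overhead_factor once (e.g. against the empirically found 577536000 on a 16 GB machine) would let the same code adapt to other machines and MATLAB versions.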
Ok, so this seems to work, now I have the best of both worlds: Fast speed for smaller matrices due to vectorization, and still ok performance for really large matrices with a for-loop:
clear all %#ok<CLALL>
close all
clc
% Usually, in PIV, we have input image A and input image B which are
% captured with a short pause in between. These images are then cut into
% small pieces of e.g. 64x64 pixels. Each of these "interrogation areas"
% in image A is then cross-correlated with the same part from image B.
% With the cross-correlation code in PIVlab, it is possible to to this for
% a 3D matrix (saves a lot of time, but apparently runs into RAM problems
% when matrices are large).
counter=0;
stack_sizes=1000:20000:400000 ;%these numbers are fine to demonstrate the effect on a 16 GB RAM laptop
for size_of_the_stack=stack_sizes
%% generate some quick + dirty "particle" image pairs, arranged in a 3D matrix
A=rand(64,64,size_of_the_stack,'single');
B=A;
%% do the cross-correlation with FFT
result_conv = zeros(size(A),'single'); %same starting conditions for all repetitions of the loop %#ok<PREALL> %
tic
% The following line of code eats up all the RAM
if numel(A)>577536000 %large matrices become slow due to limited RAM, serial processing is faster
disp('serial')
for i=1:size(A,3)
result_conv(:,:,i) = fftshift(fftshift(real(ifft2(conj(fft2(A(:,:,i))).*fft2(B(:,:,i)))), 1), 2); %cross-correlation code in PIVlab
end
else %smaller matrices benefit from parallelization
disp('vectorized')
result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2); %cross-correlation code in PIVlab
end
counter=counter+1
calc_time(counter)=toc;
time_per_subimage(counter)=calc_time(counter)/size_of_the_stack;
end
%% plot results
bar(stack_sizes,time_per_subimage*1000)
grid on;
xlabel('stack size')
ylabel('time per correlation in ms')
You might cut off some time for the smaller cases if you move the pre-allocation inside the "large-stack" part of the if-condition (you might even remove it if you can run the loop from the last image in the stack to the first). When you assign the output from the fftshift(...) correlation calculation in the small-stack case, it doesn't utilize the pre-allocation (if I've understood things properly...).
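The suggestion above could look roughly like this: preallocation only happens in the loop branch, since the vectorized branch overwrites result_conv wholesale anyway ("threshold" stands for the trial-and-error number from earlier).

```matlab
% Sketch: preallocate only where in-place slice assignment needs it.
if numel(A) > threshold                      % serial path for large stacks
    result_conv = zeros(size(A),'single');   % needed for indexed assignment
    for i = 1:size(A,3)
        result_conv(:,:,i) = fftshift(fftshift(real(ifft2( ...
            conj(fft2(A(:,:,i))).*fft2(B(:,:,i)))), 1), 2);
    end
else                                         % vectorized path
    % full assignment allocates its own array; preallocation is wasted here
    result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2);
end
```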


Answers (0)

Release: R2019b

Asked: on 14 Sep 2021

Commented: on 15 Sep 2021
