PARFOR: Broadcast variable
Hi, I'm starting to use "parfor" in my MATLAB scripts.
In this simple example, for each iteration, I get a 2D 'sub_matrix' from a 2D big matrix (image).
But I got this warning message: "The entire array is a broadcast variable. This might result in unnecessary communication overhead."
% For each pixel in the 'image':
parfor row = 1:N
    for col = 1:M
        % Crop a sub-matrix from the original image:
        sub_image = image(row:row+H-1, col:col+W-1); % WARNING HERE!
        % Do other stuff...
    end
end
How can I modify this code to avoid this communication overhead?
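For reference, one common way to reduce this kind of broadcast traffic is parallel.pool.Constant, which copies the data to each worker once instead of re-sending it. A minimal sketch, assuming the variable names img, N, M, H, and W from the question (the processing step is a placeholder):

```matlab
% Copy the big image to each worker once (requires Parallel Computing Toolbox).
imgConst = parallel.pool.Constant(img);

parfor row = 1:N
    local = imgConst.Value;  % per-worker copy, not re-broadcast each iteration
    for col = 1:M
        sub_image = local(row:row+H-1, col:col+W-1);
        % ... process sub_image ...
    end
end
```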
5 Comments
Matt J
on 13 Jan 2020
I doubt you need to. I can't imagine that the broadcast of a single image is going to be a big bottleneck. But if you show more of the computation, we might be able to recommend further optimization.
Nycholas Maia
on 14 Jan 2020
I have an NVIDIA GeForce GTX 1070 in my machine with 2000 CUDA cores. I would like to know if this means that I could calculate 2000 'surf' points at the same time using the GPU power.
Matt J
on 14 Jan 2020
No, parfor does not use GPU cores in any way.
Walter Roberson
on 14 Jan 2020
The Parallel Computing Toolbox is the same toolbox that handles parfor and handles GPU, but they work in very different ways.
GPU performance is much reduced by indexing, and only really wins out when you have operations that can be vectorized over an entire array. For your code, that would mean continually creating a new gpuArray for each subimage, and then what the GPU would accelerate would be the processing over that subimage. But that would still involve a lot of memory transfer.
You might be thinking that you could send the entire large image to GPU and create the subimages on there, but the computation engines work a bit oddly.
Computation on an NVIDIA device is divided up into compute controllers. Each compute controller can be executing a different set of instructions than the other compute controllers are executing. Each compute controller is responsible for a number of computation cores: the controller decodes an instruction and sends that same instruction to each of the computation cores under its control, and each core then executes it on its own data. Conditional execution does not work by having different cores execute different instructions. Instead, a mask is created, one entry per compute core, and any core for which the mask is not true idles instead of executing the instruction.
So for example,
subimage = image(1:15, 1:20)
would involve implicitly creating a mask the size of image that is true for positions in the 15-by-20 upper-left corner. A transfer instruction would then be executed: the compute cores with a true mask would perform the transfer, and the other cores would idle for that instruction. Compute cores for the entire array are involved, with most of them idling.
This is very different from CPU programming, where it is almost always more efficient to restrict your computation to only the locations that need to be processed. On a GPU, you would rather have entire arrays processed unconditionally, so that you do not waste compute cores.
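The contrast Walter describes can be sketched as follows, assuming the Parallel Computing Toolbox; A is a hypothetical image, and the per-block computation is a placeholder:

```matlab
A = gpuArray(rand(2048));      % send the whole image to the GPU once

% GPU-friendly: one vectorized operation over the entire array,
% keeping all compute cores busy.
B = sqrt(A) + 2*A;

% GPU-unfriendly: repeatedly slicing out small sub-blocks triggers the
% masked execution described above, with most cores idling each time.
for k = 1:100
    sub = A(k:k+14, k:k+19);   % a 15-by-20 window, as in the example above
    % ... small per-block computation ...
end

result = gather(B);            % bring the result back to host memory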
Walter Roberson
on 14 Jan 2020
I would point out, by the way, that if you were to replicate your template several times in each direction, you could process a correspondingly sized chunk of the image. You would still need to shift the window around, but you would be doing more work in each chunk -- better vectorization.
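Walter's chunking idea could look roughly like this. A sketch only: blockRows, the strip layout, and the processing step are assumptions, not code from this thread, and img is still sent to each worker, but each iteration now does more work per transfer:

```matlab
blockRows = 64;                          % hypothetical strip height
parfor b = 1:ceil(N / blockRows)
    r0 = (b-1)*blockRows + 1;
    r1 = min(b*blockRows + H - 1, size(img, 1));
    strip = img(r0:r1, :);               % each iteration takes one strip
    for row = 1:(r1 - r0 - H + 2)        % windows of height H that fit the strip
        for col = 1:M
            sub_image = strip(row:row+H-1, col:col+W-1);
            % ... process sub_image ...
        end
    end
end
```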
Accepted Answer
More Answers (1)
The code looks suspiciously like an attempt at weighted normalized cross-correlation. If so, see here for a possibly faster alternative to parfor.