PARFOR: Broadcast variable
Hi, I'm starting to use "parfor" in my MATLAB scripts.
In this simple example, for each iteration, I get a 2D 'sub_matrix' from a 2D big matrix (image).
But I got this warning message: "The entire array is a broadcast variable. This might result in unnecessary communication overhead."
% For each pixel in the 'image':
parfor row = 1:N
    for col = 1:M
        % Crop a sub-matrix from the original image:
        sub_image = image(row:row+H-1, col:col+W-1); % WARNING HERE!
        % Do other stuff...
    end
end
How can I modify this code to avoid this communication overhead?
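For reference, one common way to reduce this kind of broadcast traffic is parallel.pool.Constant, which copies the data to each worker once instead of re-sending it. A minimal sketch, assuming the variable names img, N, M, H, and W from the question (the processing step is a placeholder):

```matlab
% Copy the big image to each worker once (requires Parallel Computing Toolbox).
imgConst = parallel.pool.Constant(img);

parfor row = 1:N
    local = imgConst.Value;  % per-worker copy, not re-broadcast each iteration
    for col = 1:M
        sub_image = local(row:row+H-1, col:col+W-1);
        % ... process sub_image ...
    end
end
```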
5 Comments
Matt J
on 13 Jan 2020
I doubt you need to. I can't imagine that the broadcast of a single image is going to be a big bottleneck. But if you show more of the computation, we might be able to recommend further optimization.
Nycholas Maia
on 14 Jan 2020
I have an NVIDIA GeForce GTX 1070 in my machine with 2000 CUDA cores. I would like to know if this means that I could calculate 2000 'surf' points at the same time using the GPU power.
Matt J
on 14 Jan 2020
No, parfor does not use GPU cores in any way.
Walter Roberson
on 14 Jan 2020
The Parallel Computing Toolbox is the same toolbox that handles parfor and handles GPU, but they work in very different ways.
GPU performance is much reduced by indexing, and only really wins out when you have operations that can be vectorized over an entire array. For your code, that would mean continually creating a new gpuArray for each subimage, and then what the GPU would accelerate would be the processing over that subimage. But that would still involve a lot of memory transfer.
You might be thinking that you could send the entire large image to GPU and create the subimages on there, but the computation engines work a bit oddly.
Computation on an NVIDIA device is divided up into compute controllers. Each compute controller can be executing a different set of instructions than the other compute controllers are executing. Each compute controller is responsible for a number of computation cores: the controller decodes an instruction and sends that same instruction to each of the computation cores under its control, and each core then executes it on its own data. Conditional execution does not work by having different cores execute different instructions. Instead, a mask is created, one entry per compute core, and any core for which the mask is not true idles instead of executing the instruction.
So for example,
subimage = image(1:15, 1:20)
would involve implicitly creating a mask the size of image that is true for positions in the 15-by-20 upper-left corner. A transfer instruction would then be executed: the compute cores with a true mask would perform the transfer, and the other cores would idle for that instruction. Compute cores for the entire array are involved, with most of them idling.
This is very different from CPU programming, where it is almost always more efficient to restrict your computation to only the locations that need to be processed. On a GPU, you would rather have entire arrays processed unconditionally, so that you do not waste compute cores.
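The contrast Walter describes can be sketched as follows, assuming the Parallel Computing Toolbox; A is a hypothetical image, and the per-block computation is a placeholder:

```matlab
A = gpuArray(rand(2048));      % send the whole image to the GPU once

% GPU-friendly: one vectorized operation over the entire array,
% keeping all compute cores busy.
B = sqrt(A) + 2*A;

% GPU-unfriendly: repeatedly slicing out small sub-blocks triggers the
% masked execution described above, with most cores idling each time.
for k = 1:100
    sub = A(k:k+14, k:k+19);   % a 15-by-20 window, as in the example above
    % ... small per-block computation ...
end

result = gather(B);            % bring the result back to host memory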
Walter Roberson
on 14 Jan 2020
I would point out, by the way, that if you were to replicate your template several times in each direction, you could process a correspondingly sized chunk of the image. You would still need to shift the window around, but you would be doing more work in each chunk -- better vectorization.
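Walter's chunking idea could look roughly like this. A sketch only: blockRows, the strip layout, and the processing step are assumptions, not code from this thread, and img is still sent to each worker, but each iteration now does more work per transfer:

```matlab
blockRows = 64;                          % hypothetical strip height
parfor b = 1:ceil(N / blockRows)
    r0 = (b-1)*blockRows + 1;
    r1 = min(b*blockRows + H - 1, size(img, 1));
    strip = img(r0:r1, :);               % each iteration takes one strip
    for row = 1:(r1 - r0 - H + 2)        % windows of height H that fit the strip
        for col = 1:M
            sub_image = strip(row:row+H-1, col:col+W-1);
            % ... process sub_image ...
        end
    end
end
```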
Accepted Answer
More Answers (1)
The code looks suspiciously like an attempt at weighted normalized cross-correlation. If so, see here for a possibly faster alternative to parfor.