Hello, I wrote code which requires hundreds of thousands of calls to convn. According to the profiler, convn takes over 80% of my code's run time!
I'm trying to get rid of loops. For example, I want to apply many convolution masks to a single matrix. Right now, I am looping. Let MyArray be an array of size [s s BatchNum] (I convolve a batch of matrices at once instead of doing this for each matrix separately; a batch is about 500 matrices).
    for i = 1:size(convMask, 3)   % one pass per convolution mask
        res(:,:,:,i) = convn(MyArray, convMask(:,:,i));
    end
The matrices are small: s ranges from 1 to 9 right now (it won't go past 21; there is no need), so a batch is at most about [9, 9, 500] in size. I use a few dozen convolution masks at most, each of size < 9.
I want to get rid of the loop to make it even faster, but I don't know how to convolve with many masks at once.
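One idea I could try, sketched here with made-up sizes and random data: since convn accepts N-D arrays, the masks could be moved into the 4th dimension so that a single call returns the same [.., .., BatchNum, numMasks] result as the loop. This assumes all masks share one size (which they do if they are stored as convMask(:,:,i)). The names m, numMasks and masks4d are placeholders of mine, and I have not measured whether this is actually faster.

```matlab
% Hypothetical sizes, chosen only for illustration
s = 9; BatchNum = 500; m = 3; numMasks = 12;
MyArray  = rand(s, s, BatchNum);
convMask = rand(m, m, numMasks);

% Reference result: one convn call per mask
resLoop = zeros(s+m-1, s+m-1, BatchNum, numMasks);
for i = 1:numMasks
    resLoop(:,:,:,i) = convn(MyArray, convMask(:,:,i));
end

% Single call: masks live in dim 4, with a singleton dim 3,
% so each batch slice is convolved with each mask independently
masks4d = reshape(convMask, m, m, 1, numMasks);
resOnce = convn(MyArray, masks4d);

% Largest absolute difference between the two results
maxDiff = max(abs(resLoop(:) - resOnce(:)));
```

If maxDiff is at rounding-error level, the two versions agree, and timing both (for example with timeit) would show whether the single call is worth it.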
The next optimization I plan after this one is moving the computations to the GPU. All the operations I need to do on the matrices are fully supported on the GPU (and please correct me if I'm wrong: looping shouldn't be a problem on the GPU).
But I don't fully understand how storage on the GPU works: if I have a class (inheriting from handle), can I move most of the variables in the class to the GPU in the constructor, and keep them there even as I pass an instance of the class from function to function? If the data remains on the GPU, I assume the convolutions will go faster.
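To make the question concrete, this is roughly what I picture. It is a sketch only: it assumes the Parallel Computing Toolbox and a supported GPU, and falls back to the CPU otherwise. The data is sent to the GPU once with gpuArray, the convolutions run on it there, and gather brings the result back only at the end. In my real code the gpuArray would be a property of the handle class; the sizes and variable names here are placeholders.

```matlab
% Use the GPU only if the toolbox function exists and a device is present
useGPU = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;

MyArray  = rand(9, 9, 500);          % hypothetical batch
convMask = rand(3, 3, 12);           % hypothetical masks
numMasks = size(convMask, 3);
res      = zeros(11, 11, 500, numMasks);   % 9 + 3 - 1 = 11

if useGPU
    % Transfer once; the arrays stay on the device from here on
    MyArray  = gpuArray(MyArray);
    convMask = gpuArray(convMask);
    res      = gpuArray(res);
end

for i = 1:numMasks
    % With gpuArray inputs, convn returns a gpuArray
    res(:,:,:,i) = convn(MyArray, convMask(:,:,i));
end

onGPU = isa(res, 'gpuArray');        % true while the data is still on the device

if useGPU
    res = gather(res);               % copy back to host memory only at the end
end
```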
Does parallelizing operations for the GPU have as much overhead as parfor? (I don't even know if parfor is what's used to parallelize on the GPU)
Thank you, Ayal Shwartz