Converting parallel CPU processing into GPU processing
I am trying to convert code that ran in parallel on CPU cores into parallel processing on the GPU.
I would like to process matrices in a cell array on the GPU in parallel for how many cores are present on the gpu. However, it runs significantly slower than on a parallel pool of 4 CPU cores: 25 cells were processed in 30 minutes on 4 CPU cores, whereas 5 cells have already taken over 45 minutes on the GPU and are still not finished. I'm very new to GPU computing, and I could not find anything obvious that would speed this up.
GPU properties:

Data to be processed:
- series is a 568x1 cell array
- each cell is a 60x60 double (each entry is a value between -1 and 1)
Start processing
tic % test
for i = 1:5
    cell_array{i} = gpuArray(cleanSeries{i});
end
Determine size of matrix within the first cell, equivalent to number of biological cells recorded
numCells = gpuArray(length(cell_array{1}));
Preallocate arrays for data
clust_mean = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
clust_std = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
clust_random_mean = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
clust_random_std = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
Initiate the processing
parfor cellNumber = 1:length(cell_array)
    threshold_clust = gpuArray(NaN(numCells,100));
    random_clust = gpuArray(NaN(numCells,100));
    % process data over varying proportional thresholds, from the 25%
    % strongest connections to fully connected (100%) in 25% steps,
    % i.e. 25%, 50%, 75%, 100%
    for threshold = 25:25:100
        threshold_matrix = (threshold_proportional(cell_array{cellNumber}, threshold/100)); % proportional threshold matrix - custom function
        % clustering requires that all values be between 0 and 1 so remove
        % any negatives
        threshold_matrix(threshold_matrix < 0) = 0;
        % ensure that randomizing the matrix is possible
        [rowi,coli] = find(tril(threshold_matrix));
        bothi = [rowi coli];
        c = bothi(1,1);
        d = bothi(1,2);
        e=find(c==bothi);
        f=find(d==bothi);
        if length(e)==length(bothi)||length(f)==length(bothi)
            disp(['One cell has all the connections, skipping ', int2str(threshold), '% threshold.'])
            threshold_clust(:,threshold) = NaN(numCells,1);
            random_clust(:,threshold) = NaN(numCells,1);
        elseif length(bothi) <=3
            threshold_clust(:,threshold) = NaN(numCells,1);
            random_clust(:,threshold) = NaN(numCells,1);
        else
            % create random matrix - custom function
            random_matrix = latmio_und(threshold_matrix,1000);
            % clustering coefficient per matrix - custom function
            threshold_clust(:,threshold) = clustering_coef_wu(threshold_matrix);
            random_clust(:,threshold) = clustering_coef_wu(random_matrix);
        end % if logic end
    end % for loop end
    % concatenate over thresholds
    clust_mean(:,cellNumber) = mean(threshold_clust,2,'omitnan');
    clust_std(:,cellNumber) = std(threshold_clust,0,2,'omitnan');
    clust_random_mean(:,cellNumber) = mean(random_clust,2,'omitnan');
    clust_random_std(:,cellNumber) = std(random_clust,0,2,'omitnan');
end % parfor loop end
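% copy the results back to host memory (gather returns the host copy, so assign it)
clust_mean = gather(clust_mean);
clust_std = gather(clust_std);
clust_random_std = gather(clust_random_std);
clust_random_mean = gather(clust_random_mean);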
toc
6 Comments
Walter Roberson
on 11 Mar 2022
Edited: Walter Roberson
on 11 Mar 2022
The 980 Ti runs double precision at 1/32 of its single-precision rate, which is about the worst ratio that NVIDIA makes.
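One way to act on this is to do the GPU work in single precision. The following is only a minimal sketch, assuming single precision is accurate enough for the clustering calculation (nothing in this thread confirms that). The random `cleanSeries` below is stand-in data with the same shape as the poster's, and `canUseGPU` needs R2019b or later.

```matlab
% Stand-in for the poster's data: 5 cells, each 60x60 with values in [-1, 1].
cleanSeries = arrayfun(@(k) 2*rand(60) - 1, (1:5)', 'UniformOutput', false);

% Convert to single on the host first, so only half the bytes are transferred.
singleSeries = cellfun(@single, cleanSeries, 'UniformOutput', false);

% Move to the GPU only if a supported device is present.
if canUseGPU()
    cell_array = cellfun(@gpuArray, singleSeries, 'UniformOutput', false);
else
    cell_array = singleSeries;   % CPU fallback so the sketch still runs
end
```

Any arrays created later inside the loop would also need to be single, otherwise mixed operations can end up back in double.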
Douglas Miller
on 11 Mar 2022
Walter Roberson
on 11 Mar 2022
Multiprocessor Count appears to give the number of SMMs (Maxwell streaming multiprocessors; see https://developer.nvidia.com/blog/5-things-you-should-know-about-new-maxwell-gpu-architecture/), which appear to handle logic control and scheduling. If the material in the blog is relevant to your release and I have understood it correctly, it looks like your GPU has access to 22*4 = 88 schedulers. Each GPU core is controlled by one scheduler, and all of the GPU cores currently being controlled by the same scheduler are required to process exactly the same instruction (and there is a bit mask to tell particular cores to "sit this one out").
Everything having to do with microkernels and scheduling is at a level you cannot control. Most of it is pre-programmed, either by MathWorks or by NVIDIA.
As far as your code is concerned, it only has access to one GPU. How tasks get scheduled for that is completely behind-the-scenes. Tasks are not necessarily scheduled strictly in the order of the code. If your code asked for A*B to be computed and that was not going to take up the entire scheduling capacity, then potentially C+D could also be computed while A*B is being computed.
If you need control over scheduling of the computations to internal SMM, then use GPU Coder and carefully-programmed C code compiled into GPU kernels.
The 22 SMM are not available for parfor.
The relationship between parfor and GPU is this: any one worker (or the client) can only select one GPU at a time. So if you have multiple GPUs then you can use parfor to work on independent GPUs.
And of course, you might choose to program in such a way that a parfor worker that knows there is no GPU available for it might choose to do some work on CPU.
You do not get to control different SMM by different parfor workers.
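The points above can be sketched roughly as follows. This is an illustration under assumptions, not the poster's pipeline: `data` and the per-item work (clipping negatives and summing) are hypothetical placeholders. Each iteration uses whatever GPU its worker has selected, and falls back to the CPU when `canUseGPU` (R2019b or later) reports none.

```matlab
% Hypothetical input: 8 matrices of 60x60 values in [-1, 1].
data = arrayfun(@(k) 2*rand(60) - 1, (1:8)', 'UniformOutput', false);
result = zeros(numel(data), 1);

parfor k = 1:numel(data)
    A = data{k};
    if canUseGPU()
        A = gpuArray(A);          % runs on this worker's selected GPU
    end
    A(A < 0) = 0;                 % same code path on CPU or GPU
    result(k) = gather(sum(A, 'all'));   % gather is a no-op for CPU arrays
end
```

With a single GPU, every worker that takes the GPU branch shares the same device, so adding workers does not add GPU capacity.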
Walter Roberson
on 11 Mar 2022
clust_mean = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
There are technical reasons why working with NaN can be slower than working with inf, and technical reasons why working with inf can be slower than working with finite values.
Also, that code allocates the NaN array on the CPU and then copies it into the GPU.
Putting those two together, you can be more efficient:
clust_mean = zeros(length(cell_array{1}), length(cell_array), 'gpuArray');
Douglas Miller
on 12 Mar 2022
Walter Roberson
on 12 Mar 2022
For operations other than pure copying, NaN has to go through a special "Abort" path in all calculations; calculations with it cannot stream the normal way. There also has to be special checking to see whether the NaN is a "signalling NaN", as signalling NaNs are required to raise exceptions whenever they occur.
inf cannot readily stream either... but I guess a bit more readily than NaN.
Answers (1)
I would like to process matrices in a cell array on the GPU in parallel for how many cores are present on the gpu.
No, GPU cores cannot act like parpool workers. They are a completely different animal.
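What the GPU does handle well is one operation applied to a large array at once. As a hedged sketch of that style, assuming every cell holds a matrix of the same size, the cells can be stacked into a 3-D array and processed with single vectorized calls. Only the negative-clipping step from the question is shown; the custom functions (`threshold_proportional`, `latmio_und`, `clustering_coef_wu`) would have to be rewritten to accept a stack, which is not attempted here. The random `cleanSeries` is stand-in data.

```matlab
% Stand-in for the poster's data: 5 cells, each 60x60 with values in [-1, 1].
cleanSeries = arrayfun(@(k) 2*rand(60) - 1, (1:5)', 'UniformOutput', false);

% Stack the cells into one 60x60x5 array and transfer it in a single copy.
stack = cat(3, cleanSeries{:});
if canUseGPU()                    % R2019b or later
    stack = gpuArray(stack);
end

% One call clips the negatives in every page at once.
stack = max(stack, 0);

% Reductions also work across the whole stack: row means per page (60x5).
rowMeans = gather(squeeze(mean(stack, 2)));
```

The loop over cells disappears, so there is nothing left for parfor to distribute; the parallelism comes from the array operations themselves.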
1 Comment
Douglas Miller
on 12 Mar 2022