Converting parallel CPU processing into GPU processing

Question

I am trying to convert code that ran in parallel on CPU cores into parallel processing on the GPU.

I would like to process matrices in a cell array on the GPU in parallel for how many cores are present on the gpu. However, it runs significantly slower than parfor on 4 CPU cores: 25 cells were processed in 30 minutes on 4 CPU cores, whereas 5 cells have already taken over 45 minutes on the GPU and are still not finished. I'm very new to GPU computing, and I could not find an obvious way to speed this up.

GPU properties:

Data to be processed:

series is a 568x1 cell array
each cell is a 60x60 double (each entry is a value between -1 and 1)

Start processing

tic % test 
for i = 1:5
    cell_array{i} = gpuArray(cleanSeries{i});
end

Determine size of matrix within the first cell, equivalent to number of biological cells recorded

numCells = gpuArray(length(cell_array{1}));

Preallocate arrays for data

clust_mean = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
clust_std = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
clust_random_mean = gpuArray(NaN(length(cell_array{1}),length(cell_array)));
clust_random_std = gpuArray(NaN(length(cell_array{1}),length(cell_array)));

Initiate the processing

parfor cellNumber = 1:length(cell_array)
    threshold_clust = gpuArray(NaN(numCells,100));
    random_clust = gpuArray(NaN(numCells,100));
    % process data over varying proportional thresholds, from the 25%
    % strongest connections to fully connected (100%) in 25% steps,
    % i.e. 25%, 50%, 75%, 100%
    for threshold = 25:25:100
        threshold_matrix = (threshold_proportional(cell_array{cellNumber}, threshold/100)); % proportional threshold matrix - custom function
        % clustering requires that all values be between 0 and 1 so remove
        % any negatives
        threshold_matrix(threshold_matrix < 0) = 0;      
        % ensure that randomizing the matrix is possible
        [rowi,coli] = find(tril(threshold_matrix));
        bothi = [rowi coli];
        c = bothi(1,1);
        d = bothi(1,2);
        e=find(c==bothi);
        f=find(d==bothi);
        if length(e)==length(bothi)||length(f)==length(bothi) 
            disp(['One cell has all the connections, skipping ', int2str(threshold), '% threshold.'])
            threshold_clust(:,threshold) = NaN(numCells,1);
            random_clust(:,threshold) = NaN(numCells,1);
        elseif length(bothi) <=3
            threshold_clust(:,threshold) = NaN(numCells,1);
            random_clust(:,threshold) = NaN(numCells,1);
        else
            % create random matrix - custom function
            random_matrix = latmio_und(threshold_matrix,1000); 
            % clustering coefficient per matrix - custom function
            threshold_clust(:,threshold) = clustering_coef_wu(threshold_matrix);
            random_clust(:,threshold) = clustering_coef_wu(random_matrix);
        end % if logic end
    end % for loop end
    % concatenate over thresholds
    clust_mean(:,cellNumber) = mean(threshold_clust,2,'omitnan');
    clust_std(:,cellNumber) = std(threshold_clust,0,2,'omitnan');
    clust_random_mean(:,cellNumber) = mean(random_clust,2,'omitnan');
    clust_random_std(:,cellNumber) = std(random_clust,0,2,'omitnan');
     
end % parfor loop end
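% copy the results back to CPU memory (gather returns the CPU copy, so assign it)
clust_mean = gather(clust_mean);
clust_std = gather(clust_std);
clust_random_std = gather(clust_random_std);
clust_random_mean = gather(clust_random_mean);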
toc
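
As a side note on the loop above: indexing the result columns by the threshold value (25, 50, 75, 100) leaves 96 of the 100 preallocated columns as NaN. A minimal sketch of indexing by loop position instead is shown here. The data and the per-threshold computation are hypothetical placeholders standing in for `threshold_proportional` and `clustering_coef_wu`, which are not reproduced.

```matlab
% Hypothetical stand-in data: one 60x60 matrix with values in [-1, 1].
rng(0);
M = 2*rand(60) - 1;
numCells = size(M, 1);

thresholds = 25:25:100;                        % proportional thresholds in percent
threshold_clust = NaN(numCells, numel(thresholds));

for k = 1:numel(thresholds)
    % Placeholder for the real per-threshold computation (assumption):
    % clip negatives, then scale the row means by the threshold fraction.
    clipped = max(M, 0);
    threshold_clust(:, k) = mean(clipped, 2) * thresholds(k) / 100;
end

% Summaries over thresholds, as in the original code.
clust_mean = mean(threshold_clust, 2, 'omitnan');
clust_std  = std(threshold_clust, 0, 2, 'omitnan');
```

The summaries over thresholds are unchanged by this, because `'omitnan'` already ignored the unused columns; the arrays are just 25 times smaller.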

6 Comments

Walter Roberson on 11 Mar 2022

Multiprocessor Count appears to give the number of SMMs (Streaming Maxwell Multiprocessors; see https://developer.nvidia.com/blog/5-things-you-should-know-about-new-maxwell-gpu-architecture/), which appear to handle logic control and scheduling. If the material in the blog is relevant to your release and I have understood it correctly, it looks like your GPU has access to 22*4 = 88 schedulers. Each GPU core is controlled by one scheduler, and all of the GPU cores currently controlled by the same scheduler are required to process exactly the same instruction (there is a bit mask to tell particular cores to "sit this one out").

Everything having to do with microkernels and scheduling is at a level you cannot control. Most of it is pre-programmed, either by MathWorks or by NVIDIA.

As far as your code is concerned, it only has access to one GPU. How tasks get scheduled for that is completely behind-the-scenes. Tasks are not necessarily scheduled strictly in the order of the code. If your code asked for A*B to be computed and that was not going to take up the entire scheduling capacity, then potentially C+D could also be computed while A*B is being computed.

If you need control over how computations are scheduled onto the internal SMMs, then use GPU Coder and carefully programmed C code compiled into GPU kernels.

The 22 SMMs are not available to parfor.

The relationship between parfor and the GPU is this: any one worker (or the client) can only select one GPU at a time. So if you have multiple GPUs, you can use parfor to work on independent GPUs.

And of course, you might choose to program in such a way that a parfor worker that knows there is no GPU available for it does some of the work on the CPU instead.

You do not get to control different SMMs from different parfor workers.
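
The CPU fallback described above can be sketched roughly as follows. This is only an illustration under assumptions: it relies on `canUseGPU` (R2019b or later), the input data is made up, and the computation is a simple placeholder rather than the clustering code from the question.

```matlab
% Hypothetical stand-in data: 8 matrices of size 60x60 with values in [-1, 1].
rng(0);
series = arrayfun(@(k) 2*rand(60) - 1, (1:8)', 'UniformOutput', false);

results = zeros(60, numel(series));

parfor k = 1:numel(series)
    M = series{k};
    if canUseGPU()
        % This worker has a usable GPU, so move the data there.
        M = gpuArray(M);
    end
    % Placeholder computation that accepts both CPU arrays and gpuArrays.
    M = max(M, 0);                       % remove negatives
    colMeans = mean(M, 1);
    % gather returns the input unchanged if it is already in CPU memory.
    results(:, k) = gather(colMeans(:));
end
```

Each worker still talks to at most one GPU. With a single GPU, several workers would all be sharing that one device, so this pattern is mainly of interest when there is more than one GPU or when some workers are meant to stay on the CPU.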

Douglas Miller on 12 Mar 2022

According to the other post, it sounds like running in parallel isn't feasible on the GPU the way I was hoping. But I had never considered that zeros would process quicker. That will definitely help optimize the code. Thank you so much!

Walter Roberson on 12 Mar 2022

For operations other than pure copying, NaN has to go through a special "Abort" path in all calculations; calculations with it cannot stream the normal way. There also has to be special checking to see whether the NaN is a "signalling NaN", as signalling NaNs are required to raise exceptions whenever they occur.

Inf cannot readily stream either... but I guess a bit more readily than NaN.

Answer 1

Matt J on 12 Mar 2022

Edited: Matt J on 12 Mar 2022

"I would like to process matrices in a cell array on the GPU in parallel for how many cores are present on the gpu."

No, GPU cores cannot act like parpool workers. They are a completely different animal.
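
To illustrate the difference: a GPU is typically fed one large array and applies the same operation to every element, rather than being handed one matrix per core. A minimal sketch, assuming made-up data and a simple element-wise placeholder computation (not the custom functions from the question):

```matlab
% Hypothetical stand-in data: 5 matrices of size 60x60 with values in [-1, 1].
rng(0);
series = arrayfun(@(k) 2*rand(60) - 1, (1:5)', 'UniformOutput', false);

% Stack the cells into one 60x60x5 array so a single call covers every matrix.
stack = cat(3, series{:});

if canUseGPU()
    stack = gpuArray(stack);             % one transfer instead of one per cell
end

% Element-wise work is applied to all pages at once.
stack = max(stack, 0);                   % remove negative values
rowMeans = mean(stack, 2);               % 60x1x5: per-row mean of each matrix

rowMeans = gather(squeeze(rowMeans));    % back to a 60x5 array in CPU memory
```

Whether the actual workload benefits depends on whether functions such as `latmio_und` can be expressed as array operations like these; code built around many small sequential steps generally cannot.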

1 Comment

Douglas Miller on 12 Mar 2022

I was afraid of that. Thank you for the clarification!
