Why is my code running slower on the GPU?

AlexRD on 30 Mar 2021
Commented: AlexRD on 5 Apr 2021
I've been writing a deep learning neural network model from scratch so that I can build an intuitive understanding of how these models work. The code works fine, and I've spent a lot of time optimizing it, but I seem to have hit a bottleneck in the GPU code. I've implemented a dynamic network using structures, where the index into the struct array represents the layer depth. The model uses sigmoid activation functions and a cross-entropy cost function.
First things first, there are three files: the main script, and the backprop and feed_forward functions.
The main script
clear all; close all;
%% Load Data
load('numbers.mat');
% Convert each digit label (0-9) into a one-hot row vector
for i=1:length(numbers)
    temp = numbers(i).label;
    numbers(i).label = zeros(1,10);
    numbers(i).label(temp+1) = 1;
end
validation = numbers(1:10000);
training = numbers(10001:end);
%% Hyperparameters
batch_size = 10;
numEpochs = 5;
rateFunc = interp1(0.5 ./ [1:20], linspace(1, 20, numEpochs));
numInput = size(training(1).data, 1) * size(training(1).data, 2);
%% Initialization
net = create_net([numInput 100 10]);
numLayers = length(net);
average = [];
%% Main
for epoch=1:numEpochs
    tic;  % start the per-epoch timer that toc reads below
    %% Backprop
    randIndex = randperm(size(training,2));
    for i=1:batch_size:length(training)-batch_size
        [net, gradient] = backprop(net, training(randIndex(i:i+batch_size-1)), rateFunc(epoch));
    end
    %% Validate Net
    fprintf('Epoch(%d): %fs', epoch, toc);
    [average(end+1), error] = validate_net(net, validation);
    if mod(epoch, 5) == 0
        % Note: validate_net returns the fraction classified correctly
        train_error = validate_net(net, training);
        fprintf('\nError(Training): %f\n', train_error);
        if train_error >= 0.99, break; end
    end
    fprintf('\nError: %f', average(end));
    fprintf('\n---------------\n');
end
%% Functions
function [average, error] = validate_net(net, inputData)
    % average: fraction of samples classified correctly
    % error:   one [predicted, actual] digit pair per sample
    error = [];
    for i=1:size(inputData,2)
        layer = feed_forward(net, inputData(i).data);
        [~,ix] = max(layer(end).a);
        [~,iy] = max(inputData(i).label);
        error = [error; [ix-1, iy-1]];
    end
    average = mean(error(:,1) == error(:,2));
end
function net = create_net(structure)
    % structure lists the layer sizes, e.g. [numInput 100 10]
    numLayers = length(structure) - 1;
    net = struct('b', [], 'w', cell(1, numLayers));
    for i=1:numLayers
        net(i).w = (randn(structure(i), structure(i+1))/sqrt(structure(i)));
        net(i).b = (randn(1, structure(i+1)));
    end
end
function [net, gradient] = backprop(net, inputData, rate)
    numLayers = length(net);
    delta = struct('b', [], 'w', cell(1, length(net)));
    gradient = struct('b', 0, 'w', num2cell(zeros(1, length(net))));
    % Accumulate the gradient one sample at a time
    for i=1:length(inputData)
        layer = feed_forward(net, inputData(i).data);
        % Output layer: sigmoid with cross-entropy gives delta = a - y
        delta(numLayers).b = layer(numLayers).a - inputData(i).label;
        delta(numLayers).w = layer(numLayers-1).a' * delta(numLayers).b;
        % Hidden layers: propagate delta back through the sigmoid derivative
        for L=numLayers-1:-1:2
            delta(L).b = (delta(L+1).b * net(L+1).w') .* 1./(1 + exp(-layer(L).z)) .* (1 - 1./(1 + exp(-layer(L).z)));
            delta(L).w = layer(L-1).a' * delta(L).b;
        end
        % First layer uses the raw input instead of a previous activation
        delta(1).b = (delta(2).b * net(2).w') .* 1./(1 + exp(-layer(1).z)) .* (1 - 1./(1 + exp(-layer(1).z)));
        delta(1).w = inputData(i).data' * delta(1).b;
        for L=1:numLayers
            gradient(L).b = gradient(L).b + delta(L).b;
            gradient(L).w = gradient(L).w + delta(L).w;
        end
    end
    % Gradient descent step using the batch-averaged gradient
    for L=1:numLayers
        net(L).b = net(L).b - rate/length(inputData)*gradient(L).b;
        net(L).w = net(L).w - rate/length(inputData)*gradient(L).w;
    end
end
function layer = feed_forward(net, inputData)
    % z is the weighted input and a the sigmoid activation of each layer
    layer = struct('z', [], 'a', cell(1, length(net)));
    layer(1).z = inputData * net(1).w + net(1).b;
    layer(1).a = 1./ (1 + exp(-layer(1).z));
    for i=2:length(net)
        layer(i).z = layer(i-1).a * net(i).w + net(i).b;
        layer(i).a = 1./ (1 + exp(-layer(i).z));
    end
end
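The forward pass above handles one sample per call, so every matrix product involves a single 1-by-N row vector. As a rough sketch (not the author's code), the same arithmetic can be applied to a whole mini-batch by stacking the samples as rows of a matrix. The sizes, the random data, and the variable names below are made up for illustration, and the bias addition assumes implicit expansion (R2016b or later).

```matlab
% Hypothetical sizes and random data, for illustration only.
rng(0);
sizes = [784 100 10];
sigmoid = @(z) 1 ./ (1 + exp(-z));
W1 = randn(sizes(1), sizes(2)) / sqrt(sizes(1));
b1 = randn(1, sizes(2));
W2 = randn(sizes(2), sizes(3)) / sqrt(sizes(2));
b2 = randn(1, sizes(3));

B = 10;                       % mini-batch size
X = rand(B, sizes(1));        % one sample per row

% Per-sample loop: B small vector-matrix products per layer.
outLoop = zeros(B, sizes(3));
for i = 1:B
    a1 = sigmoid(X(i,:) * W1 + b1);
    outLoop(i,:) = sigmoid(a1 * W2 + b2);
end

% Batched: one matrix-matrix product per layer. The 1-by-N bias row
% is added to every row of the product by implicit expansion.
A1 = sigmoid(X * W1 + b1);
outBatch = sigmoid(A1 * W2 + b2);

% The two results should agree up to floating-point rounding.
maxDiff = max(abs(outLoop(:) - outBatch(:)));
fprintf('Max difference between loop and batch: %g\n', maxDiff);
```

Since feed_forward only uses matrix products and element-wise operations, it should accept such a stacked matrix unchanged; backprop would still need its per-sample loop rewritten to benefit.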
The dataset I'm using is the classic MNIST digit recognition problem, and I've been able to get close to 98% accuracy on it. Each epoch takes roughly 5 seconds to run, but on the GPU it takes about six times as long. I use the GPU by changing the create_net function, like so:
function net = create_net(structure)
    numLayers = length(structure) - 1;
    net = struct('b', [], 'w', cell(1, numLayers));
    for i=1:numLayers
        % Weights and biases are created on the CPU, then copied to the GPU
        net(i).w = gpuArray(randn(structure(i), structure(i+1))/sqrt(structure(i)));
        net(i).b = gpuArray(randn(1, structure(i+1)));
    end
end
Am I doing something wrong here? I would appreciate any feedback on optimizing the code and on how to solve this GPU issue.
Thanks for reading
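In the GPU variant above, only the weights and biases are gpuArrays. The samples and labels stay in host memory, so each operation that mixes the two has to copy a sample to the device first. That is probably one contributor to the slowdown, though not necessarily the only one. A minimal sketch of copying the data once up front follows. The `samples` struct array is a hypothetical stand-in for the training set, and `canUseGPU` assumes R2019b or later; the script falls back to the CPU when no supported GPU is found.

```matlab
% Hypothetical stand-in for the training struct array (random data).
rng(0);
samples = struct('data', cell(1, 100));
for i = 1:numel(samples)
    samples(i).data = rand(1, 784);
end

X = vertcat(samples.data);            % 100-by-784 matrix in host memory
W = randn(784, 100) / sqrt(784);

useGPU = canUseGPU();                 % false when no supported GPU is available
if useGPU
    X = gpuArray(X);                  % one host-to-device copy for all samples
    W = gpuArray(W);
end

Z = X * W;                            % both operands live in the same memory

if useGPU
    Z = gather(Z);                    % copy back only when the host needs the values
end
fprintf('Result size: %d-by-%d\n', size(Z, 1), size(Z, 2));
```

Calling gather inside the training loop would force a device-to-host copy on every iteration, so it is usually kept for final results such as the validation accuracy.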
AlexRD on 31 Mar 2021
Increasing the batch size has little effect on the time it takes to finish an epoch in my algorithm. I think that's because the number of calculations per epoch is fixed. The time it takes to train the network, however, increases significantly.
Changing it from 10 to 100 actually gives me a better time per epoch, since I imagine there are fewer function calls to backprop (from ~5s on the CPU to ~4.5s, and the same for the GPU), but the time it takes for the network to fully finish training increases in proportion to the batch size.
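The slower convergence is consistent with the update rule in backprop: the gradient is averaged over the batch, so a batch of 100 takes steps of similar size to a batch of 10 but makes ten times fewer of them per epoch. One common heuristic is to scale the learning rate linearly with the batch size. This is an assumption brought in for illustration, not something stated in the thread. The sketch below uses the script's first-epoch rate of 0.5 at batch size 10 as the baseline.

```matlab
% Linear scaling heuristic (an assumption, not from the original code):
% keep rate / batch_size constant relative to a known-good baseline.
baseBatch = 10;     % batch size used in the original script
baseRate  = 0.5;    % first-epoch rate from rateFunc in the original script

batchSizes = [10 50 100];
rates = baseRate * batchSizes / baseBatch;

for k = 1:numel(batchSizes)
    fprintf('batch_size = %3d -> starting rate = %.2f\n', batchSizes(k), rates(k));
end
```

Large rates can make training diverge, so the scaled value is best treated as a starting point for tuning and not as a guaranteed setting.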


Accepted Answer

Joss Knight on 31 Mar 2021
Increasing the batch size alone cannot improve convergence in a simple MLP; you need to match it with an increase in the learning rate.
But more to the point of your question, does increasing the batch size improve the GPU performance relative to the CPU?
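One way to check this is to time a single layer's matrix product at several batch sizes on both devices. The following is a rough sketch and not a full benchmark of the network. The layer sizes match the script, but the batch sizes and variable names are arbitrary choices, and the GPU timings are only collected when `canUseGPU` (R2019b or later) reports a supported device.

```matlab
% Time X * W for one 784-by-100 layer at several batch sizes.
rng(0);
numInput = 784;
numHidden = 100;
W = randn(numInput, numHidden) / sqrt(numInput);

batchSizes = [1 10 100 1000];          % arbitrary values for illustration
tCPU = zeros(size(batchSizes));
tGPU = nan(size(batchSizes));          % stays NaN when no GPU is available

useGPU = canUseGPU();
if useGPU
    Wg = gpuArray(W);
end

for k = 1:numel(batchSizes)
    X = rand(batchSizes(k), numInput);
    tCPU(k) = timeit(@() X * W);
    if useGPU
        Xg = gpuArray(X);
        % gputimeit waits for the GPU to finish before stopping the clock
        tGPU(k) = gputimeit(@() Xg * Wg);
    end
    fprintf('batch %4d: CPU %.3g s, GPU %.3g s\n', batchSizes(k), tCPU(k), tGPU(k));
end
```

Dividing each time by its batch size gives the cost per sample. If the GPU only pulls ahead at the larger sizes, that points to per-call overhead on small arrays as the main cost.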
