(numbers.mat exceeds the 5 MB upload limit, so here is a link to it: https://drive.google.com/file/d/1GnSfTkDD1GYzy26Y5nhpNXaCZgbNHVlf/view?usp=sharing )

I've been writing a deep learning neural network model from scratch so that I can build an intuitive understanding of how these networks work. The code I've written works fine, and I've spent a great deal of time optimizing it, but I seem to have hit a bottleneck with the GPU code. I've implemented a dynamic network using struct arrays, where the index into the array represents layer depth. The model uses sigmoid activation functions and a cross-entropy cost function.

First things first: there are three files, namely the main script and the backprop and feed_forward functions.

The main script

clc;
clear all; close all;

%% Load Data
load ('numbers.mat');
% Convert each numeric label (0-9) into a one-hot row vector
for i=1:length(numbers)
    temp = numbers(i).label;
    numbers(i).label = zeros(1,10);
    numbers(i).label(temp+1) = 1;
end
validation = numbers(1:10000);
training = numbers(10001:end);

%% Hyperparameters
batch_size = 10;
numEpochs = 5;
% Learning rate per epoch, decaying from 0.5 down to 0.5/20
rateFunc = interp1(0.5 ./ [1:20], linspace(1, 20, numEpochs));
numInput = size(training(1).data, 1) * size(training(1).data, 2);

%% Initialization
net = create_net([numInput 100 10]);
numLayers = length(net);
average = [];

%% Main
for epoch=1:numEpochs
    tic;
    %% Backprop
    randIndex = randperm(size(training,2));
    % +1 so that the final full mini-batch is not skipped
    for i=1:batch_size:length(training)-batch_size+1
        [net, gradient] = backprop(net, training(randIndex(i:i+batch_size-1)), rateFunc(epoch));
    end
    %% Validate Net
    fprintf ('Epoch(%d): %fs', epoch, toc);
    [average(end+1), error] = validate_net(net, validation);
    if mod(epoch, 5) == 0
        train_error = validate_net(net, training);
        fprintf ('\nError(Training): %f\n', train_error);
        if train_error >= 0.99, break; end
    end
    fprintf ('\nError: %f', average(end));
    fprintf ('\n---------------\n');
end

%% Functions
function [average, error] = validate_net(net, inputData)
    % average: fraction of samples classified correctly
    % error:   one [predicted, actual] digit pair per sample
    error = zeros(size(inputData,2), 2);
    for i=1:size(inputData,2)
        layer = feed_forward(net, inputData(i).data);
        [~,ix] = max(layer(end).a);
        [~,iy] = max(inputData(i).label);
        error(i,:) = [ix-1, iy-1];
    end
    % Computed once after the loop instead of on every iteration
    average = mean(error(:,1) == error(:,2));
end

function net = create_net(structure)
    % structure lists the layer sizes, e.g. [numInput 100 10]
    numLayers = length(structure) - 1;
    net = struct('b', [], 'w', cell(1, numLayers));
    for i=1:numLayers
        net(i).w = (randn(structure(i), structure(i+1))/sqrt(structure(i)));
        net(i).b = (randn(1, structure(i+1)));
    end
end

Backprop

function [net, gradient] = backprop(net, inputData, rate)
    numLayers = length(net);
    delta = struct('b', [], 'w', cell(1, length(net)));
    gradient = struct('b', 0, 'w', num2cell(zeros(1, length(net))));

    % Accumulate the gradient over every sample in the mini-batch
    for i=1:length(inputData)
        layer = feed_forward(net, inputData(i).data);

        % Output layer: sigmoid + cross-entropy gives delta = a - y
        delta(numLayers).b = layer(numLayers).a - inputData(i).label;
        delta(numLayers).w = layer(numLayers-1).a' * delta(numLayers).b;

        % Hidden layers: layer(L).a is already sigmoid(layer(L).z),
        % so the derivative is a .* (1 - a)
        for L=numLayers-1:-1:2
            delta(L).b = (delta(L+1).b * net(L+1).w') .* layer(L).a .* (1 - layer(L).a);
            delta(L).w = layer(L-1).a' * delta(L).b;
        end

        % First layer takes the raw input instead of a previous activation
        delta(1).b = (delta(2).b * net(2).w') .* layer(1).a .* (1 - layer(1).a);
        delta(1).w = inputData(i).data' * delta(1).b;

        for L=1:numLayers
            gradient(L).b = gradient(L).b + delta(L).b;
            gradient(L).w = gradient(L).w + delta(L).w;
        end
    end

    % Gradient descent step using the batch-averaged gradient
    for L=1:numLayers
        net(L).b = net(L).b - rate/length(inputData)*gradient(L).b;
        net(L).w = net(L).w - rate/length(inputData)*gradient(L).w;
    end
end
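
For reference, the output-layer line in backprop (delta = a - label) follows from pairing a sigmoid output with the cross-entropy cost. A short derivation, assuming the usual per-sample cost summed over output units j, with y_j the one-hot label and a_j = sigma(z_j):

```latex
\begin{aligned}
C &= -\sum_j \bigl[\, y_j \ln a_j + (1 - y_j)\ln(1 - a_j) \,\bigr],
  \qquad a_j = \sigma(z_j), \quad \sigma'(z_j) = a_j (1 - a_j) \\
\frac{\partial C}{\partial z_j}
  &= -\left( \frac{y_j}{a_j} - \frac{1 - y_j}{1 - a_j} \right) a_j (1 - a_j) \\
  &= -\bigl[\, y_j (1 - a_j) - (1 - y_j)\, a_j \,\bigr] \\
  &= a_j - y_j
\end{aligned}
```

The sigmoid derivative cancels against the denominators, which is why no sigma' factor appears in the output-layer delta while it does appear in the hidden-layer deltas.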

Feed_forward

function layer = feed_forward(net, inputData)
    layer = struct('z', [], 'a', cell(1, length(net)));
    layer(1).z = inputData * net(1).w + net(1).b;
    layer(1).a = 1./ (1 + exp(-layer(1).z));
    for i=2:length(net)
        layer(i).z = layer(i-1).a * net(i).w + net(i).b;
        layer(i).a = 1./ (1 + exp(-layer(i).z));
    end
end
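
Since backprop calls feed_forward once per sample, every matrix product is a small 1-by-N times N-by-M operation. One possible alternative, sketched below, is to push the whole mini-batch through as a single matrix. The function name backprop_batched and its X/Y arguments are hypothetical and not part of the original files; the sketch assumes each sample is a row vector and relies on implicit expansion (R2016b or later) to add the bias row to every row of the batch.

```matlab
function [net, gradient] = backprop_batched(net, X, Y, rate)
% Hypothetical batched variant of backprop (not from the original post).
% X: B-by-numInput matrix, one sample per row.
% Y: B-by-numOutput matrix of one-hot labels.
    numLayers = length(net);
    B = size(X, 1);
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    % Forward pass for the whole batch; a{1} holds the inputs and
    % a{L+1} holds the activations of layer L.
    a = cell(1, numLayers + 1);
    a{1} = X;
    for L = 1:numLayers
        a{L+1} = sigmoid(a{L} * net(L).w + net(L).b);
    end

    % Backward pass; delta is B-by-(number of units in layer L).
    gradient = struct('b', [], 'w', cell(1, numLayers));
    delta = a{end} - Y;                  % sigmoid output + cross-entropy
    for L = numLayers:-1:1
        gradient(L).b = sum(delta, 1);   % sum over the samples in the batch
        gradient(L).w = a{L}' * delta;   % the product sums over samples too
        if L > 1
            delta = (delta * net(L).w') .* a{L} .* (1 - a{L});
        end
    end

    % Gradient descent step using the batch-averaged gradient.
    for L = 1:numLayers
        net(L).b = net(L).b - rate / B * gradient(L).b;
        net(L).w = net(L).w - rate / B * gradient(L).w;
    end
end
```

With the struct array used in the main script, the inputs could be assembled as X = vertcat(batch.data) and Y = vertcat(batch.label), assuming every data field is a 1-by-numInput row (which the product inputData * net(1).w in feed_forward already implies). Whether this helps on the GPU depends on the batch size, so it is worth timing rather than assuming.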

The dataset I'm using is the classic MNIST digit recognition problem, and I've been able to get close to 98% accuracy on it. Each epoch takes roughly 5 seconds on the CPU, but about 6 times as long on the GPU. I move the network to the GPU by changing the create_net function, like so:

function net = create_net(structure)
    numLayers = length(structure) - 1;
    net = struct('b', [], 'w', cell(1, numLayers));
    for i=1:numLayers
        net(i).w = gpuArray(randn(structure(i), structure(i+1))/sqrt(structure(i)));
        net(i).b = gpuArray(randn(1, structure(i+1)));
    end
end

Am I doing something wrong here? I would appreciate any feedback on optimizing the code and on how to solve this GPU issue.

Thanks for reading

Joss Knight
on 31 Mar 2021

Increasing the batch size alone cannot improve convergence in a simple MLP; you need to match it with an increase in the learning rate.
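
As a rough illustration of matching the two, here is a minimal sketch that scales the question's per-epoch rate schedule in proportion to the batch size. Linear scaling is only one common heuristic and is an assumption here, not something stated in this thread; base_batch_size and scale are hypothetical names, and the resulting rates may well need further tuning.

```matlab
% Assumed linear-scaling heuristic: grow the rate with the batch size.
base_batch_size = 10;     % batch size the original schedule was tuned for
batch_size      = 100;    % larger batch size to try
numEpochs       = 5;

scale = batch_size / base_batch_size;

% Same schedule as in the question, multiplied by the scale factor.
rateFunc = scale * interp1(0.5 ./ (1:20), linspace(1, 20, numEpochs));
```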

But more to the point of your question, does increasing the batch size improve the GPU performance relative to the CPU?
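
One way to check this without touching the training code is to time a single-sample product against a batched product on both devices. The sketch below is only illustrative: the sizes are assumptions (784 inputs, 100 hidden units, a batch of 1000), and the GPU branch needs Parallel Computing Toolbox plus a supported GPU (canUseGPU is available from R2019b).

```matlab
% Assumed sizes, chosen to resemble the first layer in the question.
numInput = 784; numHidden = 100; B = 1000;

w = randn(numInput, numHidden);
x = randn(1, numInput);      % one sample, as in the per-sample loop
X = randn(B, numInput);      % B samples stacked as rows

% CPU timings
tCpuSingle = timeit(@() x * w);
tCpuBatch  = timeit(@() X * w);
fprintf('CPU: single %.3e s, batch of %d %.3e s (%.3e s per sample)\n', ...
    tCpuSingle, B, tCpuBatch, tCpuBatch / B);

% GPU timings, only when a supported GPU is present
if canUseGPU()
    wg = gpuArray(w); xg = gpuArray(x); Xg = gpuArray(X);
    tGpuSingle = gputimeit(@() xg * wg);
    tGpuBatch  = gputimeit(@() Xg * wg);
    fprintf('GPU: single %.3e s, batch of %d %.3e s (%.3e s per sample)\n', ...
        tGpuSingle, B, tGpuBatch, tGpuBatch / B);
end
```

Comparing the per-sample figures for the two devices shows whether the GPU only pays off once the batch is large enough to amortize the overhead of each call.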

