Efficient training of LSTM network with GPU

Hi all,
I recently got a machine with a GPU and am currently trying to refactor my LSTM code to take advantage of it. However, my implementation shows no speed improvement; in fact, the CPU version runs faster than the GPU version. The code below benchmarks the basic LSTM forward pass for comparison. Could anyone give some advice on how to exploit the GPU's potential for an LSTM? I tried pagefun, arrayfun and bsxfun, but none of them seemed to improve the speed.
This one is for the GPU.
function LSTM_gpu2()
vis = 700; hid = 500;
T = 80; epochs = 10;
sigmoid = @(x) 1./(1+exp(-x));
x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
W_z = rand(hid,vis,'gpuArray'); W_i = rand(hid,vis,'gpuArray');
W_f = rand(hid,vis,'gpuArray'); W_o = rand(hid,vis,'gpuArray');
R_z = rand(hid,hid,'gpuArray'); R_i = rand(hid,hid,'gpuArray');
R_f = rand(hid,hid,'gpuArray'); R_o = rand(hid,hid,'gpuArray');
P_i = diag(rand(hid,1,'gpuArray')); P_f = diag(rand(hid,1,'gpuArray'));
P_o = diag(rand(hid,1,'gpuArray'));
b_z = rand(hid,1,'gpuArray'); b_i = rand(hid,1,'gpuArray');
b_f = rand(hid,1,'gpuArray'); b_o = rand(hid,1,'gpuArray');
I = zeros(hid,T,'gpuArray'); F = zeros(hid,T,'gpuArray');
O = zeros(hid,T,'gpuArray'); G = zeros(hid,T,'gpuArray');
x = gpuArray(x); h = gpuArray(h); c = gpuArray(c);
tic;
for i = 1:epochs
    for t = 1:T
        G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);
        I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);
        F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);
        c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);
        O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o);
        h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);
    end
    % backprop
    % update
end
toc;
return;
And this one is for the CPU.
function LSTM_cpu()
vis = 700; hid = 500;
T = 80; epochs = 10;
sigmoid = @(x) 1./(1+exp(-x));
x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
W_z = rand(hid,vis); W_i = rand(hid,vis);
W_f = rand(hid,vis); W_o = rand(hid,vis);
R_z = rand(hid,hid); R_i = rand(hid,hid);
R_f = rand(hid,hid); R_o = rand(hid,hid);
P_i = diag(rand(hid,1)); P_f = diag(rand(hid,1));
P_o = diag(rand(hid,1));
b_z = rand(hid,1); b_i = rand(hid,1);
b_f = rand(hid,1); b_o = rand(hid,1);
I = zeros(hid,T); F = zeros(hid,T);
O = zeros(hid,T); G = zeros(hid,T);
tic;
for i = 1:epochs
    for t = 1:T
        G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);
        I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);
        F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);
        c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);
        O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o);
        h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);
    end
    % backprop
    % update
end
toc;
return;
OS: Windows 10,
GPU: NVIDIA Quadro M5000,
CPU: Intel i7-5820K,
MATLAB: R2016a
Thank you,
Yuto Ozaki
  1 Comment
Yuto Ozaki on 10 Apr 2016
Edited: Yuto Ozaki on 10 Apr 2016
Additional question:
Some papers [1][2] use an affine-transform notation to make the computation more compact, but they do not use peephole connections. In fact, Chainer's LSTM model does not implement peephole connections, and TensorFlow provides LSTM models both with and without them. To pursue computational efficiency, would omitting the peephole connections be the current best practice? Without peepholes, all of the affine transforms can be done at once, which I think leads to more GPU-friendly code (a sketch follows the references below).
[1] Kelvin Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," 2015.
[2] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals, "Recurrent Neural Network Regularization," 2014.
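For reference, here is a minimal sketch of the combined affine transform I have in mind, reusing the names from my code above (this is my reading of [1] and [2], not their exact implementation; backprop omitted):
W = [W_z; W_i; W_f; W_o];   % 4*hid-by-vis, all gate input weights stacked
R = [R_z; R_i; R_f; R_o];   % 4*hid-by-hid, all gate recurrent weights stacked
b = [b_z; b_i; b_f; b_o];   % 4*hid-by-1
for t = 1:T
    a = W*x(:,:,t) + R*h(:,:,t) + b;       % all pre-activations in two GEMMs
    G(:,t) = tanh(   a(      1:hid  ));
    I(:,t) = sigmoid(a(  hid+1:2*hid));
    F(:,t) = sigmoid(a(2*hid+1:3*hid));
    O(:,t) = sigmoid(a(3*hid+1:4*hid));    % no peephole terms anywhere
    c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);
    h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);
end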


Accepted Answer

Joss Knight on 15 Apr 2016
To get good performance out of the GPU, you need to give it a lot of data to process. Your best bet is to vectorize your code to remove the inner loop. Your sigmoid and tanh activation functions, for instance, are element-wise operators and so should vectorize trivially, while your matrix multiplies can be executed in batch using pagefun.
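For example, something along the following lines (a rough sketch based on the code in your question, reusing your variable names; backprop omitted) hoists the four input-to-hidden products out of the time loop so that each becomes a single large matrix multiply. bsxfun handles the bias because implicit expansion is not available in R2016a:
X  = reshape(x, vis, T);                   % vis-by-T, all time steps at once
Zx = bsxfun(@plus, W_z*X, b_z);            % hid-by-T, one large GEMM per gate
Ix = bsxfun(@plus, W_i*X, b_i);
Fx = bsxfun(@plus, W_f*X, b_f);
Ox = bsxfun(@plus, W_o*X, b_o);
for t = 1:T
    G(:,t) = tanh(   Zx(:,t) + R_z*h(:,:,t));
    I(:,t) = sigmoid(Ix(:,t) + R_i*h(:,:,t) + P_i*c(:,:,t));
    F(:,t) = sigmoid(Fx(:,t) + R_f*h(:,:,t) + P_f*c(:,:,t));
    c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);
    O(:,t) = sigmoid(Ox(:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1));
    h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);
end
The recurrent multiplies still have to stay inside the loop, but the input-side work now runs as four large GEMMs over the whole sequence, which keeps the GPU much busier per call.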
Alternatively, have you considered using the new Deep Learning features in the Neural Network Toolbox in MATLAB R2016a, or the free 3rd party deep learning solution MatConvNet?
  2 Comments
Yuto Ozaki on 16 Apr 2016
Joss,
Thank you for your reply. I just tried training with larger samples in mini-batches and it ran about 35% faster on the GPU. However, I think removing the inner loop would be challenging: an RNN takes its input from the previous time step's state, so the sequential for-loop over time seems essential to the algorithm.
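Roughly, the mini-batched forward pass looks like this (a sketch rather than my exact code; B is an illustrative batch size, and only the forward pass is shown):
B = 128;                                   % hypothetical mini-batch size
x = rand(vis, B, T, 'gpuArray');           % B independent sequences side by side
h = zeros(hid, B, T+1, 'gpuArray'); c = h;
for t = 1:T
    % the multiplies now produce hid-by-B results, so far more work per call
    Gt = tanh(   bsxfun(@plus, W_z*x(:,:,t) + R_z*h(:,:,t), b_z));
    It = sigmoid(bsxfun(@plus, W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t), b_i));
    Ft = sigmoid(bsxfun(@plus, W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t), b_f));
    c(:,:,t+1) = Gt.*It + c(:,:,t).*Ft;
    Ot = sigmoid(bsxfun(@plus, W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1), b_o));
    h(:,:,t+1) = tanh(c(:,:,t+1)).*Ot;
end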
I have checked the Neural Network Toolbox, but it does not seem to implement RNNs. My main interest is music information retrieval, so time-series models such as RNNs and their variants are my main focus.
Joss Knight on 20 Apr 2016
Support for RNNs is considered high priority by the development team. Meanwhile, take a look at MatConvNet.


