Optimising LSTM training on GPU for sequence classification

I'm classifying time sequences using an LSTM. I have a massive dataset and training is impractically slow, despite using a high-performance GPU (a 1080 Ti with 11 GB of RAM). The GPU sits at around 20% utilisation most of the time, and I suspect transfer from memory is slowing it down.

I've moved the 5 GB input array (a cell array of time-series arrays) to the GPU using nndata2gpu. I can't move the target/response array to the GPU because it's a categorical column vector, which is what an LSTM layer with 'OutputMode' set to 'last' requires. When I attempt to train the network it doesn't recognise the response variable and raises an error saying that 'a column vector of categorical data' is required (which is exactly what the array is). If I use the same response array without first moving the input array to the GPU, the network trains fine, albeit slowly.

So is there a way to train a network on the GPU, using training data stored on the GPU while the response array stays in RAM? Or is there another way to train a multi-feature sequence-classification network with a response variable in a format that can reside on the GPU?
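For reference, this is roughly the kind of setup I'm describing (a sketch only; the sizes and variable names here are placeholders rather than my actual code):

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(100, 'OutputMode', 'last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'ExecutionEnvironment', 'gpu', ...
    'MiniBatchSize', 128, ...
    'Shuffle', 'every-epoch');

% XTrain: cell array of [numFeatures x sequenceLength] matrices, one per sequence
% YTrain: categorical column vector, one label per sequence
net = trainNetwork(XTrain, YTrain, layers, options);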

5 Comments

If you suspect transfer speed: what mainboard are you using, and what other PCIe slots (x16, x1 etc.) are used on that board?
Well, LSTM is part of the Deep Learning framework, but nndata2gpu is part of the classic neural network framework, so what you were really looking for is gpuArray(). Still, I very much doubt data transfer is your bottleneck. Utilisation can be reported as low for all sorts of reasons; one of them is simply that many deep learning operations use both the CPU and the GPU, and one can sit idle waiting for the other because some steps just can't be parallelised.
Regardless, the best solution here is for you to post some example code, because it's hard to tell what's going on from a description alone. Certainly, there's no reason why you shouldn't be able to have the data on the GPU and the response on the CPU (indeed the response has to stay on the CPU, because you can't have categorical gpuArray variables).
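For illustration, a sketch of the distinction (variable names here are placeholders, not from your code):

% Classic (shallow) network toolbox style -- not what trainNetwork/LSTM expects:
Xgpu = nndata2gpu(X);

% Deep Learning framework style: convert each sequence matrix with gpuArray...
XTrainGPU = cellfun(@gpuArray, XTrain, 'UniformOutput', false);

% ...or simply keep XTrain on the host and let trainNetwork move each
% mini-batch to the device by setting the execution environment:
options = trainingOptions('adam', 'ExecutionEnvironment', 'gpu');
net = trainNetwork(XTrain, YTrain, layers, options);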
Thanks for your answers. I agree with the comments above regarding other potential bottlenecks. I still can't get it to train on GPU input data with the response in RAM; however, I have increased the minibatch size substantially (from 128 to 2048) and processing has sped up dramatically, with a corresponding increase in GPU utilisation. Although the training characteristics are a little different with bigger minibatches, the trained network seems to converge to the same point as a network trained (more slowly) with smaller minibatches.
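Roughly, the change was just this (shown as a sketch, with the other options unchanged):

options = trainingOptions('adam', ...
    'ExecutionEnvironment', 'gpu', ...
    'MiniBatchSize', 2048, ...   % was 128
    'Shuffle', 'every-epoch');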
Glad to hear you were able to increase the speed. My comment was aiming to point out that your mainboard may run only x8 of the x16 lanes in the slot where your 1080 Ti sits if some other PCIe card occupies the second x16 slot (i.e. the board runs either 1x16 or 2x8). You can check the available lanes with GPU-Z. All the best.
Can you please give some example code, because this could be a bug; but equally, if you are still using nndata2gpu, it may just be a misunderstanding about how to move data to the device.
If you are getting better efficiency by increasing the minibatch size, it means you have a relatively small, fast network: unless you process enough data at once, the performance characteristics are dominated by the overheads of the MATLAB interpreter, file I/O and so forth. Note that, as a general rule, if you increase the minibatch size you can increase the learning rate in proportion. This should give faster convergence.
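As a sketch of that rule of thumb (the numbers here are illustrative, not taken from your setup):

% Scale the learning rate in proportion to the minibatch size.
baseLR     = 1e-3;        % rate that worked at MiniBatchSize = 128
batchScale = 2048/128;    % 16x larger batches
options = trainingOptions('adam', ...
    'MiniBatchSize', 2048, ...
    'InitialLearnRate', baseLR * batchScale);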


Answers (0)

Asked on 9 Aug 2018 · Last commented on 16 Aug 2018
