Optimising LSTM training on GPU for sequence classification
I'm classifying time sequences using an LSTM. I have a massive dataset and training is unfeasibly slow despite using a high-performance GPU with 11 GB of RAM (1080Ti). The GPU runs at only about 20% utilisation most of the time, and I suspect transfer from memory is slowing it down. I've moved the 5 GB input array (a cell array of time series arrays) to the GPU using nndata2gpu. I can't move the target/response array to the GPU because it is a categorical column, as required by an LSTM layer with 'OutputMode' of 'last'.

When I attempt to train the network it won't recognise the response variable: it generates an error message saying that 'a column vector of categorical data' is required (which is exactly what the array is). When I use the same response array without first moving the input array to the GPU, the network trains fine (albeit slowly).

So is there a way of training a network on the GPU, using training data stored on the GPU, while the response array is in RAM? Or is there another way to train a multi-feature sequence classification network with a response variable in a format that can reside on the GPU?
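For reference, a minimal sketch of the kind of network I mean (layer sizes, option values, and variable names here are placeholders, not my actual settings):

```matlab
% Sketch of an LSTM sequence classification setup (placeholder sizes).
numFeatures = 12;       % features per time step
numHiddenUnits = 100;
numClasses = 4;

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits, 'OutputMode', 'last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'ExecutionEnvironment', 'gpu', ...
    'MaxEpochs', 30, ...
    'Verbose', false);

% X is a cell array of [numFeatures x numTimeSteps] matrices;
% Y is a categorical column vector of class labels.
net = trainNetwork(X, Y, layers, options);
```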
5 Comments
Paul Siefert
on 10 Aug 2018
If you suspect transfer speed: what mainboard are you using, and what other PCIe slots (x16, x1 etc.) are used on that board?
Joss Knight
on 15 Aug 2018
Edited: Joss Knight
on 15 Aug 2018
Well, LSTM is part of the Deep Learning framework, but nndata2gpu is part of the classic neural network framework. So what you were really looking for is gpuArray(). Still, I very much doubt data transfer is your bottleneck. Utilisation is reported low for all sorts of reasons; one could simply be that many deep learning operations use both the CPU and the GPU, and one can sit idle waiting for the other to finish, because some things just can't be parallelised.
Regardless, the best solution here is for you to post some example code because it's hard to tell what's going on from a description alone. Certainly, there's no reason why you shouldn't be able to have the data on the GPU and the response on the CPU (well, it has to be, because you can't have categorical gpuArray variables).
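As an untested sketch of what I mean (XTrain, YTrain, layers and options stand in for your own variables): convert the contents of the cell array with gpuArray and leave the categorical response on the CPU. In practice, trainNetwork with 'ExecutionEnvironment' set to 'gpu' transfers each minibatch for you anyway, so this is rarely needed.

```matlab
% Move the sequence data to the GPU with gpuArray, not nndata2gpu.
% XTrain is a cell array of feature-by-time matrices; YTrain stays a
% categorical column vector on the CPU (there is no categorical gpuArray).
XTrainGPU = cellfun(@gpuArray, XTrain, 'UniformOutput', false);

net = trainNetwork(XTrainGPU, YTrain, layers, options);
```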
Leo Nunnink
on 16 Aug 2018
Paul Siefert
on 16 Aug 2018
Glad to hear you were able to increase the speed. My comment aimed to point out that your mainboard might be giving your 1080 only x8 of the x16 lanes in its slot if another PCI card occupies the second PCIe x16 slot (i.e. the board runs 1x16 or 2x8). You can check the available lanes with GPU-Z. All the best.
Joss Knight
on 16 Aug 2018
Can you please give some example code because this could be a bug, but equally, if you are still using nndata2gpu, it is just a misunderstanding about how to move data to the device.
If you are getting better efficiency increasing the minibatch size then it means that you have a relatively small and fast network so if you don't process enough data at once, the performance characteristics are dominated by the overheads of the MATLAB interpreter, file I/O and so forth. Note that as a general rule, if you increase the minibatch size you can increase the learning rate in proportion. This should give faster convergence.
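In trainingOptions terms, something like the following (the base values and scale factor are illustrative only, not recommendations):

```matlab
% Illustrative sketch: double the minibatch size and scale the
% learning rate in proportion.
baseMiniBatch = 128;
baseLearnRate = 0.001;
scale = 2;

options = trainingOptions('adam', ...
    'MiniBatchSize', scale * baseMiniBatch, ...
    'InitialLearnRate', scale * baseLearnRate, ...
    'ExecutionEnvironment', 'gpu', ...
    'Shuffle', 'every-epoch');
```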