MATLAB Answers

GPU and Memory Utilization in Deep Learning

Alexandre Kozlov on 29 Sep 2020 at 22:51
Edited: Alexandre Kozlov on 6 Oct 2020 at 8:06
I'm trying to run a large NN training in MATLAB. First, even if I set the 'ExecutionEnvironment' training option to 'auto', it somehow chooses only one GPU. Even then, it gives me a warning:
Warning: GPU is low on memory, which can slow performance due to additional data transfers with main memory. Try reducing the 'MiniBatchSize' training option. This warning will not appear again unless you run the command: warning('on','nnet_cnn:warning:GPULowOnMemory').
Plenty of memory is available in both OS RAM and the other GPUs' RAM (I am running on a 4-GPU DGX Station). In addition, even if I force 'multi-gpu' training with the above option, both memory and GPU utilization are very low. Also, as I understand it, that option requires Parallel Computing Toolbox, which makes execution even slower due to initialization times (I found that running multiple MATLAB sessions on different GPUs with CUDA_VISIBLE_DEVICES is faster by a large factor). The situation is even more dire on the 8x or 16x GPU DGX NVIDIA products.
Running the code in TensorFlow or PyTorch with MPI is way faster.
Looking at GPU utilization as measured by nvtop, the GPUs do not seem to be busy (on an 8-way DGX-1 running the workload). Memory utilization seems reasonable (< 50%, except on GPU #0). In practice, even though I do get a ~2x speedup on 2-4 GPUs, when I get to 8 things drop back to the speed of 1 GPU, which is about ~6 seconds per image in my case on a NN close to InceptionV3.
Any hints on how to improve performance?

  2 Comments

Joss Knight on 30 Sep 2020 at 8:40
Hi Alexandre. Your experience with the DGX Station is contrary to mine: I found it was able to scale up training performance by nearly 4x when training a model with 'multi-gpu' selected. So I guess we need to dig in and figure out what's different about your model and setup.
First of all, 'auto' doesn't automatically choose 'multi-gpu'; you have to select it explicitly.
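Something like this (a minimal sketch only; the 'sgdm' solver, the batch size, and the imdsTrain/layers variables are placeholders, not taken from your setup):

opts = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'multi-gpu', ...   % explicitly use all local GPUs
    'MiniBatchSize', 64);                      % illustrative value
net = trainNetwork(imdsTrain, layers, opts);   % imdsTrain and layers are placeholders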
Secondly, MATLAB's multi-GPU model involves multiple processes, one per GPU. Each process has access to a single GPU's memory (except where MPI is used to combine gradients), which in this case is the memory of one V100, not all 4 of them. In deep learning workflows it's easy to use up tens of gigabytes of memory. For instance, if you set your MiniBatchSize training option very high you may not even be able to store a single batch of data in memory after loading it from disk. What seems to be happening here, though, is that you don't have enough memory to hold all the activations from every layer during training (which are needed for backpropagation), so that data has to be paged back to system memory. That probably means you have a very deep model. Again, you can reduce the memory footprint by reducing the MiniBatchSize; this often speeds things up, because memory throughput can improve considerably even with smaller batches.
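To get a feel for the headroom, you can query the current GPU and step the batch size down (a rough sketch; the reduced MiniBatchSize value is just illustrative):

g = gpuDevice;   % current GPU; properties include Name, AvailableMemory and TotalMemory (bytes)
fprintf('%s: %.1f of %.1f GB free\n', g.Name, g.AvailableMemory/1e9, g.TotalMemory/1e9);

opts = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'multi-gpu', ...
    'MiniBatchSize', 32);   % e.g. halve it until the low-memory warning disappears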
You'll have to explain more about what you mean by initialization times. Do you mean when MATLAB is starting training and computing the input normalization? This is often slow because the entire dataset needs to be processed. You can avoid it by computing the normalization offline and using a transformed datastore to apply it, setting the imageInputLayer's Normalization parameter to 'none' instead. Still, normalization should certainly be faster in 'multi-gpu' mode, with 4 processes each handling a quarter of the data and then combining the results over MPI.
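Roughly like this (a sketch only; meanIm is a placeholder for a mean image you would compute offline, and the 299x299x3 input size is just the InceptionV3-style default):

tdsTrain = transform(imdsTrain, @(im) single(im) - meanIm);        % apply precomputed normalization in the datastore
inLayer = imageInputLayer([299 299 3], 'Normalization', 'none');   % so it isn't recomputed at training time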
Alexandre Kozlov on 6 Oct 2020 at 2:10
Just added a few screenshots. I get ~6 seconds/iteration on a single GPU and come back to approximately the same number when I try using an 8-GPU DGX-1. The network is close to InceptionV3.


Answers (0)
