I am trying to run a large NN training job in MATLAB. First, even when I set the 'ExecutionEnvironment' training option to 'auto', it somehow picks only one GPU (a simplified sketch of my call is below the warning). Even then, it gives me this warning:
Warning: GPU is low on memory, which can slow performance due to additional data transfers with main memory. Try reducing the 'MiniBatchSize' training option. This warning will not appear again unless you run the command: warning('on','nnet_cnn:warning:GPULowOnMemory').
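For reference, here is roughly what I am calling (a simplified sketch: the solver, batch size, and names like imdsTrain and layers are placeholders, not my exact configuration):

    % Simplified training call; imdsTrain/layers stand in for my real data and network
    opts = trainingOptions('sgdm', ...
        'ExecutionEnvironment', 'auto', ...  % I expected this to use all 4 GPUs
        'MiniBatchSize', 64, ...             % the option the warning suggests reducing
        'MaxEpochs', 30);
    net = trainNetwork(imdsTrain, layers, opts);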
Plenty of memory is available in both OS RAM and the other GPUs' RAM (I am running on a 4-GPU DGX Station). Moreover, even when I force 'multi-gpu' training via the same option, both memory and GPU utilization stay very low. As I understand it, 'multi-gpu' also requires the Parallel Computing Toolbox, which makes execution even slower due to pool initialization times; I found running multiple MATLAB instances pinned to different GPUs with CUDA_VISIBLE_DEVICES to be a few times faster (see the sketch below). The situation is even more dire on NVIDIA's 8- and 16-GPU DGX products.
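Concretely, that multi-process workaround looks like this (a sketch, assuming Linux and R2019a or newer for matlab -batch; train_shard is a placeholder for a per-GPU training script):

    # Each MATLAB instance sees exactly one GPU via CUDA_VISIBLE_DEVICES
    CUDA_VISIBLE_DEVICES=0 matlab -batch "train_shard" &
    CUDA_VISIBLE_DEVICES=1 matlab -batch "train_shard" &
    CUDA_VISIBLE_DEVICES=2 matlab -batch "train_shard" &
    CUDA_VISIBLE_DEVICES=3 matlab -batch "train_shard" &
    wait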
Running the equivalent code in TensorFlow or PyTorch with MPI is far faster.
Looking at GPU utilization as measured by nvtop, the GPUs do not seem busy (this is on an 8-way DGX-1 running the workload). Memory utilization looks reasonable (< 50% on every GPU except GPU 0). In practice, even though I do get a ~2x speedup on 2-4 GPUs, at 8 GPUs things fall back to single-GPU speed, which for me is about 6 seconds per image on a network close to InceptionV3.
Any hints on how to improve performance?