MATLAB Answers

Cannot fully utilize all GPUs during network training

Tomer Nahshon on 22 Dec 2019
My question is about the performance and utilization of the resources installed on the machine (an Amazon EC2 cloud instance in our case).
I am using a p3.8xlarge instance on Amazon Web Services (EC2), which means I have 4 NVIDIA V100 GPUs.
I am training a neural network using:
mdl(i).Net = trainNetwork(trainData(:, :, :, 1:itStep:end), trainLabels(1:itStep:end, :), layers, options);
In options I set the execution environment to 'multi-gpu'.
I also tried 'parallel' and played with the number of workers, but all I see is more processes waiting in the GPU queue in nvidia-smi.
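For reference, the relevant part of my options looks roughly like this (a sketch with placeholder values, not my exact run):
% Sketch of the training options I am using (placeholder values).
options = trainingOptions('adam', ...
    'ExecutionEnvironment','multi-gpu', ...   % or 'parallel' with a pool of workers
    'MiniBatchSize',256, ...                  % placeholder; limited by GPU memory
    'MaxEpochs',30, ...
    'Shuffle','every-epoch', ...
    'Verbose',true);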
For some reason all GPUs are active (see GPU.png), but only for short bursts: very high usage for about 3 seconds, then 0% for at least 10 seconds.
Looking at the htop output (htop.jpg), not all CPU threads are in use, so I suspect that is the bottleneck.
The instance has a Xeon processor with 16 physical cores (32 logical threads).
When I try to utilize all threads through the local cluster profile (profile local pool.png), it still doesn't seem to help.
I do get more (CPU) workers that way, but GPU utilization still doesn't improve.
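What I did with the local profile was roughly this (a sketch; the worker count just matches the 32 logical threads on this instance):
% Sketch: raise the number of local workers, then train with 'parallel'.
c = parcluster('local');
c.NumWorkers = 32;      % one worker per logical core on this instance
saveProfile(c);
pool = parpool(c, 32);  % pool used when 'ExecutionEnvironment' is 'parallel'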
I tried increasing the batch size, but at some point the GPU runs out of memory, so that is not the solution.
How do I utilize all CPU cores to feed data to the GPUs?
I read somewhere that you can also load the data onto the pool itself. Would that help?
I use the https://ngc.nvidia.com/catalog/containers/partners:matlab/tagsmatlab container for matlab:r2019a
I scanned these already:
Would appreciate your help.
Tomer

  6 Comments

Tomer Nahshon on 23 Jan 2020 at 8:27
Hey Joss,
I found a thread that helped me.
I used the exact solution from it (with mild changes for the dimensions of my .mat files).
Now I can use more data without worrying about my RAM.
I used the following function, since I have a regression problem just as described in that thread.
function CombinedFileDatastore = CreateCombinedFDS(Path)
    % Build file datastores for the inputs and for the labels.
    inputData  = fileDatastore(Path, 'ReadFcn', @load, 'FileExtensions', '.mat');
    targetData = fileDatastore([Path, '/Labels'], 'ReadFcn', @load, 'FileExtensions', '.mat');

    % Unwrap the loaded struct into the cell format trainNetwork expects.
    inputDatat  = transform(inputData,  @(data) rearrange_datastore(data));
    targetDatat = transform(targetData, @(data) rearrange_datastore(data));

    CombinedFileDatastore = combine(inputDatat, targetDatat);
end

function image = rearrange_datastore(data)
    image = data.temp;   % 'temp' is the variable name saved inside each .mat file
    image = {image};
end
Here Path holds all my input "*.mat" files and Path/Labels holds all my label .mat files.
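I then pass the combined datastore straight to trainNetwork, roughly like this (path is a placeholder):
% Sketch of how I call it (placeholder path).
ds = CreateCombinedFDS('/data/train');
net = trainNetwork(ds, layers, options);   % datastore supplies {input, response} pairs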
1) For some reason this is not a parallelized datastore. Why? I thought fileDatastore objects were parallelized.
2) My GPU is still starved. Is that related to the fact that I am using the 'ReadFcn' argument? How can I solve this so all available threads are used, and not just the main MATLAB thread?
I sent this to MATLAB Support as well.
I attach a short profileinfo.zip with my profiler output, as suggested by MATLAB Support.
Thank you,
Tomer
Joss Knight about 23 hours ago
fileDatastore is partitionable, but a combined datastore isn't in R2019b. Does the DispatchInBackground training option work for your example? If not, we may be talking about using a custom datastore so you can do the file loading and the transform on a parallel pool.
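That is, something along these lines in your training options (a sketch with placeholder values, not your exact setup):
% Sketch: enabling background dispatch in the training options.
options = trainingOptions('adam', ...
    'ExecutionEnvironment','multi-gpu', ...
    'DispatchInBackground',true, ...   % needs a partitionable datastore
    'MiniBatchSize',256);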
Tomer Nahshon about 1 hour ago
When I try to change the DispatchInBackground property on the options struct, I get an error that DispatchInBackground is a read-only property for the Adam solver (I tried switching to SGDM as well; that didn't work either).
I also came across this, which says:
There are some limitations when using datastores with parallel training, multi-GPU training, and background dispatching:
  • Datastores do not support specifying the 'Shuffle' name-value pair argument of trainingOptions as 'none'.
  • Combined datastores are not partitionable and therefore do not support parallel training, multi-GPU training, or background dispatching.
Is there a way to solve my (custom) image regression problem in a more elegant way?
Like in that thread, just with a parallelized datastore?
Or do I need to write a custom datastore, as you suggested? (A rough skeleton of what I understand that to mean is below.)
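From the custom datastore documentation, I understand the skeleton would look roughly like this (untested sketch; the class name, the property layout, and the assumption that each .mat file stores a variable called 'temp' are my own guesses):
% Untested sketch of a partitionable custom datastore for my .mat regression data.
classdef MatRegressionDatastore < matlab.io.Datastore & ...
                                  matlab.io.datastore.Partitionable
    properties (Access = private)
        InputFiles   % cell array of input .mat paths
        LabelFiles   % cell array of label .mat paths
        CurrentIndex % index of the next file pair to read
    end

    methods
        function ds = MatRegressionDatastore(inputFiles, labelFiles)
            ds.InputFiles = inputFiles;
            ds.LabelFiles = labelFiles;
            reset(ds);
        end

        function tf = hasdata(ds)
            tf = ds.CurrentIndex <= numel(ds.InputFiles);
        end

        function [data, info] = read(ds)
            % Load one input/label pair; trainNetwork expects each observation
            % as a row with predictors first and responses second.
            in  = load(ds.InputFiles{ds.CurrentIndex});
            lbl = load(ds.LabelFiles{ds.CurrentIndex});
            data = {in.temp, lbl.temp};   % 'temp' matches my .mat variable name
            info.FileName = ds.InputFiles{ds.CurrentIndex};
            ds.CurrentIndex = ds.CurrentIndex + 1;
        end

        function reset(ds)
            ds.CurrentIndex = 1;
        end

        function subds = partition(ds, n, index)
            % Give each worker its own slice of the file list.
            subds = copy(ds);
            subds.InputFiles = ds.InputFiles(index:n:end);
            subds.LabelFiles = ds.LabelFiles(index:n:end);
            reset(subds);
        end
    end

    methods (Hidden = true)
        function frac = progress(ds)
            frac = (ds.CurrentIndex - 1) / numel(ds.InputFiles);
        end
    end

    methods (Access = protected)
        function n = maxpartitions(ds)
            n = numel(ds.InputFiles);
        end
    end
end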
I also sent this correspondence to the MATLAB Support engineer currently handling my issue.
Thanks a lot Joss, you are very helpful,
Tomer


Answers (0)
