Train Network on Amazon Web Services Using MATLAB Deep Learning Container
This example shows how to train a deep learning network in the cloud using MATLAB® on an Amazon EC2® instance.
This workflow helps you speed up your deep learning applications by training neural networks in the MATLAB Deep Learning Container on the cloud. Using MATLAB in the cloud allows you to choose machines where you can take full advantage of high-performance NVIDIA® GPUs. You can access the MATLAB Deep Learning Container remotely using a web browser or a VNC connection. Then you can run MATLAB desktop in the cloud on an Amazon EC2 GPU-enabled instance to benefit from the computing resources available.
To start training a deep learning model on AWS® using the MATLAB Deep Learning Container, you must first set up the container on a GPU-enabled Amazon EC2 instance and connect to it. For step-by-step instructions for this workflow, see MATLAB Deep Learning Container on NVIDIA GPU Cloud for Amazon Web Services.
To learn more and see screenshots of the same workflow, see the blog post https://blogs.mathworks.com/deep-learning/2021/05/03/ai-with-matlab-ngc/.
Semantic Segmentation in the Cloud
To demonstrate the compute capability available in the cloud, results are shown for a semantic segmentation network trained using the MATLAB Deep Learning Container cloud workflow. On AWS, the training was verified on a GPU-enabled p3.2xlarge EC2 instance using an NVIDIA Tesla® V100 SXM2 GPU with 16 GB of GPU memory. Training took around 70 minutes to meet the validation criterion, as shown in the training progress plot. To learn more about the semantic segmentation network example, see Semantic Segmentation Using Deep Learning.
Note that to train the semantic segmentation network using the Live Script example, change the doTraining flag to true.
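To illustrate the single-GPU training configuration, the following is a minimal sketch of the training options. The option values here are illustrative assumptions, not the exact settings from the Semantic Segmentation Using Deep Learning example.

```matlab
% Minimal sketch of single-GPU training options (values are illustrative).
% "gpu" is the default execution environment when a supported GPU is available.
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="gpu", ...   % train on the instance's GPU
    InitialLearnRate=1e-3, ...
    MiniBatchSize=4, ...
    MaxEpochs=30, ...
    Plots="training-progress");       % show the training progress plot

% Train the network (trainingData and lgraph come from the example):
% net = trainNetwork(trainingData,lgraph,options);
```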
Semantic Segmentation in the Cloud with Multiple GPUs
Train the network on a machine with multiple GPUs to improve performance.
When you train with multiple GPUs, each image batch is distributed between the GPUs. Distribution between GPUs effectively increases the total GPU memory available, allowing larger batch sizes. A recommended practice is to scale up the mini-batch size linearly with the number of GPUs to keep the workload on each GPU constant. Because increasing the mini-batch size improves the significance of each iteration, also increase the initial learning rate by an equivalent factor.
For example, to run this training on a machine with 4 GPUs:
In the semantic segmentation example, set the ExecutionEnvironment training option to "multi-gpu".
Increase the mini-batch size by a factor of 4 to match the number of GPUs.
Increase the initial learning rate by a factor of 4 to match the number of GPUs.
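The adjustments above can be sketched as follows. The base mini-batch size and learning rate are illustrative assumptions; only the scaling by the number of GPUs follows the text.

```matlab
% Sketch of the multi-GPU training options for a machine with 4 GPUs.
% Scale the mini-batch size and initial learning rate linearly with the
% number of GPUs to keep the per-GPU workload constant.
numGPUs = 4;
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="multi-gpu", ...  % distribute each batch across GPUs
    MiniBatchSize=4*numGPUs, ...           % base size of 4 is an assumption
    InitialLearnRate=1e-3*numGPUs, ...     % base rate of 1e-3 is an assumption
    Plots="training-progress");
```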
The following training progress plot shows the improvement in performance when you use multiple GPUs. The results show the semantic segmentation network trained on 4 NVIDIA Titan Xp GPUs with 12 GB of GPU memory. The example used the "multi-gpu" training option with the mini-batch size and initial learning rate scaled by a factor of 4. This network trained for 20 epochs in around 20 minutes.
As shown in the following plot, using 4 GPUs and adjusting the training options as described above results in a network that has the same validation accuracy but trains 3.5x faster.