Kishen Mahadevan, MathWorks
Learn how to use GPU Coder™ with Simulink® to design, verify, and deploy your deep learning application onto a NVIDIA Jetson® board. See how to use advanced signal processing and deep neural networks to classify ECG signals. Start the workflow by running a simulation on the desktop CPU before switching to the desktop GPU for acceleration. Once the simulation is complete and the results meet your requirements, use software-in-the-loop testing on the desktop GPU and processor-in-the-loop testing on the Jetson board to numerically verify that the generated CUDA® code results match the Simulink simulation. Finally, deploy the complete application as optimized CUDA code to the Jetson board.
In this demo, we use deep learning to classify ECG data to determine if a person is suffering from cardiac arrhythmia or congestive heart failure. We performed this classification using a pretrained deep learning network and Simulink, and then generate and deploy the complete application to a Jetson board.
The workflow for this example has four parts. First, we'll run desktop simulation on a CPU, then accelerate it using a desktop and media GPU. Then we'll verify the generated code by running software-in-the-loop testing on desktop GPU. We'll go one step further to verify the generated target code by performing processor-in-the-loop testing on the Jetson board. Finally, we take a complete application, generate CUDA code and run it on the Jetson AGX Xavier.
Let's have a quick look at the Simulink model. There are multiple subsystems, so starting from the left, the input data is ECG data obtained from patients with cardiac arrhythmia, congestive heart failure, and normal sinus rhythm. The ECG preprocessing subsystem performs continuous wavelet transform on the input data and outputs images and the time frequency representation. These images are then processed and input to the pretrained network to classify the ECG waveforms.
To use deep learning networks and Simulink, we can use blocks from the deep learning library or the enhanced MATLAB function block. For this example, we use the image classifier block to represent the pretrained SqueezeNet. ECG post-processing subsystem then finds the label for the ECG signal based on the prediction score from the SqueezeNet and outputs the scalogram image with label and confidence score printed on it.
Let's go ahead and start the simulation. We are currently running the simulation on the CPU, and we can see that SqueezeNet is able to classify the data and show that confidence scores. Execution time is about 9.2 seconds. We can use our desktop and video GPU to accelerate the simulation. In configuration settings on the simulation target, select the GPU acceleration checkbox and set the deep learning target library the cuDNN. With these settings, Simulink will identify and run parts of the Simulink model, including the deep learning network on the GPU.
Let's go ahead and run the simulation. We see the same results as before, but with a nice boost in speed. Execution time using the GPU as about 2.6 seconds, which is about three times faster than using just the CPU. Now that we're happy with simulation results, let's move to deploying the algorithm on hardware.
One of the first steps we can do is to numerically verify that the behavior of the generated code matches that of the simulation. To do this, we will run software-in-the-loop testing. Let's first group the preprocessing, pretrained deep learning network, and post-processing parts into a single subsystem. This will make running SIL easier.
To perform software-in-the-loop testing, go to the hardware implementation tab and configuration settings, set hardware board to none, and set the device details to match your host machine. In our case, the host machine has a 64-bit Intel processor. To create a cell block that contains the generated CUDA code, go to the generation settings and enable the generate GPU code checkbox. Then navigate to the verification tab under code generation, and under advanced parameters, set to create block option to SIL.
Let's go ahead and generate CUDA code for the subsystem that contains the pretrained network. The build process is successful, and a SIL block is created during simulation. The SIL block runs the generated CUDA code as a separate process on the Intel processor and desktop GPU. Now let's
incorporate this SIL block with a previous Simulink model with an NVIDIA subsystem for ease of comparison with desktop simulation. On simulating the Simulink model with the SIL setup, we see that classification results match that of the desktop simulation.
With the generated code on the host CPU and GPU verified, let's move to the next step, where we verify the generator target code on an NVIDIA Jetson board. To run processor-on-the-loop testing, go to the hardware implementation tab and configuration settings and set the hardware board to NVIDIA Jetson. From there, under target hardware resources, specify board parameters, like device address, username, and password so Simulink can access it. To create a PIL block that contains the generated target CUDA code, navigate to the verification tab under code generation settings. Under advanced parameters, set the create block dropdown to PIL.
With these settings configured, let's build and deploy to the Jetson board. We can see that the deployment process is successful and a PIL block was created. During simulation, PIL block downloads and runs the object code on target hardware. Now let's incorporate this PIL block with the previous Simulink model. On simulating the Simulink model with the PIL block, we see that the complete application is running on Jetson, and the results match with that of desktop simulation and SIL mode.
Now that we've verified the generated code on the host and target, we can only retarget the complete application to NVIDIA Jetson board for standalone execution. When generating code, we can set GPU CUDA to use NVIDIA's optimization libraries for the best performance. With deep learning networks, both cuDNN and TensorRT are supported. For the rest of the algorithm, cuBLAS, cuFFT, cuSOLVER, and other CUDA libraries are automatically used.
Let's stick with the default settings for our example. With the hardware board set to NVIDIA Jetson and the build configuration set to build and run, let's go ahead and click on Build, Deploy, and Start button. Here, you can see the output from the diagnostic viewer showing us that we have compiled the application and it's running on the Jetson Xavier. The SDL video display here streams the output from the Jetson board. And we can see that the displayed results match the simulation results.
So quick recap of the workflow we followed with the classification of ECG signals example. We started with desktop simulation on a CPU, and showcased how we can accelerate it using desktop GPU. We then verified the generated network code on our desktop CPU and GPU by performing software-on-the-loop testing. We also ran processor-on-the-loop testing to verify the generated target code on our Jetson board. Finally, we deployed the complete classification application to the Jetson in standalone execution mode and observe the output.
You can perform a similar workflow to verify and deploy your deep learning application using GPU CUDA from Simulink. For more information, take a look at the links below.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .Select web site
You can also select a web site from the following list:
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.