Amit Goel, NVIDIA
Deep learning is transforming a diverse set of engineering and scientific domains including computer vision, video analytics, robotics, autonomous driving, and more. Deep learning can achieve state-of-the-art accuracy for many tasks considered algorithmically unsolvable using traditional machine learning. In this presentation, real-world examples are used to illustrate how deep learning is implemented in a diverse set of applications. Demonstrations illustrate how MATLAB® and NVIDIA® GPUs are enabling these innovations.
Topics covered include advances in AI computing at the edge through the NVIDIA Jetson platform, and the ability to automatically generate high-performance CUDA® code for NVIDIA GPUs from deep learning models in MATLAB.
So I'm Amit Goel. I'm the product manager for Intelligent Machines at NVIDIA. And you've heard a lot about deep learning and AI. I'll give you a somewhat broader overview of where things are going with AI, and specifically for autonomous machines, and the role NVIDIA is playing together with MATLAB to accelerate that.
So we are seeing a fundamental shift in the way we do computing today. The era of microprocessors is tapering off, saturating even though technology is advancing. But the new computing model of GPU computing is accelerating the data processing at rates that we have not seen before.
And combined with the advances in specifically deep learning, and the data coming from the mobile internet, and the processing power that is being provided by the GPUs, we are now able to solve problems that were previously considered unsolvable. And NVIDIA is at the heart and center of this whole AI revolution and the new computing model. We have products that range from data center all the way to the edge, and for both inferencing and training for AI.
So to give you an overview of some of the high-level problems that are being solved with AI which were previously—nobody even tried to solve them using traditional algorithms. Here are a few of them that I would like to highlight. The first one is a product from NVIDIA where we are using AI and deep learning to render an image which would be done using ray tracing in the past. And as you—if you have worked with ray tracing, you know that rendering an image with—every pixel with ray tracing, trying to identify the trajectory of each photon of light takes a long time. Using deep learning, you can now render only a few pixels and let the deep learning algorithm fill in the rest of the pixels. The second one is the inverse of what we call lip reading. So you give it the text, and now using a deep learning algorithm which has been trained on videos, it's able to generate motion—your facial expressions that you would need to articulate that particular text.
So that is something that's really important for things like game engines, where you have to have a lot of, you know, text, and the characters animating—speaking things. Third one is brought from a partner called wrnch, where using—it's been trained on 2D videos, and it's able to now estimate the human pose in 3D world. And you can imagine how important this is for VR and robotics, where you can interact with the robot in the 2D world, and it can estimate your pose and teach the robot.
The fourth one is a project from the University of Edinburgh, where they taught this character to animate itself in order to navigate obstacles. So no matter what obstacles you put on the spot, it's been trained on videos, and it knows if it is a very narrow edge, it needs to tiptoe—if it's a high block, it needs to raise its leg up. And all of that has completely been done using deep learning training.
And the last one is an example of a robot where in Berkeley, the professors went in virtual reality and taught the robot how to stack up cubes. And with imitation learning, with very little amount of training, they were able to train this robot to now stack up cubes mimicking what the human did. So all of these are extremely, extremely hard problems that could not have been solved if you were not using AI and deep learning.
And specifically focusing on intelligent machines, there is a huge, huge opportunity that lies ahead of us for using AI and deep learning, and I'll highlight a few of them here. Factory automation—only 10% of our jobs in factory automation today are done by robots. Imagine if we can create collaborative robots that understand the world and that understand the tasks as well as we do.
We can tremendously increase this contribution and reach productivity levels that we have not seen before. Last-mile delivery, supporting humans in old-age homes, agriculture inspection, and medical operations are some of the areas where today, there is huge opportunity for AI and deep learning to come in and add a lot of value. So in order to solve these problems, you need a solution that has the power, the performance, and a form factor that you can deploy on the edge.
So for that we created what we call Jetson TX2, which is a credit card-sized module, and it has all of these things that you would need for your edge device. It has more computational capability than two Core i7s, and it can do that in less than 10 watts. And as you can see, its form factor is that of a credit card.
So it really unleashes a lot of opportunities for what you can do with your autonomous machines on the edge. This module comes with the integrated GPU and CPU. It has memory, connectivity, hardware accelerators for video encode and decode, camera, and most importantly the form factor, which is 50 by 87 millimeters, the size of a credit card.
So we have—you heard about self-driving cars, but there is a lot of adoption that we are seeing for AI and deep learning in various aspects of the industries, whether it's manufacturing, agriculture, construction, inventory management, social, delivery, security. All these are applications that are seeing a tremendous value using AI and deep learning. And to show that, let me play the next video here.
So as you saw, Jetson is being deployed for a lot of different applications for autonomous machines. And all the way from Fortune 500 companies to startups, they're all able to use this platform. And not just for industries. Jetson is also allowing academics and research to tap into these applications.
And here are some examples of recent events where the researchers and students have used Jetsons. In the Amazon Robotics Challenge, which targets warehouse automation, pick-and-place specifically, every team in the top 10 was using GPU-accelerated deep learning for their solution, and of course, the winner was also using that. And all the way down to high school, in the FIRST Robotics Competition, there are teams that are using the Jetson boards on their robots in order to achieve the goals of the competition.
Toyota has created this human support robot, which it is providing to different universities for research in order to solve the problem of a growing elderly population that needs support from robots. And in RoboCup, there were several teams that used humanoid robots, and this year, they changed the ball, and people who were using traditional solutions could not really figure that out. But people who were using deep learning were able to quickly train their algorithms and deploy them on their robots.
So all the way from Fortune 500 companies to high school students, everyone is able to make use of this platform. And what makes that possible is our software stack. Starting with the GPU accelerated Jetson platform at the bottom, we have built up this extensive stack which can support all these various applications that you need for autonomous machines.
It runs the same CUDA architecture, and CUDA and deep learning libraries that you can run on your discrete GPU. The same libraries run on the small form factor Jetson board. And we have libraries for deep learning, computer vision, graphics, and media.
On top of that, we have built multimedia APIs and tools for development. So this whole stack is available, and it's open source. We make it available for everybody to use, and that is what enables companies, big or small, and researchers to tap into the potential of this platform and develop solutions.
And today, we're going to talk about how you can use all of this right from within MATLAB. Thanks to the GPU Coder project, which directly communicates with the entire software stack that we've built on top of Jetson, you can access all of it from within MATLAB. And to talk more about that, I'll invite Avi here.
Thanks, Amit. So the question you probably have in the audience is, how do we target this amazing hardware that Amit talked about? And with that, I'd like to introduce a new product that was just out with our 17b release in September that generates CUDA code to be used on an NVIDIA GPU. And the way it works is you take your MATLAB algorithm, you use GPU Coder, the new product, and that gives you CUDA code.
Not only does it give you CUDA code, but it also gives you the glue C and C++ code to stitch it together. And that is deployed in parallel form as different CUDA kernels onto your GPU cores, as well as C and C++ code that actually runs on the ARM Cortex part on these SoCs. So why use GPU Coder? Well, if you're doing deep learning, GPU Coder's performance is seven times faster than state-of-the-art libraries like TensorFlow.
If you're doing computer vision algorithms, traditional computer vision, we see speedups of up to 700x from just using your regular MATLAB or C code. And for signal processing applications, we see up to a 20x speedup. So, in a word, why use GPU Coder? Performance.
And how fast is GPU Coder? So we did some internal benchmarking, where we benchmarked a bunch of standard computer vision algorithms, things like SURF feature extractions, stereo disparity, et cetera. And we found orders of magnitude speed up over optimized C code.
So what's the workflow to get all of this to work? So you start with a MATLAB algorithm. This is usually a functional reference. This is your golden reference. You design your algorithm with the use of MATLAB. I showed you a lot of that in the previous talk.
Now once you want to deploy or test your deployment, you can always test your deployment on a desktop GPU. In this case, it's a Tesla generation GPU core. You could then use GPU Coder to actually create the standalone CUDA code, but then actually test it in MATLAB by calling it as a MEX, and I'll show you a demo of this in a second.
You can then actually separate the CUDA code from MATLAB, integrate it in a C++ application, and do deployment integration test. And when you're happy with that, you can actually take that code and deploy it onto a Jetson platform, and actually run it on the GPU. So what does that workflow look like?
So given that deep learning is really driving the interest behind GPUs, I'm going to show you the Hello World of deep learning applications, and that is doing a 1,000-class image classification using a network called AlexNet, which is a popular research network that was created in about 2012. So let me play that video.
So you can see we're taking a snapshot from a camera. We're resizing it in a way that fits—that will work with AlexNet. And we are benchmarking how fast it's running. So right now, this is running on a desktop GPU. You can see it's running at about 200 FPS, which is pretty damn fast.
But if we want to generate code, we go to the Apps Gallery, we select the GPU Coder app. We tell it which function we want to generate code from—so that's the AlexNet predict function. Now since MATLAB is loosely typed, we actually have to tell it the input size because C++ and CUDA are not loosely typed.
So we say, okay, it's a single, it's a 227 by 227, and it's an RGB image, so we have three channels. And when that's done, we hit Next. Hit Next again. Ask it to generate a MEX file so we can test it within MATLAB. It will say it generated C++ code. And when that's done, you get this nice report where you can actually trace your generated code back to your MATLAB code. And you can see you have a mixture of C code and then calls to CUDA, so you can see the .cu files that contain the generated CUDA code.
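The input specification described here (single precision, 227 by 227, three channels) can be made concrete with a small sketch. The Python below is illustrative only and is not part of MATLAB or GPU Coder; the names `INPUT_SHAPE`, `input_num_bytes`, and `matches_spec` are hypothetical, and "single" is assumed to mean a 4-byte float. It shows the kind of fixed shape and element size that a statically typed target needs to know ahead of time.

```python
from math import prod

# Input specification quoted in the talk: 227 x 227 pixels, 3 channels (RGB),
# single precision. Assumption: "single" is a 4-byte IEEE 754 float.
INPUT_SHAPE = (227, 227, 3)
BYTES_PER_SINGLE = 4


def input_num_elements(shape=INPUT_SHAPE):
    """Number of values in one input image of the given shape."""
    return prod(shape)


def input_num_bytes(shape=INPUT_SHAPE, bytes_per_element=BYTES_PER_SINGLE):
    """Size in bytes of one input buffer with a fixed shape and element size."""
    return input_num_elements(shape) * bytes_per_element


def matches_spec(image, shape=INPUT_SHAPE):
    """Check that a nested list image[row][col][channel] has exactly `shape`."""
    rows, cols, channels = shape
    if len(image) != rows:
        return False
    for row in image:
        if len(row) != cols:
            return False
        if any(len(pixel) != channels for pixel in row):
            return False
    return True


if __name__ == "__main__":
    blank = [[[0.0, 0.0, 0.0] for _ in range(227)] for _ in range(227)]
    print(input_num_elements())  # 154587
    print(input_num_bytes())     # 618348
    print(matches_spec(blank))   # True
```

Because the shape and type are fixed up front, the buffer size for one image is a compile-time constant rather than something discovered at run time.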
Now all of this follows the same code generation workflow as MATLAB Coder, if you've used that. GPU Coder is built on top of MATLAB Coder, which generates C and C++ code. And you can see there is the MEX file that we just generated.
Now let's see how fast that runs. So we swap out the predict function for the MEX prediction function, and let's run the entire script. And you see now it's significantly faster. I think it hits about 500 frames per second at its peak, which is obviously about twice the speed that we had just running on the desktop GPU when connected to MATLAB.
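To put the two frame rates side by side, here is a minimal arithmetic sketch in Python. It uses only the approximate figures quoted in the demo (about 200 FPS before, about 500 FPS at peak after); the function names are hypothetical and this is not a measurement.

```python
def frame_time_ms(fps):
    """Average time per frame in milliseconds for a given frames-per-second rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def speedup(baseline_fps, new_fps):
    """Ratio of the new frame rate to the baseline frame rate."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return new_fps / baseline_fps


if __name__ == "__main__":
    matlab_fps = 200.0  # approximate rate quoted when running from MATLAB
    mex_fps = 500.0     # approximate peak rate quoted for the generated MEX

    print(frame_time_ms(matlab_fps))     # 5.0 ms per frame
    print(frame_time_ms(mex_fps))        # 2.0 ms per frame
    print(speedup(matlab_fps, mex_fps))  # 2.5
```

With the peak figure the ratio comes out at 2.5; since 500 FPS is a peak rather than a sustained rate, the average speedup would be somewhat lower.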
Now the previous example I showed you was all running on the desktop GPU, but what do you have to do if you wanted to target the Jetson that Amit talked about? So you have to make two really small changes. You need to change your build type from MEX to a static library, and you cross compile using the Jetson tool chain. Those are all the optimized libraries that Amit mentioned.
And when you're done with that, you can see that actually is the workflow to deploy it onto the Jetson. And you can see this is running at 30 FPS on the low-power embedded Jetson board. And again, it's the same AlexNet network that's running on the Jetson board.
So I know early on, I made some broad claims that you use GPU Coder and you use MATLAB for performance, so let me actually show you some benchmarks. So just to orient what you're looking at. What you're looking at on the x-axis is the batch size. So that's the number of images that we're passing through the deep network.
On the y-axis, that's the throughput, that's the frames per second, that's the number of images the network is able to process per second. Now if you look closely, there's a couple of things I want to point out. Firstly, just the MATLAB desktop which is the green bar there runs significantly faster, especially at the higher batch sizes, than other libraries like Caffe and TensorFlow.
The compiled code from GPU Coder, however, is at any batch size significantly faster than any of the other frameworks out there. And we got the same kind of performance boost. Now because we were running on the embedded board, we were not able to run all of those other libraries on the Jetson.
However, the C++ version of Caffe has a fairly small footprint—we were able to run it on the Jetson. And if you look at the benchmark, the performance is close at a batch size of 1, but as the batch size increases as you're passing more images through the network, GPU Coder is significantly faster than even C++ Caffe.
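The quantity plotted in these benchmarks, images processed per second at a given batch size, can be sketched as follows. This Python harness is a hedged illustration only: `dummy_predict` is a hypothetical stand-in for a real network, and the warm-up count and number of batches are arbitrary assumptions, not the settings used for the results in the talk.

```python
import time


def throughput_fps(batch_size, num_batches, elapsed_seconds):
    """Images per second: total images processed divided by elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return batch_size * num_batches / elapsed_seconds


def measure_throughput(predict, batch, num_batches=20, warmup=2):
    """Time `predict` on `batch` repeatedly and return images per second."""
    # Warm-up calls are excluded from timing so one-time setup cost is not counted.
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(num_batches):
        predict(batch)
    elapsed = time.perf_counter() - start
    return throughput_fps(len(batch), num_batches, elapsed)


def dummy_predict(batch):
    """Hypothetical stand-in for a network: sums each flattened 'image'."""
    return [sum(image) for image in batch]


if __name__ == "__main__":
    for batch_size in (1, 4, 16):
        batch = [[1.0] * 1000 for _ in range(batch_size)]
        fps = measure_throughput(dummy_predict, batch)
        print(batch_size, round(fps))
```

Sweeping the batch size and plotting the result against it gives the same kind of chart described above, with batch size on the x-axis and throughput on the y-axis.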
It is important to note that all—when we do the benchmarking, they're all calling against the same libraries. They're calling into the same versions of cuDNN, et cetera, on the board. So you might be skeptical, and you might ask, why is it faster than these open source deep learning frameworks that have many developers working on it?
So there's a few reasons for this. Firstly, the open source deep learning frameworks, they do many different things. They do training, they have support for visualizations, different data types, et cetera. They also have the overhead of Python. If you're using, say, TensorFlow, you have that Python overhead that's also running.
The generated code from MATLAB Coder or GPU Coder in this instance is just deploying the math operations for that specific deep neural network with very specific data types. So this really reduces the overhead and increases the efficiency, which is what gives us the performance that we get. Another reason we get the performance that we get is we link into all the most efficient libraries, not just the software stack that Amit talked about. In addition to cuDNN, which is used for deep learning, we also link into optimized libraries for FFTs, solvers, BLAS for math operations, et cetera.
Now GPU Coder is a new product, so if you haven't used it before, there is a ton of examples to help you learn how to use GPU Coder to do the benchmarking that we just showed you, integrating into Simulink, and performing many different image processing and computer vision applications. And lastly, before we open this up to questions, what we'd like you to take away is, in 17b with the release of GPU Coder, with the work that we've done with Amit's team, you can very easily target the Jetson TX2 from MATLAB. And that, for embedded deployment, will give you the best-in-class performance for deep learning.
Now we do have some demos at the deep learning booth. If you want to have more discussions on how you can incorporate this workflow, and how you can convert your MATLAB code to CUDA code, we'd be happy to answer any questions. The folks at NVIDIA have also agreed to give anybody who signs up a 50% discount on the Jetson TX2. So if you see somebody at the booth, just tell them that you are looking for the discount code for the TX2, and we'll send that to you afterwards. And that's our last line. I think we have a little time for questions.