# Improve Performance of Small Matrix Problems on the GPU using `pagefun`

This example shows how to use `pagefun` to improve the performance of applying a large number of independent rotations and translations to objects in a 3-D environment. This is typical of a range of problems which involve a large batch of calculations on small arrays.

GPUs are most effective when carrying out calculations on very large matrices. In MATLAB® this is usually achieved by vectorizing code to maximize the work done in each instruction. When you have a large data set but your calculations are divided into many small matrix operations, it can be challenging to maximize performance by running simultaneously on the many hundreds of GPU cores.

The `arrayfun` function allow scalar operations to be carried out in parallel on the GPU. The `pagefun` function adds the capability of carrying out matrix operations in batch in a similar way. The `pagefun` function is available in Parallel Computing Toolbox™ for use with gpuArrays.

In this example, a robot is navigating a known map containing a large number of features that the robot can identify using its sensors. The robot locates itself in the map by measuring the relative position and orientation of those objects and comparing them to the map locations. Assuming the robot is not completely lost, it can use any difference between the two to correct its position, for instance by using a Kalman Filter. We will focus on the first part of the algorithm.

### Set up the map

Let's create a map of objects with randomized positions and orientations in a large room.

```numObjects = 1000; roomDimensions = [50 50 5]; % Length * breadth * height in meters```

We represent positions and orientations using 3-by-1 vectors `T` and 3-by-3 rotation matrices `R`. When we have N of these transforms we pack the translations into a 3-by-N matrix, and the rotations into a 3-by-3-by-N array. The supporting function `randomTransforms` is provided at the end of this example and initializes N transforms with random values, providing a structure as output.

Use `randomTransforms` to set up the map of object transforms, and a start location for the robot.

```Map = randomTransforms(numObjects,roomDimensions); Robot = randomTransforms(1,roomDimensions);```

### Define the equations

To correctly identify map features the robot needs to transform the map to put its sensors at the origin. Then it can find map objects by comparing what it sees with what it expects to see.

For a map object $i$ we can find its position relative to the robot ${T}_{rel}\left(i\right)$ and orientation ${R}_{rel}\left(i\right)$ by transforming its global map location:

`$\begin{array}{l}{R}_{rel}\left(i\right)\phantom{\rule{0.2777777777777778em}{0ex}}=\phantom{\rule{0.2777777777777778em}{0ex}}{R}_{bot}^{\top }{R}_{map}\left(i\right)\\ {T}_{rel}\left(i\right)\phantom{\rule{0.2777777777777778em}{0ex}}=\phantom{\rule{0.2777777777777778em}{0ex}}{R}_{bot}^{\top }\left({T}_{map}\left(i\right)-{T}_{bot}\right)\end{array}$`

where ${T}_{bot}$ and ${R}_{bot}$ are the position and orientation of the robot, and ${T}_{map}\left(i\right)$ and ${R}_{map}\left(i\right)$ represent the map data. The equivalent MATLAB code looks like this:

```Rrel(:,:,i) = Rbot' * Rmap(:,:,i) Trel(:,i) = Rbot' * (Tmap(:,i) - Tbot) ```

### Perform many matrix transforms on the CPU using a `for` loop

We need to transform every map object to its location relative to the robot. The supporting function `loopingTransform` is provided at the end of this example and loops over all the transforms in turn. Note the `'like'` syntax for `zeros` which will allow us to use the same code on the GPU in the next section.

To time the calculation we use the `timeit` function, which will call `loopingTransform` multiple times to get an average timing. Since it requires a function with no arguments, we use the `@()` syntax to create an anonymous function of the right form.

```cpuTime = timeit(@()loopingTransform(Robot,Map,numObjects)); fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ... cpuTime, numObjects);```
```It takes 0.0047 seconds on the CPU to execute 1000 transforms. ```

### Try the same code on the GPU

To run this code on the GPU is merely a matter of copying the data into a `gpuArray`. When MATLAB encounters data stored on the GPU it will run any code using it on the GPU as long as it is supported.

```gMap.R = gpuArray(Map.R); gMap.T = gpuArray(Map.T); gRobot.R = gpuArray(Robot.R); gRobot.T = gpuArray(Robot.T);```

Now we call `gputimeit`, which is the equivalent of `timeit` for code that includes GPU computation. It makes sure all GPU operations have finished before recording the time.

`fprintf('Computing...\n');`
```Computing... ```
```gpuTime = gputimeit(@()loopingTransform(gRobot,gMap,numObjects)); fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ... gpuTime, numObjects);```
```It takes 0.2745 seconds on the GPU to execute 1000 transforms. ```
```fprintf(['Unvectorized GPU code is %3.2f times slower ',... 'than the CPU version.\n'], gpuTime/cpuTime);```
```Unvectorized GPU code is 58.19 times slower than the CPU version. ```

### Batch process using `pagefun`

The GPU version above was very slow because, although all calculations were independent, they ran in series. Using `pagefun` we can run all the computations in parallel. The supporting function `pagefunTransform` is provided at the end of this example and applies the same transforms as the `loopingTransform` function using `pagefun` instead of a `for`-loop.

```gpuPagefunTime = gputimeit(@()pagefunTransform(gRobot, gMap)); fprintf(['It takes %3.4f seconds on the GPU using pagefun ',... 'to execute %d transforms.\n'], gpuPagefunTime, numObjects);```
```It takes 0.0003 seconds on the GPU using pagefun to execute 1000 transforms. ```
```fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the CPU version.\n'], cpuTime/gpuPagefunTime);```
```Vectorized GPU code is 17.88 times faster than the CPU version. ```
```fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);```
```Vectorized GPU code is 1040.61 times faster than the unvectorized GPU version. ```

### Explanation

The first computation was the calculation of the rotations. This involved a matrix multiply, which translates to the function `mtimes` (`*`). We pass this to `pagefun` along with the two sets of rotations to be multiplied:

```Rel.R = pagefun(@mtimes, Robot.R', Map.R); ```

`Robot.R'` is a 3-by-3 matrix, and `Map.R` is a 3-by-3-by-N array. The `pagefun` function matches each independent matrix from the map to the same robot rotation, and gives us the required 3-by-3-by-N output.

The translation calculation also involves a matrix multiply, but the normal rules of matrix multiplication allow this to come outside the loop without any changes:

```Rel.T = Robot.R' * (Map.T - Robot.T); ```

### More advanced GPU vectorization - Solving a "lost robot" problem

If our robot was in an unknown part of the map, it might use a global search algorithm to locate itself. The algorithm would test a number of possible locations by carrying out the above computation and looking for good correspondence between the objects seen by the robot's sensors and what it would expect to see at that position.

Now we have got multiple robots as well as multiple objects. N objects and M robots should give us N*M transforms out. To distinguish 'robot space' from 'object space' we use the 4th dimension for rotations and the 3rd for translations. That means our robot rotations will be 3-by-3-by-1-by-M, and the translations will be 3-by-1-by-M.

We initialize our search with random robot locations. A good search algorithm would use topological or other clues to seed the search more intelligently.

```numRobots = 10; Robot = randomTransforms(numRobots,roomDimensions); Robot.R = reshape(Robot.R, 3, 3, 1, []); % Spread along the 4th dimension Robot.T = reshape(Robot.T, 3, 1, []); % Spread along the 3rd dimension gRobot.R = gpuArray(Robot.R); gRobot.T = gpuArray(Robot.T);```

A supporting function `loopingTransform2` is defined at the end of this example and performs a looping transform using two nested loops, to loop over the robots as well as over the objects.

```cpuTime = timeit(@()loopingTransform2(Robot,Map,numObjects,numRobots)); fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ... cpuTime, numObjects*numRobots);```
```It takes 0.0852 seconds on the CPU to execute 10000 transforms. ```

For our GPU timings we use `tic` and `toc` this time, because otherwise the calculation would take too long. This will be precise enough for our purposes. To ensure any cost associated with creating the output data is included, we are calling `loopingTransform2` with a single output variable, just as `timeit` and `gputimeit` do by default.

`fprintf('Computing...\n');`
```Computing... ```
```tic; gRel = loopingTransform2(gRobot,gMap,numObjects,numRobots); gpuTime = toc; fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ... gpuTime, numObjects*numRobots);```
```It takes 3.6775 seconds on the GPU to execute 10000 transforms. ```
```fprintf(['Unvectorized GPU code is %3.2f times slower ',... 'than the CPU version.\n'], gpuTime/cpuTime);```
```Unvectorized GPU code is 43.17 times slower than the CPU version. ```

As before, the looping version runs much slower on the GPU because it is not doing calculations in parallel.

The new `pagefun` version needs to incorporate the `transpose` operator as well as `mtimes` into a call to `pagefun`. We also need to `squeeze` the transposed robot orientations to put the spread over robots into the 3rd dimension, to match the translations. Despite this, the resulting code is considerably more compact. A supporting function `pagefunTransform2` is provided at the end of this example and applies the same transforms as the `loopingTransform2` function using two `pagefun` calls instead of nested `for`-loops.

Once again, `pagefun` expands dimensions appropriately. So where we multiply 3-by-3-by-1-by-M matrix `Rt` with 3-by-3-by-N-by-1 matrix `Map.R`, we get a 3-by-3-by-N-by-M matrix out.

```gpuPagefunTime = gputimeit(@()pagefunTransform2(gRobot,gMap)); fprintf(['It takes %3.4f seconds on the GPU using pagefun ',... 'to execute %d transforms.\n'], gpuPagefunTime, numObjects*numRobots);```
```It takes 0.0012 seconds on the GPU using pagefun to execute 10000 transforms. ```
```fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the CPU version.\n'], cpuTime/gpuPagefunTime);```
```Vectorized GPU code is 72.28 times faster than the CPU version. ```
```fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);```
```Vectorized GPU code is 3120.72 times faster than the unvectorized GPU version. ```

### Conclusion

The `pagefun` function supports a number of 2-D operations, as well as most of the scalar operations supported by `arrayfun`. Together, these functions allow you to vectorize a range of computations involving matrix algebra and array manipulation, eliminating the need for loops and making huge performance gains.

Wherever you are doing small calculations on GPU data in a loop, you should consider converting to a batch implementation in this way. This can also be an opportunity to make use of the GPU to improve performance where previously it gave no performance gains.

### Supporting Functions

#### Random Transform Function

The `randomTransforms` function sets up a map of object transforms and a start location for the robot for a number of objects in a room of specified dimensions.

```function Tform = randomTransforms(N,roomDimensions) Tform.T = zeros(3, N); Tform.R = zeros(3, 3, N); for i = 1:N Tform.T(:,i) = rand(3, 1) .* roomDimensions'; % To get a random orientation, we can extract an orthonormal % basis for a random 3-by-3 matrix. Tform.R(:,:,i) = orth(rand(3, 3)); end end```

#### Looping Transform Function

The `loopingTransform` function transforms every object to its location relative to the robot by looping over the transforms in turn.

```function Rel = loopingTransform(Robot,Map,numObjects) Rel.R = zeros(size(Map.R), 'like', Map.R); % Initialize memory Rel.T = zeros(size(Map.T), 'like', Map.T); % Initialize memory for i = 1:numObjects Rel.R(:,:,i) = Robot.R' * Map.R(:,:,i); Rel.T(:,i) = Robot.R' * (Map.T(:,i) - Robot.T); end end```

#### `pagefun` Transform Function

The `pagefunTransform` function transforms every object to its location relative to the robot by applying the transforms using the `pagefun` function.

```function Rel = pagefunTransform(Robot,Map) Rel.R = pagefun(@mtimes, Robot.R', Map.R); Rel.T = Robot.R' * (Map.T - Robot.T); end```

#### Nested Looping Transform Function

The `loopingTransform2` function performs a looping transform using two nested loops, to loop over the robots as well as over the objects. The transforms map every object to its location relative to every robot.

```function Rel = loopingTransform2(Robot,Map,numObjects,numRobots) Rel.R = zeros(3, 3, numObjects, numRobots, 'like', Map.R); Rel.T = zeros(3, numObjects, numRobots, 'like', Map.T); for i = 1:numObjects for j = 1:numRobots Rel.R(:,:,i,j) = Robot.R(:,:,1,j)' * Map.R(:,:,i); Rel.T(:,i,j) = ... Robot.R(:,:,1,j)' * (Map.T(:,i) - Robot.T(:,1,j)); end end end```

#### Two-call `pagefun` Transform Function

The `pagefunTransform2` function performs transforms to map every object to its location relative to every robot using two calls to the `pagefun` function.

```function Rel = pagefunTransform2(Robot, Map) Rt = pagefun(@transpose, Robot.R); Rel.R = pagefun(@mtimes, Rt, Map.R); Rel.T = pagefun(@mtimes, squeeze(Rt), ... (Map.T - Robot.T)); end```