GPU parallelization for 4x4 matrices

Petr on 2 May 2011
Hi everybody,
I have a set of about 1000 matrix equations. Each equation is just a multiplication of several 4x4 matrices (typically 5-10, the same for all equations). I'd like to try parallel computation on the GPU to speed up the process, but all the examples I've found are for large matrices, not for sets of small matrices. On the other hand, the GPU is designed exactly to handle 4x4 matrices, so I guess this could be an ideal task for it. I'd like to know if there is a way to do it (note, I cannot use 'parfor' because I don't have 'Jacket'). Thanks, Petr
  4 Comments
James Tursa on 3 May 2011
Do you have separate variables named A1, A2, ... , A1000 etc? Or are they actually part of a cell array? Or 2D slices of a 4x4x1000 array?
James Tursa on 3 May 2011
You could use a custom mex function for this, but the implementation will depend on how you have organized the data per my previous question. You could also use OpenMP to explicitly parallelize it.


Answers (2)

Eric Johnson on 3 May 2011
The Parallel Computing Toolbox for MATLAB includes support for both PARFOR loops and GPU computations. The GPU interface in the Parallel Computing Toolbox is a collection of high-level, easy-to-use GPU functions. The problem as described maps very well to GPU hardware, which favors calculations that can be partitioned into groups of 16 to 32 elements, so your set of 4x4 matrix multiplications is a perfect fit for the hardware.
However, exploiting this requires a finer level of control than what is available with the built-in GPU functions in the Parallel Computing Toolbox; those functions are intended to accelerate calculations on large matrices, whereas your problem, as you noted, is a collection of N 4-by-4 matrices with N >= 1000. Alternatively, you can write a custom CUDA kernel in C. The toolbox has an interface that makes it easy to call CUDA kernels directly from MATLAB. This is the fastest option from a computational standpoint, but the slowest from a development standpoint.
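As a rough illustration of that custom-kernel route, one natural mapping assigns one GPU thread per equation. Everything below is a sketch under assumptions, not toolbox code: the kernel name, the two-operand chain (extend it to your 5-10 matrices), and the layout of n contiguous column-major 4x4 slices per array are all illustrative.

```cuda
// One thread computes one equation F = A*B for its slice s.
// Each array holds n contiguous column-major 4x4 slices (16 doubles each),
// matching a 4x4xn MATLAB array. All names are hypothetical.
__global__ void chainMul4(const double *A, const double *B, double *F, int n)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;  // slice (equation) index
    if (s >= n) return;

    const double *a = A + 16*s;
    const double *b = B + 16*s;
    double       *f = F + 16*s;

    for (int j = 0; j < 4; j++)          // column of the result
        for (int i = 0; i < 4; i++) {    // row of the result
            double t = 0.0;
            for (int k = 0; k < 4; k++)
                t += a[i + 4*k] * b[k + 4*j];
            f[i + 4*j] = t;
        }
}
```

On the MATLAB side, the toolbox interface mentioned above loads the compiled kernel with something like k = parallel.gpu.CUDAKernel('chainMul4.ptx', 'chainMul4.cu'), after which feval(k, ...) launches it on gpuArray data.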
Finally, as John noted, PARFOR is not ideal for this type of problem - the loop body is not computationally intensive enough to exceed the parallelization overhead. Nevertheless, MATLAB performs automatic acceleration for you via multithreading. The example that you gave (i = 1:1000) requires only 7.4 ms on my dual-core notebook, and < 1 s for N = 100000. As described, you need a very large collection of A{i}, B{i}, ..., E{i} matrices before the problem becomes slow enough for a human to notice!
N = 1000;

% Preallocate cell arrays of 4x4 operands and the result
A = cell(N,1); B = cell(N,1); C = cell(N,1);
D = cell(N,1); E = cell(N,1); F = cell(N,1);

for k = 1:N
    A{k} = rand(4,4);
    B{k} = rand(4,4);
    C{k} = rand(4,4);
    D{k} = rand(4,4);
    E{k} = rand(4,4);
end

% Time the chained products
tic
for k = 1:N
    F{k} = A{k} * B{k} * C{k} * D{k} * E{k};
end
toc
>> Elapsed time is 0.007438 seconds.

John Melonakos on 3 May 2011
I'm one of the guys working on Jacket. To clarify, the only way to accomplish what you are looking to do (without writing a bunch of low-level CUDA code yourself) is via Jacket's GFOR loop. The GFOR loop is effective because it runs all the 4x4 operations in one big batch, which lets it benefit from the data parallelism that GPUs offer. PARFOR is not involved in that operation and would likely only slow this problem down due to communication overhead.
  1 Comment
Petr on 3 May 2011
Thanks, I guess the GFOR loop is probably the solution (sorry for my mistake with PARFOR), but is there any non-Jacket solution using just the Parallel Computing Toolbox?

