# How to make sense of GPU timing results?

6 views (last 30 days)
Martin Kuhnel on 7 Aug 2019
Commented: Joss Knight on 10 Aug 2019
Hello,
I have a MATLAB function that returns the wavefunction given a matrix w and vectors a, b, and v. The matrix is variable in size but at most 100x200, in which case a and v are 100x1 and b is 200x1.
```matlab
function wave = wave_fun(v, a, b, w)
wave = -(a.'*v) + sum(log(cosh(w.'*v + b)));
end
```
This function is quick, but it must be called up to a million times in the overarching algorithm, so I would like to speed it up on a GPU. The first method I am trying is to simply use gpuArray to put v, a, b, and w onto the GPU. They should be able to stay on the GPU throughout the overarching algorithm, so I should only need to transfer them once, not every time the function is called. For a second approach, I wrote a CUDA implementation of this code and loaded it as a kernel into MATLAB. Now I am attempting to benchmark these methods but am getting some interesting results. The code is below:
```matlab
N = 100;
M = 200;
dev = gpuDevice;
cudaname = 'wave.cu';
ptxname = 'wave.ptx';
kernel = parallel.gpu.CUDAKernel(ptxname,cudaname);
kernel.GridSize = [1,1];
w_mat = make_w(N,M);
a_vec = make_a(N);
b_vec = make_b(M);
v_vec = rand(N,1);
for i = 1:N
    if v_vec(i) > 0.5
        v_vec(i) = 1;
    else
        v_vec(i) = -1;
    end
end
tic
wave_cpu = wave_fun(v_vec,a_vec,b_vec,w_mat);
cpu_time1 = toc
cpu_time2 = timeit(@() wave_fun(v_vec,a_vec,b_vec,w_mat))
v_gpu = gpuArray(v_vec);
a_gpu = gpuArray(a_vec);
b_gpu = gpuArray(b_vec);
w_gpu = gpuArray(w_mat);
tic
wave_gpu1 = wave_fun(v_gpu,a_gpu,b_gpu,w_gpu);
wait(dev)
gpu1_time1 = toc
gpu1_time2 = gputimeit(@() wave_fun(v_gpu,a_gpu,b_gpu,w_gpu))
w_flat = reshape(w_mat,1,[]);
wr_gpu = gpuArray(real(w_flat));
ar_gpu = gpuArray(real(a_vec).');
br_gpu = gpuArray(real(b_vec).');
wi_gpu = gpuArray(imag(w_flat));
ai_gpu = gpuArray(imag(a_vec).');
bi_gpu = gpuArray(imag(b_vec).');
v_gpu2 = gpuArray(v_vec.');
wave_gpu2 = zeros(1,2,'gpuArray');
tic
wave_gpu2 = feval(kernel,wave_gpu2,v_gpu2,ar_gpu,ai_gpu,br_gpu,bi_gpu,wr_gpu,wi_gpu,N,M);
wait(dev)
gpu2_time1 = toc
gpu2_time2 = gputimeit(@() feval(kernel,wave_gpu2,v_gpu2,ar_gpu,ai_gpu,br_gpu,bi_gpu,wr_gpu,wi_gpu,N,M))
```
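The question does not show wave.cu itself. For readers trying to reproduce the comparison, here is a minimal single-thread sketch of what such a kernel could look like, reconstructed purely from the MATLAB call site (argument order, split real/imaginary arrays, and a 1x2 output for the real and imaginary parts); the body is an assumption, not the author's actual kernel:

```cuda
// wave_sketch.cu -- hypothetical reconstruction of wave.cu from the
// feval call in the question. Single-thread: GridSize = [1,1] and the
// default ThreadBlockSize launch exactly one thread.
extern "C" __global__ void wave(
    double *out,                          // out[0] = real part, out[1] = imag part
    const double *v,                      // 1xN, entries are +/-1
    const double *ar, const double *ai,   // real/imag of a, 1xN
    const double *br, const double *bi,   // real/imag of b, 1xM
    const double *wr, const double *wi,   // real/imag of w, N*M, column-major
    int N, int M)
{
    double re = 0.0, im = 0.0;

    // -(a.' * v): a is complex, v is real
    for (int i = 0; i < N; ++i) {
        re -= ar[i] * v[i];
        im -= ai[i] * v[i];
    }

    // sum(log(cosh(w.' * v + b))) over the M columns of w
    for (int j = 0; j < M; ++j) {
        double zr = br[j], zi = bi[j];
        for (int i = 0; i < N; ++i) {       // column j of the N-by-M matrix w
            zr += wr[j * N + i] * v[i];
            zi += wi[j * N + i] * v[i];
        }
        // cosh(zr + i*zi) = cosh(zr)cos(zi) + i*sinh(zr)sin(zi)
        double cr = cosh(zr) * cos(zi);
        double ci = sinh(zr) * sin(zi);
        // principal log of a complex number: 0.5*log(|c|^2) + i*atan2(ci, cr)
        re += 0.5 * log(cr * cr + ci * ci);
        im += atan2(ci, cr);
    }

    out[0] = re;
    out[1] = im;
}
```

A single-thread kernel like this does all the work serially on one CUDA core, so its only advantage over the gpuArray approach is lower launch overhead, which is consistent with the timings below.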
On an NVIDIA Quadro M4000, I get these results:
```matlab
cpu_time1 =
   0.0067
cpu_time2 =
   3.2827e-05
gpu1_time1 =
   0.3628
gpu1_time2 =
   1.9738e-04
gpu2_time1 =
   9.8100e-04
gpu2_time2 =
   1.1938e-04
```
These results are quite surprising to me and have left me with a few questions:
1. The timeit and gputimeit results are consistently faster, which makes sense given that those functions are designed to produce reliable timings of a single call, but which time should I expect to achieve in my code? For example, if I call this function within a for loop with different inputs each time, should I assume the average time per call to be closer to the tic/toc time or the timeit time? It is quite a big difference!
2. How is it that the first gpuArray approach is so slow? The time taken to transfer the matrix and vectors to the GPU is not even included, yet it is much slower than the CPU. Even the gputimeit time is slightly longer. Is there a good reason for this?
3. What is the best option here? The CUDA version managed a faster tic/toc time but a slower gputimeit time. I guess this goes back to question 1.
Any help is greatly appreciated! Thank you!
##### 2 Comments
Joss Knight on 10 Aug 2019
In answer to your question, the only way to know which technique is fastest for your workflow is to time your actual workflow. timeit is representative for code that is run repeatedly. But if your sizes are constantly changing, your simple test will not capture the extra overheads that this will incur.
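To make this concrete, a minimal sketch of timing the function the way the algorithm actually uses it, assuming repeated calls in a loop. The `nCalls` loop and `wave_loop` helper are hypothetical stand-ins for the real algorithm; `wave_fun` and the `*_gpu` arrays are from the question:

```matlab
% Time the amortized cost of many calls rather than a single one.
% gputimeit runs the whole body and synchronizes the device, so the
% result includes per-call launch overhead averaged over nCalls.
nCalls = 1000;
loopTime = gputimeit(@() wave_loop(v_gpu, a_gpu, b_gpu, w_gpu, nCalls));
perCall = loopTime / nCalls   % closer to what the full algorithm will pay

function s = wave_loop(v, a, b, w, n)
% Hypothetical stand-in for the overarching algorithm; in the real
% code the inputs would change between iterations.
s = 0;
for k = 1:n
    s = s + wave_fun(v, a, b, w);   % arrays stay resident on the GPU
end
end
```

If the per-call time from this loop is close to the gputimeit time of a single call, launch overhead dominates and the warm-cache timings are representative; if it is closer to the tic/toc figure, the one-off costs matter.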