gather takes really long after using a PTX file (CUDA)

I am trying to implement a matrix multiplication in CUDA via a PTX file to beat MATLAB's built-in functions. My .cu kernel computes the matrix multiplication faster than MATLAB's built-in function, but the gather command after running the kernel takes much longer than the gather after the plain MATLAB/gpuArray version.
Here is my MATLAB script comparing both:
g=gpuDevice
reset(g)
clear all
N=1024;
A=rand(N,N);
B=rand(N,N);
%gpuDevice using Matlab
A_gpu=gpuArray(A);%Create array on GPU
B_gpu=gpuArray(B);%Create array on GPU
tic
C_gpu=A_gpu*B_gpu;
toc
tic
C=gather(C_gpu);%copy array from GPU to CPU
toc
%now using CUDA
A=A';
a_gpu=gpuArray(A(:)');%create array on GPU, reshaped into a row vector
b_gpu=gpuArray(B(:)');%create array on GPU, reshaped into a row vector
c_gpu=gpuArray(zeros(N*N,1));
k = parallel.gpu.CUDAKernel('matrixmul.ptx', 'matrixmul.cu');
k.ThreadBlockSize = [N,1,1];
k.GridSize=[N,N];
tic
[o] = feval(k, c_gpu,a_gpu,b_gpu);
o=reshape(o,N,N);
toc
tic
c2=gather(o);%back to host
toc
%check
max(max(abs(C-c2)))
My .cu file looks like this:
__global__ void matrixmul(double *c, double *a, double *b)
{
    __shared__ double cache[1024];
    int cacheIndex = threadIdx.x;
    int Aind = threadIdx.x + blockIdx.x * gridDim.y;
    int Bind = threadIdx.x + blockIdx.y * gridDim.x;
    cache[cacheIndex] = a[Aind] * b[Bind];
    __syncthreads();
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }
    if (cacheIndex == 0)
        c[blockIdx.y * gridDim.y + blockIdx.x] = cache[0];
}
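The while loop is a standard shared-memory tree reduction: in each pass the first half of the cache absorbs the second half, halving the active range until cache[0] holds the block's sum. The same pattern, written as a serial MATLAB sketch for illustration:

```matlab
cache = rand(1024, 1);          % one partial product per "thread"
total = sum(cache);             % reference value
i = numel(cache) / 2;           % blockDim.x must be a power of two
while i ~= 0
    cache(1:i) = cache(1:i) + cache(i+1:2*i);
    i = i / 2;
end
abs(cache(1) - total)           % agrees up to rounding error
```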
In my version I work directly on vectors instead of matrices, and I transpose matrix A before starting the calculation so that the memory accesses inside the vectors are in a better order:
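For reference, MATLAB stores arrays column-major, so A(:) stacks the columns; transposing first makes the rows of the original matrix contiguous. A minimal illustration:

```matlab
A = [1 2; 3 4];
A(:)'        % columns in order: 1 3 2 4
At = A';
At(:)'       % rows of the original A: 1 2 3 4
```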
That's what I get back:
Elapsed time is 0.110911 seconds.
Elapsed time is 0.007010 seconds.
Elapsed time is 0.001937 seconds.
Elapsed time is 3.651635 seconds.
ans =
1.0800e-12
As you can see, the first gather call takes only 0.007 seconds, while the second one needs more than 3 seconds. Also, if I put all the kernel-calling code into a function, the call of that function itself takes a long time (without even reading the gpuArray).
Any suggestions on what is going wrong here?
Thanks
Robert

Answers (1)

James Lebak on 24 Jan 2013
In MATLAB R2012a and later, GPU functions execute asynchronously. To get accurate timings, you need to call the wait function to make sure that GPU execution has finished. To accurately measure the time taken by MATLAB's multiply or by your kernel, rewrite your code as follows:
tic
C_gpu=A_gpu*B_gpu;
wait(g); % g is the value returned by gpuDevice, above
toc
tic
[o] = feval(k, c_gpu,a_gpu,b_gpu);
o=reshape(o,N,N);
wait(g); % g is the value returned by gpuDevice, above
toc
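With wait in place, the kernel's execution time shows up in the kernel timing block instead of being absorbed into the following gather (which blocks until the kernel finishes). A sketch of separating the two timings, using the variable names from the question:

```matlab
% Time the kernel (synchronized) and the transfer separately.
tic
o = feval(k, c_gpu, a_gpu, b_gpu);
o = reshape(o, N, N);
wait(g);            % block until the kernel has actually finished
tKernel = toc;      % now this really is the kernel time
tic
c2 = gather(o);     % only the device-to-host copy remains
tGather = toc;      % should be comparable to the first gather
```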
