Parallel GPU operations increase execution time

Hello,
I'm experiencing a problem when running more than one MATLAB instance that operates on the GPU. Going from one to 3 MATLAB instances, the execution time for my code triples, which makes running simulations in parallel pretty much pointless.
I am using MATLAB R2014a on openSUSE 13.1 with 2 CPUs (E5-2650 v2 @ 2.6 GHz) and 64 GB RAM. This is the gpuDevice output:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'Tesla K20c'
Index: 1
ComputeCapability: '3.5'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 5.3685e+09
FreeMemory: 3.1440e+09
MultiprocessorCount: 13
ClockRateKHz: 705500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Here's the code I'm using for my benchmarks.
CPU Benchmark
tic;
n = 3e6;
x = randn(n,2) + 1j*randn(n,2);   % complex input signal
y = randn(n,2);                   % real multiplier
loops = 50;
for i = 1:loops
    z = fft(x.*y);
    x = ifft(z.*2);
end
toc;
GPU Benchmark
tic;
n = 3e6;
x = randn(n,2) + 1j*randn(n,2);
y = randn(n,2);
loops = 1000;                     % note: 20x the iterations of the CPU benchmark
x = gpuArray(x);                  % transfer inputs to the GPU
y = gpuArray(y);
for i = 1:loops
    z = fft(x.*y);
    x = ifft(z.*2);
end
x = gather(x);                    % transfer back; also synchronizes before toc
toc;
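As an aside, tic/toc is only reliable here because the final gather forces all queued GPU work to finish before the timer stops. A sketch of the same measurement using gputimeit (available since R2013b), which handles synchronization and repetition internally, could look like this:
% Sketch: time one fft/ifft pass from the loop body with gputimeit.
n = 3e6;
x = gpuArray(randn(n,2) + 1j*randn(n,2));
y = gpuArray(randn(n,2));
onePass = @() ifft(fft(x.*y).*2);   % one iteration of the benchmark loop
t = gputimeit(onePass);             % synchronizes and averages several runs
fprintf('One pass: %.4f s on the GPU\n', t);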
CPU Performance Output
Here's the output for one MATLAB instance:
Elapsed time is 31.025174 seconds.
And here for 3 parallel MATLAB instances (the execution time is similar on the other instances):
Elapsed time is 38.570451 seconds.
This is fine and works as expected. Now for the GPU performance.
GPU Performance Output
One MATLAB instance:
Elapsed time is 17.168247 seconds.
3 parallel MATLAB instances (the execution time is again similar on the other instances):
Elapsed time is 49.008788 seconds.
So the execution time almost triples with 3 parallel MATLAB instances.
Some more debug output:
tueilnt-sim01:/ # nvidia-smi
Wed Oct 15 18:18:15 2014
+------------------------------------------------------+
| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20c Off | 0000:03:00.0 Off | Off |
| 48% 61C P0 143W / 225W | 2121MiB / 5119MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce 210 Off | 0000:82:00.0 N/A | N/A |
| N/A 47C P8 N/A / N/A | 2MiB / 511MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 3074 /nas/ei-tools/matlab/R2014a/bin/glnxa64/MATLAB 700MiB |
| 0 3134 /nas/ei-tools/matlab/R2014a/bin/glnxa64/MATLAB 700MiB |
| 0 3182 /nas/ei-tools/matlab/R2014a/bin/glnxa64/MATLAB 700MiB |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
tueilnt-sim01:/ # /usr/local/cuda-6.5/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-6.5/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Tesla K20c"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 5120 MBytes (5368512512 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce 210"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 511 MBytes (536150016 bytes)
( 2) Multiprocessors, ( 8) CUDA Cores/MP: 16 CUDA Cores
GPU Clock rate: 1402 MHz (1.40 GHz)
Memory Clock rate: 625 Mhz
Memory Bus Width: 32-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 130 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 2, Device0 = Tesla K20c, Device1 = GeForce 210
Result = PASS
Does anybody have an idea of what I might be doing wrong?
Thanks a lot!
Tobias
  5 Comments
Tobias on 16 Oct 2014
Hello,
Thanks for the comment. I think I had a wrong understanding of whether a GPU (with huge memory) can be shared by multiple processes. Apparently that's not the case, at least not out of the box. Lesson learned: the next sim machine will have several separate GPUs with smaller memory instead of one huge one.
BTW, the code I'm running is actually pretty close to my example here: it's the split-step Fourier method for numerically solving the nonlinear Schrödinger equation.
Regards, Tobias
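For readers unfamiliar with the method: split-step Fourier propagation alternates a linear dispersion step in the frequency domain with a nonlinear step in the time domain, which is why the inner loop is essentially fft, multiply, ifft, multiply, just like the benchmark above. A minimal sketch for the scalar nonlinear Schrödinger equation (all parameters are illustrative placeholders, not Tobias's actual settings):
% Minimal split-step sketch for dA/dz = -1i*(beta2/2)*d2A/dt2 + 1i*gamma*|A|^2*A.
n      = 2^12;                          % number of time samples
T      = 100;                           % width of the time window
t      = (-n/2:n/2-1).' * (T/n);        % time grid
omega  = 2*pi*[0:n/2-1, -n/2:-1].'/T;   % FFT-ordered angular frequencies
beta2  = -1;                            % group-velocity dispersion (placeholder)
gamma  = 1;                             % nonlinear coefficient (placeholder)
dz     = 1e-3;                          % propagation step size
nsteps = 1000;
A = gpuArray(sech(t));                  % fundamental soliton as initial field
D = exp(1i*(beta2/2)*omega.^2*dz);      % per-step linear (dispersion) operator
for k = 1:nsteps
    A = ifft(D .* fft(A));              % dispersion step in the frequency domain
    A = A .* exp(1i*gamma*abs(A).^2*dz);% nonlinear step in the time domain
end
A = gather(A);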
matt dash on 16 Oct 2014
I ran your experiment on my GPU, and I think it's working as expected. The profiler shows that when I run the code simultaneously in two MATLAB instances, the fft and ifft lines take roughly 2x as long. There is no indication that the GPU is only running one of them at a time; it appears to be doing both simultaneously. So I think the simple answer is that my 2700 cores are just having to share the work, so it takes twice as long when I have it running two times simultaneously.
I don't think there's a fundamental issue with multiple MATLAB instances sharing the GPU, but there's only so much processing power to go around. I would expect that two separate GPUs with 1350 cores each would perform about the same. I suspect that in the long run you'll get more use out of one really good GPU than out of multiple not-so-good ones, since one good one can handle multiple tasks, while getting multiple GPUs to collaborate on the same process is more complicated (if it's possible at all).
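Along those lines: if the next machine does have several GPUs, each MATLAB instance can be pinned to its own device before any gpuArray work is done. A minimal sketch, where instanceId is a hypothetical per-instance setting (e.g. 1, 2, 3 for three instances):
instanceId = 1;                               % set differently in each MATLAB instance
nGPUs = gpuDeviceCount;                       % number of CUDA devices visible
g = gpuDevice(mod(instanceId-1, nGPUs) + 1);  % round-robin device selection
fprintf('This instance uses GPU %d: %s\n', g.Index, g.Name);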


Answers (0)
