Parallel GPU operations increase execution time

Hello,
I'm experiencing a problem when running more than one MATLAB instance that operates on the GPU. Going from one to 3 MATLAB instances, the execution time for my code triples, which makes running simulations in parallel pretty much pointless.
I am using MATLAB R2014a on openSUSE 13.1 with 2 CPUs (E5-2650 v2 @ 2.6 GHz) and 64 GB RAM. This is the gpuDevice output:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'Tesla K20c'
Index: 1
ComputeCapability: '3.5'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 5.3685e+09
FreeMemory: 3.1440e+09
MultiprocessorCount: 13
ClockRateKHz: 705500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Here's the code I'm using for my benchmarks.
CPU Benchmark
tic;
n = 3e6;
x = randn(n,2) + 1j*randn(n,2);   % complex input signal
y = randn(n,2);                   % real multiplier
loops = 50;
for i = 1:loops
    z = fft(x.*y);
    x = ifft(z.*2);
end
toc;
GPU Benchmark
tic;
n = 3e6;
x = randn(n,2) + 1j*randn(n,2);
y = randn(n,2);
loops = 1000;                     % note: 20x the iterations of the CPU benchmark
x = gpuArray(x);                  % transfer inputs to the GPU
y = gpuArray(y);
for i = 1:loops
    z = fft(x.*y);
    x = ifft(z.*2);
end
x = gather(x);                    % transfer back; also synchronizes before toc
toc;
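As an aside, tic/toc is only reliable here because the final gather forces all queued GPU work to finish before the timer stops. A sketch of the same measurement using gputimeit (available since R2013b), which handles synchronization and repetition internally, could look like this:
% Sketch: time one fft/ifft pass from the loop body with gputimeit.
n = 3e6;
x = gpuArray(randn(n,2) + 1j*randn(n,2));
y = gpuArray(randn(n,2));
onePass = @() ifft(fft(x.*y).*2);   % one iteration of the benchmark loop
t = gputimeit(onePass);             % synchronizes and averages several runs
fprintf('One pass: %.4f s on the GPU\n', t);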
CPU Performance Output
Here's the output for one MATLAB instance:
Elapsed time is 31.025174 seconds.
And here for 3 parallel MATLAB instances (the execution time is similar on the other instances):
Elapsed time is 38.570451 seconds.
This is fine and works as expected. Now for the GPU performance.
GPU Performance Output
One MATLAB instance:
Elapsed time is 17.168247 seconds.
3 parallel MATLAB instances (the execution time is again similar on the other instances):
Elapsed time is 49.008788 seconds.
So the execution time almost triples with 3 parallel MATLAB instances.
Some more debug output:
tueilnt-sim01:/ # nvidia-smi
Wed Oct 15 18:18:15 2014
+------------------------------------------------------+
| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20c Off | 0000:03:00.0 Off | Off |
| 48% 61C P0 143W / 225W | 2121MiB / 5119MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce 210 Off | 0000:82:00.0 N/A | N/A |
| N/A 47C P8 N/A / N/A | 2MiB / 511MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 3074 /nas/ei-tools/matlab/R2014a/bin/glnxa64/MATLAB 700MiB |
| 0 3134 /nas/ei-tools/matlab/R2014a/bin/glnxa64/MATLAB 700MiB |
| 0 3182 /nas/ei-tools/matlab/R2014a/bin/glnxa64/MATLAB 700MiB |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
tueilnt-sim01:/ # /usr/local/cuda-6.5/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-6.5/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Tesla K20c"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 5120 MBytes (5368512512 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce 210"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 511 MBytes (536150016 bytes)
( 2) Multiprocessors, ( 8) CUDA Cores/MP: 16 CUDA Cores
GPU Clock rate: 1402 MHz (1.40 GHz)
Memory Clock rate: 625 Mhz
Memory Bus Width: 32-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 130 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 2, Device0 = Tesla K20c, Device1 = GeForce 210
Result = PASS
Does anybody have an idea of what I might be doing wrong?
Thanks a lot!
Tobias
  5 Comments
Tobias on 16 Oct 2014
Hello,
Thanks for the comment. I think I had a wrong understanding of whether a GPU (with huge memory) can be shared by multiple processes. Apparently that's not the case, at least not out of the box. Lesson learned: the next sim machine will have several separate GPUs with smaller memory instead of one huge one.
BTW, the code I'm running is actually pretty close to my example here: it's the split-step Fourier method for numerically solving the nonlinear Schrödinger equation.
Regards, Tobias
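For readers unfamiliar with the method: split-step Fourier propagation alternates a linear dispersion step in the frequency domain with a nonlinear step in the time domain, which is why the inner loop is essentially fft, multiply, ifft, multiply, just like the benchmark above. A minimal sketch for the scalar nonlinear Schrödinger equation (all parameters are illustrative placeholders, not Tobias's actual settings):
% Minimal split-step sketch for dA/dz = -1i*(beta2/2)*d2A/dt2 + 1i*gamma*|A|^2*A.
n      = 2^12;                          % number of time samples
T      = 100;                           % width of the time window
t      = (-n/2:n/2-1).' * (T/n);        % time grid
omega  = 2*pi*[0:n/2-1, -n/2:-1].'/T;   % FFT-ordered angular frequencies
beta2  = -1;                            % group-velocity dispersion (placeholder)
gamma  = 1;                             % nonlinear coefficient (placeholder)
dz     = 1e-3;                          % propagation step size
nsteps = 1000;
A = gpuArray(sech(t));                  % fundamental soliton as initial field
D = exp(1i*(beta2/2)*omega.^2*dz);      % per-step linear (dispersion) operator
for k = 1:nsteps
    A = ifft(D .* fft(A));              % dispersion step in the frequency domain
    A = A .* exp(1i*gamma*abs(A).^2*dz);% nonlinear step in the time domain
end
A = gather(A);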
matt dash on 16 Oct 2014
I ran your experiment on my GPU, and I think it's working as expected. The profiler shows that when I run the code simultaneously in two MATLAB instances, the fft and ifft lines take roughly 2x as long. There is no indication that the GPU is only running one of them at a time; it appears to be doing both simultaneously. So I think the simple answer is that my 2700 cores are just having to share the work, so it takes twice as long when I have it running two times simultaneously.
I don't think there's a fundamental issue with multiple MATLAB instances sharing the GPU, but there's only so much processing power to go around. I would expect that two separate GPUs with 1350 cores each would perform about the same. I suspect that in the long run you'll get more use out of one really good GPU than out of multiple not-so-good ones, since one good one can handle multiple tasks, while getting multiple GPUs to collaborate on the same process is more complicated (if it's possible at all).
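Along those lines: if the next machine does have several GPUs, each MATLAB instance can be pinned to its own device before any gpuArray work is done. A minimal sketch, where instanceId is a hypothetical per-instance setting (e.g. 1, 2, 3 for three instances):
instanceId = 1;                               % set differently in each MATLAB instance
nGPUs = gpuDeviceCount;                       % number of CUDA devices visible
g = gpuDevice(mod(instanceId-1, nGPUs) + 1);  % round-robin device selection
fprintf('This instance uses GPU %d: %s\n', g.Index, g.Name);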


Answers (0)
