GPU time slower than CPU time, what went wrong with my GPU implementation?

Ruby Fu on 19 Jan 2012
Commented: ALysko on 14 Apr 2015
Hi all, I have been testing the GPU computing feature in MATLAB. The code below runs and times a large (1024x1024) matrix multiplication on both the CPU and the GPU:
A = rand(1024);
gA = gpuArray(A);
% warm up
for i = 1:10
    C = A*A;
    gC = gA*gA;
end
tic; C = A*A; toc
tic; gC = gA*gA; toc
After many trials, the CPU consistently turns out to be faster than the GPU. I am surprised, because a user on Stack Overflow ran the same test and found the GPU to be faster:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
The only reason I can think of is that we are using different GPUs. That user has a Tesla C2070, while the laptop I am using is a Dell Inspiron 17R (NVIDIA GeForce GT 525M).
Could it be that, with a lesser GPU, the computation is actually slower than on the CPU?
Thank you! Ruby
  1 Comment
ALysko on 14 Apr 2015
A bit of extra info regarding double precision performance:
Tesla C2070 and GeForce GT 525M are two very different GPUs (single / double precision):
Tesla C2070: 1.03 TFlops / 0.515 TFlops
GeForce GT 525M: 0.23 TFlops / 0.031 TFlops
Titan Black may need a manual switch to enable full double precision:
1) The web page http://nvidianews.nvidia.com/news/nvidia-introduces-geforce-gtx-titan-dna-of-the-world-s-fastest-supercomputer-powered-by-world-s-fastest-gpu and page 44 of the PDF "GeForce-Update-Feb-2014.pdf" say that the Titan Black has 5.1 Teraflops single precision and 1.3 Teraflops double precision.
2) The web page http://www.bit-tech.net/news/hardware/2014/02/18/nvidia-gtx-titan-black-launched/1 compares the Titan Black to the original Titan (the card tested by MathWorks):
Titan Black: 5.1 TFlops / 1.2 TFlops
Titan: 4.5 TFlops / 1.3 TFlops
(Thus, the MathWorks benchmarks for the Titan should be similar to, or worse than, benchmarks for the Titan Black.)
3) The page https://devtalk.nvidia.com/default/topic/716573/gtx-titan-double-precision-flops-way-off-specs/ talks specifically about the MathWorks benchmarks with gpuBench():
Before any changes (default settings):
                   MTimes_D  Backslash_D  FFT_D  MTimes_S  Backslash_S  FFT_S
Tesla C2075           333        246        73      696        435       163
GF GTX TITAN          223         82        77     3635        179       252
After switching the card into double-precision mode in the NVIDIA Control Panel:
                   MTimes_D  Backslash_D  FFT_D  MTimes_S  Backslash_S  FFT_S
Tesla C2075           333        246        73      696        435       163
GeForce GTX TITAN    1285        128       146     3423        182       227
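To see which figures apply to your own card, MATLAB can report the device properties directly. A quick sketch (Name, ComputeCapability, MultiprocessorCount, and SupportsDouble are standard fields of the object returned by gpuDevice):

```matlab
d = gpuDevice;   % query the currently selected GPU
fprintf('Device: %s (compute capability %s)\n', d.Name, d.ComputeCapability);
fprintf('Multiprocessors: %d\n', d.MultiprocessorCount);
fprintf('Supports double precision: %d\n', d.SupportsDouble);
```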


Accepted Answer

Ben Tordoff on 20 Jan 2012
Hi Ruby,
I've just uploaded a benchmarking tool to the File Exchange which runs a whole load of these kinds of timings to put your GPU in context with others on the market.
One thing to bear in mind is that virtually all GPUs that aren't explicitly designed for scientific computing are optimized for single-precision maths (as is used by OpenGL etc.). GeForce cards, mobile or otherwise, are quite good for single-precision performance but usually about 8x worse for double. MATLAB defaults to using double-precision everywhere. Of the NVIDIA cards, only the Tesla and top-end Quadro series do well at double-precision. Add to that the fact that a mobile GPU typically has far fewer cores than a desktop one, and I'd be amazed if you saw any significant speed-ups compared to a modern mobile CPU when doing double-precision maths.
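Two quick checks along these lines (a sketch, not verified on your hardware): first, synchronize the GPU before stopping the timer, since gpuArray operations are launched asynchronously and tic/toc can otherwise measure little more than the kernel launch; second, repeat the comparison in single precision, where a GeForce card should look much better. Note that wait(gpuDevice) and gputimeit appeared in releases after this thread; in older versions, gathering the result forces synchronization instead.

```matlab
A  = rand(1024);       % MATLAB defaults to double precision
gA = gpuArray(A);

% Double precision, with the GPU synchronized before toc:
tic; gC = gA * gA; wait(gpuDevice); toc

% Single precision: cast explicitly before timing.
As  = single(A);
gAs = gpuArray(As);
tic; Cs  = As  * As;  toc
tic; gCs = gAs * gAs; wait(gpuDevice); toc
```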
Anyway, give the benchmark a try and let us all know what you find.
Cheers
Ben

More Answers (1)

Walter Roberson on 19 Jan 2012
Your GeForce GT 525M would be handling the graphics rendering, whereas the Tesla probably would not be handling graphics (and can be specifically configured to take it off graphics duties, I seem to recall.)
The GT 525M has 96 cores at up to 1.2 GHz; the Tesla C2070 has 448 cores at 1.15 GHz -- nearly five times as many cores.
  2 Comments
Walter Roberson on 19 Jan 2012
I only know some broad outlines of how things work. I know that the time to load and unload the data can overwhelm the benefits of using GPUs. Large enough matrix multiplies done on the CPU are normally farmed out to an optimized multithreaded BLAS. The trade-off point of "large enough" could in theory depend on which CPU you are using, but I do not know whether MATLAB takes that into account. You would need to know the relative CPU capabilities to compare GPU/CPU figures meaningfully.
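The transfer cost is easy to see by timing the copies separately. A sketch (wait(gpuDevice), available in newer releases, keeps the compute timing from being cut short by the asynchronous launch):

```matlab
A = rand(2048);
tic; gA = gpuArray(A);              toc   % host -> device copy
tic; gC = gA * gA; wait(gpuDevice); toc   % compute only
tic; C  = gather(gC);               toc   % device -> host copy
```

For small matrices, the two copies can easily dominate the multiply itself.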
I believe AccelerEyes' Jacket has been benchmarked as faster than MATLAB's native GPU support.

