GPU time slower than CPU time, what went wrong with my GPU implementation?

Ruby Fu on 19 Jan 2012
Commented: ALysko on 14 Apr 2015
Hi all, I have been testing the GPU computing feature in MATLAB. The code below runs and times a large (1024x1024) matrix multiplication on both the CPU and the GPU:
A = rand(1024);
gA = gpuArray(A);
% warm up
for i = 1:10
    C = A*A;
    gC = gA*gA;
end
tic; C = A*A; toc
tic; gC = gA*gA; toc
After many trials, the CPU consistently turns out to be faster than the GPU. I am surprised, because a user on Stack Overflow ran the same test and found the GPU to be faster:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
The only reason I can think of is that we are using different GPUs. That user has a Tesla C2070, while the laptop I am using is a Dell Inspiron 17R (NVIDIA GeForce GT 525M).
Could it be that, with a lesser GPU, the computation is actually slower than on the CPU?
Thank you! Ruby
  1 Comment
ALysko on 14 Apr 2015
A bit of extra info regarding double precision performance:
Tesla C2070 and GeForce GT 525M are two very different GPUs (single / double precision):
Tesla C2070: 1.03 TFlops / 0.515 TFlops
GeForce GT 525M: 0.23 TFlops / 0.031 TFlops
Titan Black may need a manual switch to enable full double precision:
1) The web page http://nvidianews.nvidia.com/news/nvidia-introduces-geforce-gtx-titan-dna-of-the-world-s-fastest-supercomputer-powered-by-world-s-fastest-gpu and page 44 of the PDF "GeForce-Update-Feb-2014.pdf" say that the Titan Black has 5.1 Teraflops single precision and 1.3 Teraflops double precision.
2) The web page http://www.bit-tech.net/news/hardware/2014/02/18/nvidia-gtx-titan-black-launched/1 compares the Titan Black to the original Titan (the card tested by MathWorks):
Titan Black: 5.1 TFlops / 1.2 TFlops
Titan: 4.5 TFlops / 1.3 TFlops
(Thus, the MathWorks benchmarks for the Titan should be similar to, or worse than, benchmarks for the Titan Black.)
3) The page https://devtalk.nvidia.com/default/topic/716573/gtx-titan-double-precision-flops-way-off-specs/ talks specifically about the MathWorks benchmarks with gpuBench():
Before any changes (default settings):
                   MTimes_D  Backslash_D  FFT_D  MTimes_S  Backslash_S  FFT_S
Tesla C2075           333        246        73      696        435       163
GF GTX TITAN          223         82        77     3635        179       252
After switching the card into double-precision mode in the NVIDIA Control Panel:
                   MTimes_D  Backslash_D  FFT_D  MTimes_S  Backslash_S  FFT_S
Tesla C2075           333        246        73      696        435       163
GeForce GTX TITAN    1285        128       146     3423        182       227
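To see which figures apply to your own card, MATLAB can report the device properties directly. A quick sketch (Name, ComputeCapability, MultiprocessorCount, and SupportsDouble are standard fields of the object returned by gpuDevice):

```matlab
d = gpuDevice;   % query the currently selected GPU
fprintf('Device: %s (compute capability %s)\n', d.Name, d.ComputeCapability);
fprintf('Multiprocessors: %d\n', d.MultiprocessorCount);
fprintf('Supports double precision: %d\n', d.SupportsDouble);
```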


Accepted Answer

Ben Tordoff on 20 Jan 2012
Hi Ruby,
I've just uploaded a benchmarking tool to the File Exchange which runs a whole load of these kinds of timings to put your GPU in context with others on the market.
One thing to bear in mind is that virtually all GPUs that aren't explicitly designed for scientific computing are optimized for single-precision maths (as is used by OpenGL etc.). GeForce cards, mobile or otherwise, are quite good for single-precision performance but usually about 8x worse for double. MATLAB defaults to using double-precision everywhere. Of the NVIDIA cards, only the Tesla and top-end Quadro series do well at double-precision. Add to that the fact that a mobile GPU typically has far fewer cores than a desktop one, and I'd be amazed if you saw any significant speed-ups compared to a modern mobile CPU when doing double-precision maths.
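Two quick checks along these lines (a sketch, not verified on your hardware): first, synchronize the GPU before stopping the timer, since gpuArray operations are launched asynchronously and tic/toc can otherwise measure little more than the kernel launch; second, repeat the comparison in single precision, where a GeForce card should look much better. Note that wait(gpuDevice) and gputimeit appeared in releases after this thread; in older versions, gathering the result forces synchronization instead.

```matlab
A  = rand(1024);       % MATLAB defaults to double precision
gA = gpuArray(A);

% Double precision, with the GPU synchronized before toc:
tic; gC = gA * gA; wait(gpuDevice); toc

% Single precision: cast explicitly before timing.
As  = single(A);
gAs = gpuArray(As);
tic; Cs  = As  * As;  toc
tic; gCs = gAs * gAs; wait(gpuDevice); toc
```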
Anyway, give the benchmark a try and let us all know what you find.
Cheers
Ben

More Answers (1)

Walter Roberson on 19 Jan 2012
Your GeForce GT 525M would be handling the graphics rendering, whereas the Tesla probably would not be handling graphics (and can be specifically configured to take it off graphics duties, I seem to recall.)
The GT 525M has 96 cores at up to 1.2 GHz; the Tesla C2070 has 448 cores at 1.15 GHz -- nearly five times as many cores.
  2 Comments
Walter Roberson on 19 Jan 2012
I only know some broad outlines of how things work. I know that the time to load and unload the data can overwhelm the benefits of using GPUs. Large enough matrix multiplies done on the CPU are normally farmed out to an optimized multithreaded BLAS. The trade-off point of "large enough" could in theory depend on which CPU you are using, but I do not know whether MATLAB takes that into account. You would need to know the relative CPU capabilities to compare GPU/CPU figures meaningfully.
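The transfer cost is easy to see by timing the copies separately. A sketch (wait(gpuDevice), available in newer releases, keeps the compute timing from being cut short by the asynchronous launch):

```matlab
A = rand(2048);
tic; gA = gpuArray(A);              toc   % host -> device copy
tic; gC = gA * gA; wait(gpuDevice); toc   % compute only
tic; C  = gather(gC);               toc   % device -> host copy
```

For small matrices, the two copies can easily dominate the multiply itself.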
I believe AccelerEyes' Jacket has been benchmarked as faster than MATLAB's native GPU support.

