MATLAB Answers

Speed of masked matrix operations in 'single' vs 'double'

Asked by DGM on 10 Jul 2018. Latest activity: answered by Image Analyst on 11 Jul 2018.
I've been going through a lot of my tools, trying to make things faster and reduce memory use. I know that using double-precision floating point as my default working datatype is part of the memory problem, but I had also expected single precision to be faster.
Simple tests seem to indicate that it is (these times are all averages of many runs):
% 105.6ms for double
% 50.4ms for single
R=imfilter(bg,fs);
% 7.6ms for double
% 4.4ms for single
R=flipdim(bg,2);
% 45ms double
% 22ms single
R=bg.*fg;
% 4.5ms double
% 3.2ms single
R=fg.^2 + 2*bg.*fg.*(1-bg);
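Timings like these can be gathered consistently with timeit, which handles warm-up and averaging automatically. A minimal harness sketch (the array size and the filter kernel here are assumptions for illustration, not the original test data):

```matlab
% Hypothetical harness: compare one operation in double vs single.
% The 1000x1000 size and the Gaussian kernel are illustrative assumptions.
bg = rand(1000);                  % double test data
fs = fspecial('gaussian', 9, 2);  % requires Image Processing Toolbox

t_double = timeit(@() imfilter(bg, fs));
t_single = timeit(@() imfilter(single(bg), single(fs)));
fprintf('double: %.1f ms, single: %.1f ms\n', 1e3*t_double, 1e3*t_single);
```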
but operations involving masking via multiplication were significantly slower in single:
% 6.0ms double
% 25.7ms single!
hi=I>0.5;
R=(1-2*(1-I).*(1-M)).*hi + (2*M.*I).*~hi;
Explicitly casting the logical mask to numeric and avoiding the NOT operator speeds things up a bit, but either variant with numeric masks is still slower than using double with logical masks.
% 7.7ms double
% 9.8ms single
hi=single(I>0.5);
R=(1-2*(1-I).*(1-M)).*hi + (2*M.*I).*(1-hi);
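One way to keep the mask's class matched to the working class without hard-coding 'single' is to cast the mask against the image itself. A sketch, reusing the variable names from the post:

```matlab
% Cast the logical mask to whatever class I currently has, so the same
% code path works whether the working class is single or double.
hi = cast(I > 0.5, class(I));   % or cast(I > 0.5, 'like', I) in R2013a+
R  = (1 - 2*(1-I).*(1-M)).*hi + (2*M.*I).*(1-hi);
```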
You might ask why I'm masking via multiplication in the first place. Why not just use logical indexing? I used to do everything that way, but apparently overcalculation is faster than a bunch of logical indexing:
% 62.1ms double
% 53.7ms single
hi=I>0.5; lo=~hi;
R=zeros(size(I),'single'); % preallocate with appropriate wclass
R(lo)=2*I(lo).*M(lo);
R(hi)=1-2*(1-M(hi)).*(1-I(hi));
Am I misguided to expect reliable speed gains from single precision across a wide range of operations on different machines (this code will be used by others)? Comments like this make me think so.
That, and if I were to pursue this flexibility for the conservation of memory alone, is there a better approach to masked operations than what I've described?


2 Answers

Answer by Matt J on 10 Jul 2018. Edited by Matt J on 10 Jul 2018.

Am I misguided to expect reliable speed gains from using single-precision for a wide range of operations across different machines
Well, no, you're not misguided, assuming you're using a recent version of MATLAB, and your post demonstrates that you do achieve gains over a wide range of operations. Just not all of them.
I don't have a good explanation for the behavior, but I'm guessing that there are difficulties in writing multi-threaded code in a type-generic way.

  5 Comments

Multi-threading could also be a factor in this. Certain operations with certain classes may have multi-threaded compiled code behind them, while the same operation with a different class has only single-threaded code behind it. As MATLAB matures and the parser gets smarter, things like this can easily change between versions.
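One way to probe whether multithreading explains the gap is to restrict MATLAB to a single computational thread and re-time the operation; if the double/single gap shrinks, multithreaded kernels are a likely culprit. A sketch (bg and fg are assumed to exist as in the earlier timings):

```matlab
% Temporarily restrict MATLAB to one computational thread.
nOld = maxNumCompThreads(1);   % returns the previous thread count
t1 = timeit(@() bg .* fg);     % re-run the timing of interest
maxNumCompThreads(nOld);       % restore the original setting
```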
That might explain some of the CPU usage patterns I observed when profiling the different operations. I was only logging average times, but when I ran the overall tests in 'double', I noticed it more frequently occupied more CPU cores. I haven't checked which cases exhibited that behavior, though (there were ~80 cases being tested sequentially).
This might be an idea to shelve for a couple years. It seems I'm almost always running an older version than the other students, and I'd hate to optimize something for myself that makes things worse for everyone else.
When I put together the information from http://www.agner.org/optimize/instruction_tables.pdf and https://www.felixcloutier.com/x86/index.html, I get the impression that for most processors in the x86 and x64 architectures, any difference in rates for signed multiplication (FMUL or IMUL) between single and double precision would be due entirely to whether 32 or 64 bits are transferred from memory; for integer data there would be an additional latency for conversion to floating point.
Addition looks like it can get pretty complicated, with numerous different modes related to various forms of packing, and related to fused instructions. It looks like there are different instructions for adding scalar single precision and for scalar double precision, but the latency tables give the same rates for single and double precision.
The material indicates that if NaN or Inf is part of the data, the computation can take up to 100 cycles longer.
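That NaN penalty is easy to probe directly from MATLAB. A sketch (the array size and NaN spacing are arbitrary assumptions):

```matlab
% Compare elementwise multiply on clean data vs data salted with NaNs.
A = rand(2000);  B = rand(2000);
Anan = A;  Anan(1:100:end) = NaN;   % sprinkle NaNs through the data
t_clean = timeit(@() A .* B);
t_nan   = timeit(@() Anan .* B);
fprintf('clean: %.2f ms, with NaN: %.2f ms\n', 1e3*t_clean, 1e3*t_nan);
```

Whether a penalty shows up depends on the instruction set in use; the large slowdowns described in those tables apply mainly to older x87-style code paths.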



Answer by Image Analyst on 11 Jul 2018.

According to Intel (from 10 years ago, so it may no longer hold today), double should be slower than single, but there are lots of "depends" and ways to speed it up. See the discussion for details.
