Speed up Matrix Subtraction for Euclidean Distance calculation

I have 2 matrices, both of them are feature matrices, M1 is 9216x26310, M2 is 9216x34000. each column represents 9216x1 represents one feature for a particular image, in this case 26310 images for the M1, 34000 for M2. I want to speed up the time for it to calculate euclidean distance, my current code takes average 1.3 secs or 1.4 secs per calculation.
tic;
for idx = 1:Number_of_Test_Images
for TrnIdx = 1 : Number_of_Train_Images
E_distance(TrnIdx) = sqrt(sum(abs((ftest(:,idx)-ftrain(:,TrnIdx)).^2)));
end
%
%[smallest_value, Idx_smallest] = min(E_distance);
%
Esimate_Test_labels(idx,1) = TrnLabels(Idx_smallest,1);
toc;
end
even if I replace the nested for loop with this
[smallest_value,Idx_smallest] = min(sum(abs(bsxfun(@minus,ftest(:,idx),ftrain))));
it still takes roughly the same time, although this one doesn't calculate sqrt.
Is there other way to speed this up to about 0.2 secs per calculation?
I have a i7 6700k CPU sadly no Nvidia GPU, owning AMD GPU. Else I can use gpuarray to speed things up.
Any ideas guys?
Both matrices are type double, I've tried to use sparse double on the my original code, my computer hangs every time after it reaches maximum memory usage, had to restart. and Sparse double matrices actually takes longer to compute.
Update! I've tried convert both matrices into single precision, and it runs faster, average of 0.63 secs per calculation by using the formula below, which runs the fastest compare to others.
I still need it to run faster, preferably under 0.2 secs if that's possible.
for TrnIdx = 1 : Number_of_Train_Images
E_distance(TrnIdx) = norm(ftest(:,idx)-ftrain(:,TrnIdx));
end
% runs averagely of 0.63secs which is the best at the moment, after converting both matrices from double to single, double runs at 1.05secs

14 Comments

Do you have a supported C compiler installed for building mex routines?
I think yes? I have tried to use mex function to build it in C language, But I get errors, saying input parameter size exceeds certain bytes. and I'm not even sure I wrote it correctly.
What are the timing results for pdist2 of the transpose of the matrices?
2.6 secs for using pdist2.
Assuming ‘ftest’ and ‘ftrain’ vectors are real-valued, then the ‘abs’ operation is totally unnecessary and a waste of time. The square of any real number is always non-negative.
Also you might try the default 2-norm operator in place of the computation you have there:
E_distance(TrnIdx) = norm(ftest(:,idx)-ftrain(:,TrnIdx));
It might be faster and gives the same answer.
Yes it does, but slightly, average 1.05 sec per calculation. But that shrink the computation time of the entire operation from 9hours+ to 7hours+
One bit I would suggest is to calculate the squared distance and then call sqrt on the result. Or at least sqrt blocks instead of individual calculations. This is to avoid the overhead of multiple calls, and to give a chance for SIMD instructions to be used.
E_distance(TrnIdx) = norm(ftest(:,idx)-ftrain(:,TrnIdx));
I'm currently using this code, gives the fastest result. average 1.05 - 1.1 secs. Entire operation would still take 7hours and 50mins+.
One thing I would suggest is to move ftest(:,idx) to before the loop:
this_ftest = ftest(:,idx);
for TrnIdx = 1 : Number_of_Train_Images
E_distance(TrnIdx) = norm(this_ftest - ftrain(:,TrnIdx));
end
The time you are measuring corresponds to 34000 norm calculations of length 9216 each. So that is 9216 subtractions each, 9216 squares, 9215 sums, and 1 square root, so a minimum of 9216*3 = 27648 floating point operations per norm, ignoring for the moment any memory access costs. 27648 * 34000 = 940032000 floating point operations. That's about a gigaflop. For that to happen in 0.2 seconds your system would have to be running at 5 gigaflop (even assuming that the squares and square roots took the same time as a additions and subtractions). The i7 6700k is rated at most 4 GHz, so at most 4 gigaflop. Therefore with the most efficient possible operations all perfectly pipelined, you cannot expect to do better than 0.25 seconds.
Unless, that is, you can take advantage of multiple cores. Which is plausible.
My timing tests with norm() suggest that for 34000 items of length 9216, you probably are not going to be able to do better than 0.46 seconds (unless you use multiple cores.)
This is correct. However, to achieve what he actually wants to do, he doesn't actually need to do that many subtractions. His original code is asking the wrong question, and finding the answer inefficiently on top of that.
The right question here is the nearest neighbor problem. Optimizations of queries of this nature (search) is a very well studied area, with many exact and approximate solutions.
I believe the MATLAB implementation is taking advantage of some tree structure to do a fast exact search, reducing the number of necessary comparisons by structuring the searched data in a particular way.
A very optimized approximate algorithm is FLANN which has MATLAB bindings. This should further reduce the execution time of the entire task to around 3-5 minutes, with a minor fraction of inaccurate results (corresponding to the 5th nearest neighbor instead of 1st, for example).
I do have multi cores, unless you're talking about parfor loop? Never mind that, I've got the answer already.
NS = createns(Ftrain');
Idx = knnsearch(NS,Ftest');
Which gives all the 26310 smallest indexes that I wanted. and It took only 2280 secs, which is 38 minutes.
Tested that code above with my code, gives the same result.
My testing with parfor and norm indicated it was about 7 times slower than a straight forward norm loop.

Sign in to comment.

 Accepted Answer

If you have the stats & ML toolbox:
NS = createns(M1'); % Create Search Object for The Train Set
Idx = knnsearch(NS,M2'); % Search Nearest Element of M1 for Each Element in M2
This should be lightning fast compared to your original code, probably would take around 20-30 minutes for the entire task. If you run out of memory, chunk the test matrix into smaller groups, and provide that as input, instead of looping over them one-by-one with i.

3 Comments

the Idx matrix I get from your code seems wrong, When I pass it a single feature ftest(:,1), it gives a Idx matrix of 34000x1 with all 1 filled in the matrix.
Are you sure you are using the Tranpose matrices? Each row should be an image, each column a dimension.
This works for me without any error.
Here:
A = rand(9216,26310);
>> B = rand(9216,10);
>> tic;NS = createns(A');toc;
Elapsed time is 1.410517 seconds.
>> tic; Idx = knnsearch(NS,B'); toc;
Elapsed time is 1.992841 seconds.
>> Idx
Idx =
15172
22166
1461
17563
6626
16701
12601
10224
16658
15376
Your code results:
tic;
for idx = 1:10
for TrnIdx = 1 : 26310
E_distance(TrnIdx) = sqrt(sum(abs((B(:,idx)-A(:,TrnIdx)).^2)));
end
[smallest_value, Idx_smallest] = min(E_distance);
YourMethodIDX(idx) = Idx_smallest;
end
toc;
Elapsed time is 11.048696 seconds.
YourMethodIDX'
ans =
15172
22166
1461
17563
6626
16701
12601
10224
16658
15376
Note that while the speed improvement here is marginal, it will be Orders of Magnitude faster if you input your entire matrix.
Thanks so much, now from from 4h30min+ to 38 minutes!!! Thanks so so much.

Sign in to comment.

More Answers (1)

You could try DNORM2( Download ), as applied to
E_distance = DNorm2( bsxfun(@minus,ftest(:,idx),ftrain) ,1) ; %Equation (*)
You could also try parallelizing the outer loop with parfor and maybe even the computation of E_distance as well (break ftrain into smaller parallel chunks and perform pieces of Equation (*) above on different workers).
Finally, if ftest and ftrain are type double, you might try moving to type single.

5 Comments

This is the message I get. ??? DNorm2: Cannot find compiler MEX file. Slow Matlab version is used!
already have DNorm2.m and DNorm2.c at the current folder.
Try installing a compiler, using mex -setup and then using mex to compile the code.
I'm running into tons of problem on installing that compiler. Matlab is really crappy at this huh.
No, it's usually not a big problem. Tech support might be able to help you with the installation.
2.1 Secs using DNorm2. Running out of ideas now.

Sign in to comment.

Asked:

TYS
on 30 Dec 2016

Commented:

on 31 Dec 2016

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!