Compute pairwise square Euclidean or Mahalanobis distances between points sets (fully optimized!).
This function computes pairwise distance between two sample sets and produce a matrix of square of Euclidean or Mahalanobis distances. The code is fully optimized by vectorization. Therefore it is much faster than the builtin function pdist.
When two matrices A and B are provided as input, this function computes the square Euclidean distances between them. If an extra positive definite matrix M is provided, it computes Mahalanobis distances.
If only one matrix A is provided, the function computes pairwise square Euclidean distances between vectors in A. In this case, it is equivalent to the square of pdist function in matlab statistics toolbox but much faster.
Sample code:
d=1000;n1=5000;n2=6000;
A=rand(d,n1);B=rand(d,n2);
M=rand(d,d);M=M*M'+eye(d);
D1=sqdist(A,B);
D2=sqdist(A);
D3=sqdist(A,B,M);
Detail explanation can be found in following blog post:
http://statinfer.wordpress.com/2011/11/14/efficientmatlabipairwisedistances/
This function is now a part of the PRML toolbox (http://www.mathworks.com/matlabcentral/fileexchange/55826patternrecognitionandmachinelearningtoolbox).
1.9  update description 

1.9  minor tweak 

1.9  Cleaning up 

1.6  update description 

1.5  update title and description 

1.4  remove any redundant error check 

1.3  update to support Mahalanobis distance. fix a bug for one dimensional case. 

1.2  Add a centerization step for robustness purpose. Split the code for different number of input for efficiency purpose. Update comments. 

1.1  update the description 
Inspired by: Pattern Recognition and Machine Learning Toolbox
Phu Lai (view profile)
Hi,
Do you have any suggestion of performing this function on a large matrix? Let's say it a 400000x4 matrix. I could not use this function on my large matrix due to exceeding array size limit.
Thanks.
Fabio (view profile)
hi, I have been using this convenient function for some time and I hit a bug I cannot understand: in certain cases some values of the squared distances matrix get very small but negative values, which is nasty because taking the sqrt delivers imaginary numbers, quite annoying if you expect e.g. an euclidean distance.
If you are interested, I can provide you the actual matrices of points from which this bug originates.
Cheers
Yuan (view profile)
muthu samy (view profile)
Michal Kvasnicka (view profile)
In MATLAB R2013a not anymore "much much faster" than pdist.
MyoungChul Park (view profile)
I use this function to calculate Euclidean Distance Matrix. It is accurate and fast. Some people complain because the function returns actually squared Euclidean distances. However, I think squared distances are more useful in theory (i.e. Euclidean Distance Matrix). Taking square root of EDM will be much faster than calculating actual distances using 'for' loop.
Joao Henriques (view profile)
Meh, this code is fine. Works as advertised. I don't think it is correct to pick on someone just because of some reasonable optimizations. You could've saved all the talk, point out that this returns squared distances and thus should be renamed, and the author would surely comply with your request.
Mo Chen (view profile)
One more word for input verification, you can not check every aspects of the inputs. For example, checking whether the input matirx is positive definite in this code is just crazy which will cost more time than the function itself. One must end up at some point between checking everything and checking nothing, which is a design desicion the coder should make.
In such a simple code, i dont want nasty guarding code, which be even longer than the main funcional code, distracting the reader's attention.
Mo Chen (view profile)
By the way, reading you review reminds me some review comments of some of my papers. Some reviewers just like to focus on whether the formate is right, whether the citation is right even whether the spell is right but not the idea of the paper itself. That is realy a pity.
Mo Chen (view profile)
If you have read the code STL of C++, you will find there is little if statements, and almost no runtime check for input. That is because it reduces the efficiency.
It is more severe in matlab, when you call a function a lot of times, such as kmeans (try to add some 'if' in the inner loop you will see). There are different programming styles: 1) make a lot redundant check in the beginning of every function which sacrify efficiency for robustness; 2) just make sure it works right when input is right. Assuring input correctness is the caller's resonsibility.
These different styles are just trading off between robustness and efficiency. You prefer the robust one, that's fine, just do it your way. I like my way better.
I'm trying provide some functionality and idea here. When I read other's code, I always feel those redundant inputs verification are misleading, which makes it hard to see what the function is acualling doing. If some one (like you) wants to get some safty, I'm sure it is a easy jod for him to add the 'if's himself, the code is there after all. But I like my code to be clean.
John D'Errico (view profile)
Improving. It now runs as claimed for the 1d case, so the bug is removed. I also like that this is called sqdistance. The problem with calling it distance is that it was not a distance. Better is to make that clear, that this returns the square of the Euclidean distance. So this is also good.
There is now even an attempt at error checking, in that the author uses the assert function to flag problems. However, assert allows you to provide TWO arguments. See what happens when I call sqdistance with improperly sized arrays:
sqdistance(rand(3,4),rand(2))
??? Error using ==> sqdistance at 16
Assertion failed.
All it tells me is "Assertion failed". For gods sake, what assertion? Use the second argument! Allow the code to exit gracefully and descriptively. Tell the user that the arguments are incompatible in size for this operation. To just tell them "Assertion failed" is silly. Why bother to do so?
Obviously from the authors last comment, this is just a code for his own academic purposes. So apparently it needs not be any good, or do what he claims it does. I'll argue that he is wrong.
When you post something on a website like this, hundreds or even many thousands of MATLAB users may use your code, or try to do so. They may look at it, hoping to learn something from what you post. So I'm sorry, but it would be a disservice on my part to any MATLAB user or student for me NOT to tell them that I see a problem, and what is wrong, and if possible, how the problem should best have been resolved (in my opinion.) And if this author has improved his code or coding style because of what I show to be wrong, then I've helped him too.
Mo Chen (view profile)
If you want the Euclidean distance itself, nobody prevents you from taking a simple sqrt on top of this function, it wont cost you a second. On the other hand, there are a lot of situations that the square distance is required (or sufficient) not the distance, such as KNN, Kmeans, Spherical Gaussian density, etc.
This is just code for academic purpose, if you feel helpful, just use it where it is suitable. I'm not making some industry product, so give me a break.
John D'Errico (view profile)
Somewhat better now in ONE respect. However, just because you have never suffered from the extreme case I showed does not mean that you have never had the problem. You've just never noticed it, or even known to look.
There is still a problem. Perhaps my long comments before were not extensive enough. This is NOT actually a valid distance metric. NOT. NOT. Read the definition of a distance metric. Here are a few sites:
http://www.mathreference.com/topms,dm.html
http://en.wikipedia.org/wiki/Metric_space
See that one requirement for a valid distance metric is the triangle inequality. The triangle inequality states that
D(x,y) <= D(x,z) + D(y,z)
Does distance satisfy that? TRY IT!
distance(1,5)
ans =
16
distance(1,3) + distance(5,3)
ans =
8
(distance(1,3) + distance(5,3)) > distance(1,5)
ans =
0
So this function fails to satisfy one of the basic requirements for a distance metric. This does NOT compute a distance. Just because you call the function distance does not make it a distance. The failure to compute the sqrt makes this not a valid distance.
Next, this function fails to work properly in one dimension!
distance([1,2])
ans =
0.5 1.5
1.5 0.5
I would have expected to see the matrix [0 1;1 0] returned as the result. (Surely you will concede this fact.) Don't complain that it was my suggested change that caused it to fail, because the original code fails too here, and by a larger margin.
originaldistance([1,2])
ans =
8 6
6 2
Finally, the changes to the help made it more accurate, but still fail to describe the behavior of this code. The help now states that when when called with two arguments, it returns the square of the interpoint Euclidean distances between columns of the matrices. It does not state at all what happens when called with only one argument.
Looking slightly deeper at the code pointed out serious flaws still. My rating is verging closer to one star at this point.
Mo Chen (view profile)
The speed gain is not that this code does not compute sqrt but that it has no for loops, which is the main purpose of this function: demostrating how to vectorized the code in such scenarios.
The suggestion of robustness is valuable, lthough I've never suffer the extreme case in my practice,
The code is updated to include the centerization step and fix the outdated comments.
John D'Errico (view profile)
The problem with this code is it is potentially inaccurate. It uses an identity that people are oh so impressed with, but squares the numbers unnecessarily. I'll talk about this problem eventually, but first, let me discuss another major flaw with this code.
It does not actually compute the Euclidean distance. It computes the square of that distance. As such, it has the wrong units. Don't forget that if your data has units miles, then all "distances" produced by this function will have units of miles squared.
Note that this code claims to produce the same numbers as does the pdist function. But it does not. pdist produces TRUE distances, not the square of the distance. There is a difference.
A = randn(3,5);
B = randn(3,4);
distance(A,B)
ans =
2.6333 7.7299 4.5349 4.6107
3.1854 4.636 9.1202 1.0953
2.277 9.2126 3.9594 9.9298
5.9439 1.416 6.8336 2.748
3.1344 0.96642 7.4918 1.6271
The true interpoint distances, as one will normally see by any correct tool available to compute distance is closer to this:
sqrt(distance(A,B))
ans =
1.6228 2.7803 2.1295 2.1472
1.7848 2.1531 3.02 1.0466
1.509 3.0352 1.9898 3.1512
2.438 1.19 2.6141 1.6577
1.7704 0.98306 2.7371 1.2756
I imagine the author will claim that by not computing the square root, it saves time. I suppose so, but to then claim that the result is a distance as most expect to see would be misleading. And surely to claim that it produces the same as other codes is very misleading.
A nice property for a distance measure to have is linearity. We would like to see that
distance(k*A,k*B) = k*distance(A,B)
It is something that pdist will give you. But this is not a property one will gain from distance.
On to the other problem, the one that I see as most damaging. Recall the results of the above experiment for distance(A,B). Now, try adding any fixed offset to both A and B. I'll add a large one to the matrices to show that complete trash is generated.
distance(A+1e8,B+1e8)
ans =
8 8 8 8
4 4 4 4
4 4 4 12
8 0 8 0
0 0 8 0
See that NO significant digits remain in the result.
A higher quality tool (like pdist for example) will survive even this extreme test quite nicely. As it turns out, this is not an extreme test for pdist. (Since pdist does only interpoint distances for a single matrix, these numbers will not be comparable. See only that adding 1e8 did not change any digits printed out.)
pdist(A'+1e8)
ans =
Columns 1 through 8
1.9891 1.8594 3.139 2.8911 2.7152 2.6316 1.5634 3.9224
Columns 9 through 10
3.1472 1.7458
pdist(A')
ans =
Columns 1 through 8
1.9891 1.8594 3.139 2.8911 2.7152 2.6316 1.5634 3.9224
Columns 9 through 10
3.1472 1.7458
I will point out that all of the above mentioned flaws could have been repaired with a few simple changes to the code. Had the code been written as follows, it would take only about 20% longer to execute in my quick test, but it would have worked properly and robustly.
if nargin == 1
B = A;
mu = mean(A,2);
else
mu = (mean(A,2) + mean(B,2))/2;
end
A = bsxfun(@minus,A,mu);
B = bsxfun(@minus,B,mu);
D = full(2*A'*B);
D = bsxfun(@plus,D,full(sum(B.^2)));
D = sqrt(bsxfun(@plus,D,full(sum(A.^2)')));
Finally, the help for this code is itself wrong. Here is what it says when you do "help distance".
Compute distances between all sample pairs
X: d x n data matrix
D: n x n pairwise distance matrix
Written by Mo Chen (mochen@ie.cuhk.edu.hk). March 2009.
See that it tells you that X is a d by n matrix of data. Does it tell you that the function actually accepts TWO arguments? That if there are two arguments, then it computes distances between all pairs of columns of the two matrices?
No error checks to verify the data actually conforms for distance computation. I'll be generous and give this a 2 star rating.