An efficient and accurate Inter-Point Distance Matrix
There are several utilities around to compute interpoint distances, but none of them did fully what I thought important. What were my goals?
1. Inter-point distances are sometimes computed within one set of points, or between two sets. So a tool must handle either case.
2. Efficiency is important, but a common method for inter-point (Euclidean) distances uses a trick that results in a loss of accuracy. The good thing is bsxfun allows us to compute distances both efficiently and accurately.
3. Many times we wish to compute an inter-point, but we only need some subset of the entire matrix. Then it might be nice to have a list of only the single nearest neighbor for each point in our set, or only the large or small distances beyond some limit.
4. Where appropriate, a sparse distance matrix might be useful.
5. Really large problems can sometimes be solved by breaking the problem into smaller chunks. IPDM does this where appropriate.
6. There are many special cases that can be solved efficiently. For example, to find the nearest neighbor for one dimensional data is a simple thing, costing no more than a sort. One does not need to compute all distances if only the closest point is of interest. (There are several other special cases that can be sped up, perhaps using k-d trees, or other algorithms. If interest is seen I'll try to provide them.)
All of these things and more are satisfied by IPDM. It uses a property/value pair interface to specify the options.
Find the nearest neighbors in 1-dimensional data:
A = randn(10000,1);
B = randn(15000,1);
tic,
d=ipdm(A,B,'subset','nearest');
toc
Elapsed time is 0.151346 seconds.
A note to those who might be worried about absolute speed on small sets of data. I've now considerably sped up the code for simple calls, reducing the basic overhead by a factor of roughly 4.
See the demo file for many examples of use.
1.1 | Fixed a bug for smallest/largest few cases |
Inspired by: distance.m, nearestpoint(x,y,m), nearestneighbour.m, Distance Matrix, Computing Pairwise Distances and Metrics
Inspired: Efficient K-Nearest Neighbor Search using JIT, near2, peterson-tim-j/Groundwater-Statistics-Toolbox, findclump
Maksim (view profile)
For (squared) Euclidean I was able to get a 100x speedup over this implementation using linear algebra operations. My code is available at github.com/maksimt/matlab_funcs
Jiali (view profile)
It is really efficient. I also like
ipdm([x y],'subset','nearestneighbor','result','struct'); because I can manipulate the structure for later use.
Chad Greene (view profile)
For an array of points given by x and y, this function quite elegantly and efficiently returns indices of each point's nearest neighbor like this:
ipdm([x y],'subset','nearestneighbor','result','struct');
Takes ~5 seconds to solve for 15,000 points on my laptop. That's quite fast.
Raldi (view profile)
Ben (view profile)
Great code. Thanks for your kind share!
oskenund (view profile)
Ryan (view profile)
Really useful, a very nice function that I use all the time.
John D'Errico (view profile)
If you wish to see a histogram of the distances, then just call hist on the output.
But your question seems not to be about that. If you actually want to see the distance between two histograms, then you need to define what the word "distance" means in that case.
Farie (view profile)
excellent...but I wonder can we do this distance on histogram?
Matti Jukola (view profile)
Thank you for this tool. I made few small performance tweaks to input parsing, these should make ipdm more efficient when running multiple times with small datasets:
Near line 292: changed ~ismember... check to:
if ~any(strcmpi({'all','nearestneighbor','farthestneighbor'},params.Subset)) && ...
Near line 326: again changed ~ismember... to:
elseif (length(params.Metric)~=1) || ~any(params.Metric==[0 1 2 inf])
In function 'parse_pv_pairs': changed last portion of the code to:
% there was at least one pv pair. process any supplied
propnames = fieldnames(params);
for i=1:n
p_i = pv_pairs{2*i-1};
v_i = pv_pairs{2*i};
ind = find(strcmpi(propnames,p_i));
if isempty(ind)
ind = find(strncmpi(p_i,propnames,length(p_i)));
if isempty(ind)
error(['No matching property found for: ',p_i])
elseif length(ind)>1
error(['Ambiguous property name: ',p_i])
end
end
% override the corresponding default in params.
% Use setfield for comptability issues with older releases.
params.(propnames{ind}) = v_i;
end
Wolfgang Schwanghart (view profile)
Thanks for this great contribution! Would it, however, be possible that ipdm (in case only data1 is provided) only returns the row and column indices and distances so that they are arranged in the order (distance(rowix,colix)): (2,1), (3,1), ..., (n,1), (3,2), ..., (n,2), ..., (n,n–1)). That is the way pdist (part of the statistics toolbox) returns the distance matrix in a vector to be more memory efficient. pdist, however, does not support distance subsets.
So far, I use this workaround in my code.
% remove distances were d.columnindex = d.rowindex
I = d.columnindex==d.rowindex;
d.columnindex(I) = [];
d.rowindex(I) = [];
d.distance(I) = [];
% remove double entry of pairs
[dtemp, m] = unique(sort([d.rowindex d.columnindex],2),'rows');
d.rowindex = dtemp(:,1);
d.columnindex = dtemp(:,2);
d.distance = d.distance(m);
This reduces the size of d by more than a half. But it doesn't avoid the generation of the huge struct returned by ipdm, which is a real memory bottleneck for me. Would it be possible to calculate this more efficient output within ipdm? So far I don't see a straightforward way, but perhaps you do?
Best regards,
Wolfgang
Yes, for small problems, the naive code can be more efficient. This tool has many options however, and just the task of parsing the possibilities takes a significant amount of time. So for a tiny problem, almost all the work is in the problem overhead. Note that for these small problems, the time required for ipdm is almost constant, roughly 0.0025 seconds.
For example, the following test shows the times required in a simple test. I computed the distances from each of 100 points to a set of N points (N is the first column) in 2 dimensions.
[N',ipdmtimes,looptimes]
ans =
2 0.0025589 0.0006432
10 0.0033113 0.00076637
50 0.0051397 0.0038982
100 0.0071307 0.0077328
500 0.02467 0.037465
1000 0.04776 0.07536
2000 0.092433 0.15087
Since the minimum overhead on my CPU to run ipdm is roughly 0.0025 seconds, the simple loop will be faster for small problems. As you can see, the loop is slower for the larger problems, anything larger than 100x100 in this case, at least on my CPU. Other tests show that the break-even point was indeed roughly 10,000 total distances to compute for a simple, full distance matrix.
What ipdm provides is the ability to generate a wide variety of distance matrices, especially when you only need some limited subset of the complete set of distances. If you only want a subset of the complete set of distances, but that complete set of distances would be too large to compute in the memory that you have available, then you may have no choice but to use a tool like this.
As far as the issue of memory goes, I get no memory errors for the problem size indicated. This just means that I've got more memory available to my system. However, ipdm also allows you to control the size of the chunks it uses, so Jim could have avoided that problem. The 'ChunkSize' parameter is available as a (currently undocumented) option in ipdm. It was not listed in the help, as I thought it would be of little common utility. Use of the 'ChunkSize' property is currently shown in the examples in the demo file. I'll add a description of 'ChunkSize' to the help for those with limited memory.
I am very confused. This should be efficient routine, and I replaced my own very naive (not vectorized, loop-in-loop approach) distance calculation with IPDM call, and my code slowed down essentially.
Little play with tic-toc timer gave me surprising results, I calculated distance from one point to several points. With only 10 points naive approach was 7.5 times faster, with 100 points 2.5 times, and around 500 points they were equal.
On the other hand, while calculating whole point matrix (distance from all points to other points) of point set, IPDM will give out of memory error already with 1400 two dimensional points. With naive approach, calcuation of 1300 points takes appr 10 times longer than with IPDM, but on the other hand, naive approach can handle far larger point sets.
So, as there seems to be some cases when this implementation is inferior to naive approach, I dont understand what was the talk about efficiency.
since the cyclomatic complexity of this submission is already 106 (!!) i refrain from any more redundancy and simply add - good stuff, john: help/examples/computational engine
us
since the cyclomatic complexity of this submission is already 106 (!!) i refrain from any more redundancy and simply add - good stuff, john: help/examples/computational engine
us
Great submission. If only all submissions were of this quality in terms of documentation, features, and coding. If only submissions like this one were immune to 1 star ratings by vindictive reviewers.