4.0

4.0 | 3 ratings Rate this file 224 downloads (last 30 days) File Size: 104.7 KB File ID: #18937

IPDM: Inter-Point Distance Matrix

by John D'Errico

 

26 Feb 2008 (Updated 21 Apr 2008)

Code covered by BSD License  

An efficient and accurate Inter-Point Distance Matrix

Download Now | Watch this File

File Information
Description

There are several utilities around to compute interpoint distances, but none of them did fully what I thought important. What were my goals?

1. Inter-point distances are sometimes computed within one set of points, or between two sets. So a tool must handle either case.

2. Efficiency is important, but a common method for inter-point (Euclidean) distances uses a trick that results in a loss of accuracy. The good thing is bsxfun allows us to compute distances both efficiently and accurately.

3. Many times we wish to compute an inter-point, but we only need some subset of the entire matrix. Then it might be nice to have a list of only the single nearest neighbor for each point in our set, or only the large or small distances beyond some limit.

4. Where appropriate, a sparse distance matrix might be useful.

5. Really large problems can sometimes be solved by breaking the problem into smaller chunks. IPDM does this where appropriate.

6. There are many special cases that can be solved efficiently. For example, to find the nearest neighbor for one dimensional data is a simple thing, costing no more than a sort. One does not need to compute all distances if only the closest point is of interest. (There are several other special cases that can be sped up, perhaps using k-d trees, or other algorithms. If interest is seen I'll try to provide them.)

All of these things and more are satisfied by IPDM. It uses a property/value pair interface to specify the options.

Find the nearest neighbors in 1-dimensional data:

A = randn(10000,1);
B = randn(15000,1);

tic,
d=ipdm(A,B,'subset','nearest');
toc

Elapsed time is 0.151346 seconds.

A note to those who might be worried about absolute speed on small sets of data. I've now considerably sped up the code for simple calls, reducing the basic overhead by a factor of roughly 4.

See the demo file for many examples of use.

Acknowledgements

The author wishes to acknowledge the following in the creation of this submission:
distance.m, NEARESTPOINT, nearestneighbour.m, Distance Matrix, Computing Pairwise Distances and Metrics
This submission has inspired the following:
Efficient K-Nearest Neighbor Search using JIT, Experimental (Semi-) Variogram

MATLAB release MATLAB 7.4 (R2007a)
Zip File Content  
Published M Files demo_ipdm
Other Files IPDM/demo_ipdm.m,
IPDM/ipdm.m,
IPDM/html/demo_ipdm.png,
IPDM/ipdm.jpg,
IPDM/html/demo_ipdm_01.png
Tags for This File  
Everyone's Tags
Tags I've Applied
Add New Tags Please login to tag files.
Comments and Ratings (6)
27 Feb 2008 Duane Hanselman

Great submission. If only all submissions were of this quality in terms of documentation, features, and coding. If only submissions like this one were immune to 1 star ratings by vindictive reviewers.

05 Mar 2008 Urs (us) Schwarz

since the cyclomatic complexity of this submission is already 106 (!!) i refrain from any more redundancy and simply add - good stuff, john: help/examples/computational engine
us

05 Mar 2008 Urs (us) Schwarz

since the cyclomatic complexity of this submission is already 106 (!!) i refrain from any more redundancy and simply add - good stuff, john: help/examples/computational engine
us

20 Apr 2008 Jim Stevens

I am very confused. This should be efficient routine, and I replaced my own very naive (not vectorized, loop-in-loop approach) distance calculation with IPDM call, and my code slowed down essentially.
Little play with tic-toc timer gave me surprising results, I calculated distance from one point to several points. With only 10 points naive approach was 7.5 times faster, with 100 points 2.5 times, and around 500 points they were equal.
On the other hand, while calculating whole point matrix (distance from all points to other points) of point set, IPDM will give out of memory error already with 1400 two dimensional points. With naive approach, calcuation of 1300 points takes appr 10 times longer than with IPDM, but on the other hand, naive approach can handle far larger point sets.

So, as there seems to be some cases when this implementation is inferior to naive approach, I dont understand what was the talk about efficiency.

20 Apr 2008 John D'Errico

Yes, for small problems, the naive code can be more efficient. This tool has many options however, and just the task of parsing the possibilities takes a significant amount of time. So for a tiny problem, almost all the work is in the problem overhead. Note that for these small problems, the time required for ipdm is almost constant, roughly 0.0025 seconds.

For example, the following test shows the times required in a simple test. I computed the distances from each of 100 points to a set of N points (N is the first column) in 2 dimensions.

[N',ipdmtimes,looptimes]
ans =
            2 0.0025589 0.0006432
           10 0.0033113 0.00076637
           50 0.0051397 0.0038982
          100 0.0071307 0.0077328
          500 0.02467 0.037465
         1000 0.04776 0.07536
         2000 0.092433 0.15087

Since the minimum overhead on my CPU to run ipdm is roughly 0.0025 seconds, the simple loop will be faster for small problems. As you can see, the loop is slower for the larger problems, anything larger than 100x100 in this case, at least on my CPU. Other tests show that the break-even point was indeed roughly 10,000 total distances to compute for a simple, full distance matrix.

What ipdm provides is the ability to generate a wide variety of distance matrices, especially when you only need some limited subset of the complete set of distances. If you only want a subset of the complete set of distances, but that complete set of distances would be too large to compute in the memory that you have available, then you may have no choice but to use a tool like this.

As far as the issue of memory goes, I get no memory errors for the problem size indicated. This just means that I've got more memory available to my system. However, ipdm also allows you to control the size of the chunks it uses, so Jim could have avoided that problem. The 'ChunkSize' parameter is available as a (currently undocumented) option in ipdm. It was not listed in the help, as I thought it would be of little common utility. Use of the 'ChunkSize' property is currently shown in the examples in the demo file. I'll add a description of 'ChunkSize' to the help for those with limited memory.

12 Aug 2009 Wolfgang Schwanghart

Thanks for this great contribution! Would it, however, be possible that ipdm (in case only data1 is provided) only returns the row and column indices and distances so that they are arranged in the order (distance(rowix,colix)): (2,1), (3,1), ..., (n,1), (3,2), ..., (n,2), ..., (n,n–1)). That is the way pdist (part of the statistics toolbox) returns the distance matrix in a vector to be more memory efficient. pdist, however, does not support distance subsets.

So far, I use this workaround in my code.

% remove distances were d.columnindex = d.rowindex
I = d.columnindex==d.rowindex;
d.columnindex(I) = [];
d.rowindex(I) = [];
d.distance(I) = [];

% remove double entry of pairs
[dtemp, m] = unique(sort([d.rowindex d.columnindex],2),'rows');
d.rowindex = dtemp(:,1);
d.columnindex = dtemp(:,2);
d.distance = d.distance(m);

This reduces the size of d by more than a half. But it doesn't avoid the generation of the huge struct returned by ipdm, which is a real memory bottleneck for me. Would it be possible to calculate this more efficient output within ipdm? So far I don't see a straightforward way, but perhaps you do?

Best regards,
Wolfgang

Please login to add a comment or rating.
Updates
21 Apr 2008

Speed enhancements made for simple calls when no properties are set, also documented the 'ChunkSize' property.

Tag Activity for this File
Tag Applied By Date/Time
interpoint John D'Errico 22 Oct 2008 09:50:43
distance John D'Errico 22 Oct 2008 09:50:43
euclidean John D'Errico 22 Oct 2008 09:50:43
nearest neighbor John D'Errico 22 Oct 2008 09:50:43
metric John D'Errico 22 Oct 2008 09:50:43
 

MATLAB Central Terms of Use

NOTICE: Any content you submit to MATLAB Central, including personal information, is not subject to the protections which may be afforded information collected under other sections of The MathWorks, Inc. Web site. You are entirely responsible for all content that you upload, post, e-mail, transmit or otherwise make available via MATLAB Central. The MathWorks does not control the content posted by visitors to MATLAB Central and, does not guarantee the accuracy, integrity, or quality of such content. Under no circumstances will The MathWorks be liable in any way for any content not authored by The MathWorks, or any loss or damage of any kind incurred as a result of the use of any content posted, e-mailed, transmitted or otherwise made available via MATLAB Central. Read the complete Terms prior to use.

Contact us at files@mathworks.com