Path: news.mathworks.com!not-for-mail
From: "Pekka " <pekka.nospam.kumpulainen@tut.please.fi>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Finding similar entries
Date: Tue, 26 Feb 2008 18:00:22 +0000 (UTC)
Organization: Tampere University of Technology
Lines: 56
Message-ID: <fq1k3m$jl$1@fred.mathworks.com>
References: <fq0ple$qos$1@fred.mathworks.com> <fq17qb$2b2$1@fred.mathworks.com>
Reply-To: "Pekka " <pekka.nospam.kumpulainen@tut.please.fi>
NNTP-Posting-Host: webapp-02-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1204048822 629 172.30.248.37 (26 Feb 2008 18:00:22 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Tue, 26 Feb 2008 18:00:22 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 218565
Xref: news.mathworks.com comp.soft-sys.matlab:453813


"John D'Errico" <woodchips@rochester.rr.com> wrote in 
message <fq17qb$2b2$1@fred.mathworks.com>...
> "Daniel " <daniel4738@hotmail.com> wrote in message 
> <fq0ple$qos$1@fred.mathworks.com>...
> > I have a problem I can't seem to find the solution to. It's
> > relatively easy.
> > 
> > I have a collection of 25000 observations of 8 variables.
> > 
> > I want to find entries which are similar to each other.
> > There must be an easy way, can someone perhaps suggest
> > something?
> > 
> > i.e. each entry is a galaxy with 8 parameters, I want to
> > find a galaxy which has similar properties to one I select.
> 
> The simple solution is to compute an interpoint
> distance matrix. There are several such tools on
> the file exchange, or use pdist from the stats TB.
> But these will fail on a 25000 point set.
> 
> I've written a code that allows you to find only
> those distances below some limit, or only the
> single nearest neighbor. I'd been planning on
> putting it on the file exchange when I got a
> round tuit. I'll do so today. E-mail me if you
> want it sooner.
> 
> John

If you only need the similarity to the one you have
selected, then you don't need the interpoint distance
matrix. The distances from the selected one to all the
others are enough.
If x is the 25000 by 8 data and myx is the selected row (1 by 8):
dist = bsxfun(@minus,x,myx); % difference from myx, row by row
% Euclidean distance, for example
Ed = sqrt(sum(dist.^2,2));
Then it is up to you to decide what is close enough for you:
[sEd,ind] = sort(Ed);
sorts the distances in ascending order, so the first entries
of ind are the rows of x most similar to myx.
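
Put together, the steps above could look roughly like this. It is only a
sketch: the random data stands in for the real 25000 by 8 table, and the
names sel, k, nearestIdx and nearestDist are my own choices, not anything
from the original question.

```matlab
% Stand-in data (assumption): replace with the real 25000 by 8 matrix.
x   = randn(25000, 8);
sel = 123;                        % hypothetical row index of the selected entry
myx = x(sel, :);                  % the selected observation, 1 by 8

% Difference between every row of x and myx (bsxfun expands myx over rows).
dist = bsxfun(@minus, x, myx);

% Euclidean distance from each observation to myx, 25000 by 1.
Ed = sqrt(sum(dist.^2, 2));

% Sort ascending so the most similar observations come first.
[sEd, ind] = sort(Ed);

k = 10;                           % hypothetical number of neighbours to keep
nearestIdx  = ind(1:k);           % row indices into x
nearestDist = sEd(1:k);           % corresponding distances
```

If myx is itself a row of x, as here, the first hit is that row with
distance 0, so the interesting neighbours start at ind(2).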

Another solution is k-means clustering, which does not need
the huge interpoint distance matrix. It is included in the
Statistics Toolbox. If you don't have that, a free k-means
implementation is available, for example in the SOM Toolbox:
www.cis.hut.fi/projects/somtoolbox/
But even if two points are in the same cluster, they are not
necessarily very similar. You would still need to calculate
the similarity somehow, so I would go for the direct distance
measure.
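
For comparison, the clustering route might be sketched as below. This
assumes the Statistics Toolbox kmeans function is available; the number of
clusters and the variable names are arbitrary choices of mine, and the
random data again only stands in for the real table.

```matlab
% Stand-in data (assumption): replace with the real 25000 by 8 matrix.
x   = randn(25000, 8);
sel = 123;                        % hypothetical row index of the selected entry

nClusters = 20;                   % hypothetical; has to be tuned for real data
idx = kmeans(x, nClusters);       % cluster label for every row of x

% Rows that ended up in the same cluster as the selected entry.
members = find(idx == idx(sel));

% Same cluster does not mean similar, so a distance is still needed
% to rank the members against the selected entry.
d  = bsxfun(@minus, x(members, :), x(sel, :));
Ed = sqrt(sum(d.^2, 2));
[sEd, order] = sort(Ed);
closest = members(order);         % cluster members, most similar first
```

Note that this only ranks points inside one cluster. A near neighbour that
happens to sit just across a cluster boundary is missed, which is one more
reason to prefer the direct distance.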