If the 8 variables are real-valued (i.e. none of them is
categorical), then you might want to consider a clustering
(unsupervised classification) algorithm.
For example, look for the "cluster" keyword in the File
Exchange.
Regards
Carlos
"Daniel " <daniel4738@hotmail.com> wrote in message
<fq0ple$qos$1@fred.mathworks.com>...
> I have a problem I can't seem to find the solution to. It's
> relatively easy.
>
> I have a collection of 25000 observations of 8 variables.
>
> I want to find entries which are similar to each other.
> There must be an easy way, can someone perhaps suggest
> something?
>
> i.e. each entry is a galaxy with 8 parameters, I want to
> find a galaxy which has similar properties to one I select.
It depends on what you mean by 'similar' - you need to
decide an appropriate metric. Similar = small Euclidean
distance would probably be the simplest. See
doc pdist
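As a small illustration of the metric choice, here is a sketch using pdist (Statistics Toolbox) on a tiny made-up data set. The numbers and variable names are hypothetical. The columns are standardized first on the assumption that the parameters have different units, which the original post does not say.

```matlab
% Toy data: 5 "galaxies" with 3 parameters each (made-up numbers).
X = [1.0 200 0.1;
     1.1 210 0.1;
     5.0 900 0.9;
     0.9 190 0.2;
     5.2 880 0.8];

% Assumption: the parameters have different scales, so standardize each column.
mu = mean(X,1);
sd = std(X,0,1);
Z  = bsxfun(@rdivide, bsxfun(@minus, X, mu), sd);

% pdist returns the n*(n-1)/2 pairwise distances as a row vector;
% squareform expands that into a symmetric n-by-n matrix.
D = squareform(pdist(Z,'euclidean'));

% Nearest neighbour of galaxy 1: ignore the zero on the diagonal.
d1 = D(1,:);
d1(1) = Inf;
[dmin, nearest] = min(d1);
```

This only makes sense for small sets; for 25000 points the pairwise result becomes very large.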
"Daniel " <daniel4738@hotmail.com> wrote in message
<fq0ple$qos$1@fred.mathworks.com>...
The simple solution is to compute an interpoint
distance matrix. There are several such tools on
the file exchange, or use pdist from the stats TB.
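But these will probably run out of memory on a 25000-point
set: even the condensed output of pdist holds 25000*24999/2
distances, about 2.5 GB as doubles.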
I've written a code that allows you to find only
those distances below some limit, or only the
single nearest neighbor. I'd been planning on
putting it on the file exchange when I got a
round tuit. I'll do so today. E-mail me if you
want it sooner.
"John D'Errico" <woodchips@rochester.rr.com> wrote in
message <fq17qb$2b2$1@fred.mathworks.com>...
If you only need the similarity to the one you have
selected, then you don't need the interpoint distance
matrix. The distance to the selected one should be enough.
If x is the 25000-by-8 data and myx is the selected point (1 by 8):
dist = bsxfun(@minus,x,myx);
% Euclidean distance, for example
Ed = sqrt(sum(dist.^2,2));
Then it is up to you to decide what is close enough. Sort with
[sEd,ind] = sort(Ed);
and pick the small ones.
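Put together as something runnable, the idea might look like the sketch below. The data, the selected row sel and the count k are made up for illustration; replace X with the real 25000-by-8 matrix.

```matlab
% Hypothetical data standing in for the real galaxy catalogue.
X   = randn(25000,8);
sel = 123;                       % row index of the selected galaxy (made up)
k   = 10;                        % how many similar galaxies to return

% Squared Euclidean distance from the selected row to every row.
d2 = sum(bsxfun(@minus, X, X(sel,:)).^2, 2);
d2(sel) = Inf;                   % exclude the selected galaxy itself

[sd2, ind]  = sort(d2);          % ascending: most similar first
nearestIdx  = ind(1:k);          % rows of X closest to the selected one
nearestDist = sqrt(sd2(1:k));    % take sqrt only for the k we keep
```

Setting d2(sel) to Inf matters when the selected point is itself a row of the data; otherwise it would always come out first with distance zero.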
Another solution is k-means clustering, which does not need
the huge interpoint distance matrix. It is included in the
Statistics Toolbox. If you don't have that, k-means is also
available for free, at least in the SOM Toolbox:
www.cis.hut.fi/projects/somtoolbox/
But even if two points are in the same cluster, they are not
necessarily very similar. You would still need to calculate
the similarity somehow, so I would go for a direct distance
measure.
"Pekka " <pekka.nospam.kumpulainen@tut.please.fi> wrote in message
<fq1k3m$jl$1@fred.mathworks.com>...
> (Snip)
> If you only need the similarity to the one you have
> selected, then you don't need the interpoint distance
> matrix.
But that's the thing. The OP IS asking to find
entries that are similar to each other, not similar
to a given point.
This does require more work. It's also why I
posted ipdm on the File Exchange today, as
it can find only some restricted subset of
points, such as the 1000 closest points, or
only those with a distance less than some
limiting value, or only the nearest neighbor
to each point.
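For readers without ipdm, the "nearest neighbor to each point" case can be sketched in base MATLAB by working through the data in blocks of rows, so that only a blk-by-n slice of the distance matrix exists at any time. This is not ipdm's code or interface, just an illustration with made-up data and a hypothetical block size.

```matlab
% Made-up, smaller data set so the example runs quickly.
X   = randn(2000,8);
n   = size(X,1);
blk = 500;                        % rows per block; memory is about blk*n*8 bytes

nnIdx  = zeros(n,1);              % index of each point's nearest neighbour
nnDist = zeros(n,1);              % distance to that neighbour
sq = sum(X.^2,2);                 % squared norms, reused for every block

for s = 1:blk:n
    r = s:min(s+blk-1,n);         % rows handled in this block
    % Squared distances via |a-b|^2 = |a|^2 + |b|^2 - 2*a*b'
    D2 = bsxfun(@plus, sq(r), sq') - 2*(X(r,:)*X');
    D2 = max(D2,0);               % clip small negatives caused by round-off
    % Exclude each point's distance to itself.
    D2(sub2ind(size(D2), 1:numel(r), r)) = Inf;
    [m, j] = min(D2,[],2);
    nnIdx(r)  = j;
    nnDist(r) = sqrt(m);
end
```

For the full 25000-point set, a block of 500 rows needs roughly 100 MB for D2.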
The main difficulty of the problem is that storing the
distances of all pairs as doubles requires more than 2 GB of
memory: (25000*24999)/2*8 = 2.4999e9 bytes. However, the
original problem is only to find which pairs are similar.
Hence we can calculate the distances (even without using
sqrt) from one observation to all others, judge which of
them are similar according to some criterion, and store only
the indices of the similar ones. For example:
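X=randn(25000,8);   % example data: 25000 observations of 8 variables
P=cell(25000,1);    % P{k} will hold the indices of points similar to point k
tol=0.1;            % threshold on the SQUARED Euclidean distance
tic
for k=1:25000
    % squared distance from observation k to every observation (no sqrt)
    x=sum((X(k+zeros(25000,1),:)-X).^2,2);
    P{k}=find(x<tol);   % always includes k itself, since x(k) is 0
end
toc
whos P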
Elapsed time is 250.986904 seconds
  Name      Size          Bytes  Class    Attributes
  P         1x25000     1700048  cell
In this example, P takes only about 1.7 MB of memory, but
includes the indices of all similar pairs.
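A possible variation on the loop above stores each similar pair only once and never pairs a point with itself, by comparing point k only with the points after it. The data size, the threshold and the planted near-duplicate are made up for illustration.

```matlab
X = randn(2000,8);                % smaller made-up data so it runs quickly
X(2,:) = X(1,:) + 0.01;           % plant one near-duplicate so the result is non-empty
n    = size(X,1);
tol  = 0.5;                       % threshold on the Euclidean distance (assumed)
tol2 = tol^2;                     % compare squared distances, so no sqrt is needed

pairs = cell(n-1,1);
for k = 1:n-1
    % Compare point k only with points k+1..n, so each pair appears once
    % and a point is never paired with itself.
    d2 = sum(bsxfun(@minus, X(k+1:n,:), X(k,:)).^2, 2);
    j  = find(d2 < tol2) + k;     % shift back to row numbers of X
    pairs{k} = [repmat(k,numel(j),1), j];
end
pairs = vertcat(pairs{:});        % m-by-2 list of [i j] with i < j
```

Each row of pairs is then one similar pair, which is usually easier to work with than a cell array that lists every pair twice.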
"Yi Cao" <y.cao@cranfield.ac.uk> wrote in message
<fq1s5q$oot$1@fred.mathworks.com>...
> The main difficulty of the problem is that to fully store
> distances of all pairs requires more than 2 Gb memory:
> (25000*24999)/2*8=2.4999e9.
No need to compute distance for all pairs, Delaunay
triangulation is the trick.
"Bruno Luong" <b.luong@fogale.fr> wrote in message
<fq1t1t$9ql$1@fred.mathworks.com>...
I want to find the similarity between all pairs. Could you
please tell me how to implement Delaunay triangulation in MATLAB?
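MATLAB already provides Delaunay routines, so nothing has to be implemented by hand: delaunay handles 2-D points and delaunayn handles N-D points. The sketch below uses made-up 2-D points and treats the triangulation edges as candidate neighbour pairs, on the basis that each point's nearest neighbour is joined to it by a Delaunay edge. Whether this is practical for 25000 points in 8 dimensions is an open assumption, since the number of simplices grows quickly with dimension; it may be worth trying on a subset first.

```matlab
P = rand(200,2);                  % made-up 2-D points for illustration
T = delaunayn(P);                 % each row of T lists the vertices of one simplex

% Collect every edge of every simplex as a candidate neighbour pair.
nv = size(T,2);                   % vertices per simplex (3 in 2-D)
E  = zeros(0,2);
for a = 1:nv-1
    for b = a+1:nv
        E = [E; T(:,[a b])];      %#ok<AGROW>
    end
end
E = unique(sort(E,2),'rows');     % each undirected edge once, as [i j] with i < j

% Edge lengths: distances are computed only for these candidate pairs.
len = sqrt(sum((P(E(:,1),:) - P(E(:,2),:)).^2, 2));
```

Note that two points can be connected by an edge and still be far apart, so a threshold on len is still needed to decide which pairs count as similar.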