Thread Subject: Finding similar entries

Subject: Finding similar entries

From: Daniel

Date: 26 Feb, 2008 10:29:02

Message: 1 of 10

I have a problem I can't seem to find the solution to. It's
relatively easy.

I have a collection of 25000 observations of 8 variables.

I want to find entries which are similar to each other.
There must be an easy way, can someone perhaps suggest
something?

i.e. each entry is a galaxy with 8 parameters, I want to
find a galaxy which has similar properties to one I select.

Subject: Re: Finding similar entries

From: carlos lopez

Date: 26 Feb, 2008 11:56:01

Message: 2 of 10

If the 8 variables are real-valued (i.e. no category within
the set) then you might want to consider any algorithm for
classification.
For example, look for the "cluster" keyword in the File
Exchange.
Regards
Carlos

Subject: Re: Finding similar entries

From: Simon Preston

Date: 26 Feb, 2008 12:01:03

Message: 3 of 10

"Daniel " <daniel4738@hotmail.com> wrote in message
<fq0ple$qos$1@fred.mathworks.com>...
> I have a problem I can't seem to find the solution to. It's
> relatively easy.
>
> I have a collection of 25000 observations of 8 variables.
>
> I want to find entries which are similar to each other.
> There must be an easy way, can someone perhaps suggest
> something?
>
> i.e. each entry is a galaxy with 8 parameters, I want to
> find a galaxy which has similar properties to one I select.

It depends on what you mean by 'similar': you need to
decide on an appropriate metric. Taking similar to mean a small
Euclidean distance would probably be the simplest. See
doc pdist
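
As a minimal sketch of what pdist gives you (it needs the Statistics Toolbox), here is a toy example. The data and the standardization step are my own assumptions for illustration: if the 8 parameters have very different units or ranges, you may want to rescale each column before measuring Euclidean distance, otherwise the largest-valued parameter tends to dominate.

```matlab
% Toy data: 3 made-up observations of 2 variables (not the galaxy data)
X = [0 0; 3 4; 6 8];

% Condensed vector of pairwise Euclidean distances,
% ordered as pairs (1,2), (1,3), (2,3)
d = pdist(X);

% Same distances as a full symmetric 3-by-3 matrix with a zero diagonal
D = squareform(d);

% Optional: standardize each column to zero mean and unit standard
% deviation so that no single variable dominates the distance
mu = mean(X, 1);
sd = std(X, 0, 1);
Xs = bsxfun(@rdivide, bsxfun(@minus, X, mu), sd);
ds = pdist(Xs);
```

Whether to standardize, or to use another metric altogether, depends on what "similar" should mean for your data.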

Best wishes, S

Subject: Re: Finding similar entries

From: John D'Errico

Date: 26 Feb, 2008 14:30:36

Message: 4 of 10

"Daniel " <daniel4738@hotmail.com> wrote in message
<fq0ple$qos$1@fred.mathworks.com>...
> I have a problem I can't seem to find the solution to. It's
> relatively easy.
>
> I have a collection of 25000 observations of 8 variables.
>
> I want to find entries which are similar to each other.
> There must be an easy way, can someone perhaps suggest
> something?
>
> i.e. each entry is a galaxy with 8 parameters, I want to
> find a galaxy which has similar properties to one I select.

The simple solution is to compute an interpoint
distance matrix. There are several such tools on
the File Exchange, or use pdist from the stats TB.
But these will likely run out of memory on a 25000
point set, since every pairwise distance is stored.

I've written code that lets you find only
those distances below some limit, or only the
single nearest neighbor. I'd been planning on
putting it on the File Exchange when I got a
round tuit. I'll do so today. E-mail me if you
want it sooner.

John

Subject: Re: Finding similar entries

From: Pekka

Date: 26 Feb, 2008 18:00:22

Message: 5 of 10

"John D'Errico" <woodchips@rochester.rr.com> wrote in
message <fq17qb$2b2$1@fred.mathworks.com>...
> "Daniel " <daniel4738@hotmail.com> wrote in message
> <fq0ple$qos$1@fred.mathworks.com>...
> > I have a problem I can't seem to find the solution to. It's
> > relatively easy.
> >
> > I have a collection of 25000 observations of 8 variables.
> >
> > I want to find entries which are similar to each other.
> > There must be an easy way, can someone perhaps suggest
> > something?
> >
> > i.e. each entry is a galaxy with 8 parameters, I want to
> > find a galaxy which has similar properties to one I select.
>
> The simple solution is to compute an interpoint
> distance matrix. There are several such tools on
> the file exchange, or use pdist from the stats TB.
> But these will fail on a 25000 point set.
>
> I've written a code that allows you to find only
> those distances below some limit, or only the
> single nearest neighbor. I'd been planning on
> putting it on the file exchange when I got a
> round tuit. I'll do so today. E-mail me if you
> want it sooner.
>
> John

If you only need the similarity to the one you have
selected, then you don't need the interpoint distance
matrix. The distance to the selected one should be enough.
If x is the 25000 by 8 data and myx is the selected entry (1 by 8):

dist = bsxfun(@minus,x,myx);
% Euclidean distance, for example
Ed = sqrt(sum(dist.^2,2));

Then it is up to you to decide what is close enough:
[sEd,ind] = sort(Ed);
and pick the smallest ones.
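
Put together as a small runnable sketch, with made-up data and a hypothetical k (the number of neighbours to keep) that you would choose yourself:

```matlab
% Made-up data: one observation per row (stand-in for the 25000-by-8 set)
x   = [0 0; 1 0; 0 2; 5 5; 0.5 0];
myx = [0 0];     % the selected entry, 1-by-(number of variables)
k   = 3;         % hypothetical choice: keep the 3 closest entries

% Difference between every row of x and the selected entry
dist = bsxfun(@minus, x, myx);

% Euclidean distance from each observation to the selected entry
Ed = sqrt(sum(dist.^2, 2));

% Sort ascending; ind maps sorted positions back to rows of x
[sEd, ind] = sort(Ed);

nearest     = ind(1:k);   % row indices of the k closest observations
nearestDist = sEd(1:k);   % their distances
```

If myx is itself one of the rows of x, the first hit is that row at distance 0, so in that case you would skip it and take ind(2:k+1) instead.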

Another solution is k-means clustering, which does not need
the huge interpoint distance matrix. It is included in the
Statistics Toolbox. If you don't have that, k-means is also
available for free, at least in the SOM Toolbox:
www.cis.hut.fi/projects/somtoolbox/
But even if two points are in the same cluster, they are not
necessarily very similar. You would still need to calculate
the similarity somehow, so I would go for a direct distance
measure.



Subject: Re: Finding similar entries

From: John D'Errico

Date: 26 Feb, 2008 19:10:04

Message: 6 of 10

"Pekka " <pekka.nospam.kumpulainen@tut.please.fi> wrote in message
<fq1k3m$jl$1@fred.mathworks.com>...
> "John D'Errico" <woodchips@rochester.rr.com> wrote in
> message <fq17qb$2b2$1@fred.mathworks.com>...
> > "Daniel " <daniel4738@hotmail.com> wrote in message
> > <fq0ple$qos$1@fred.mathworks.com>...
> > > I have a problem I can't seem to find the solution to. It's
> > > relatively easy.
> > >
> > > I have a collection of 25000 observations of 8 variables.
> > >
> > > I want to find entries which are similar to each other.
> > > There must be an easy way, can someone perhaps suggest
> > > something?

(Snip)

> If you only need the similarity to the one you have
> selected, then you don't need the interpoint distance
> matrix.

But that's the thing. The OP IS asking to find
entries that are similar to each other, not similar
to a given point.

This does require more work. It's also why I
posted ipdm on the File Exchange today, as
it can find only some restricted subset of
points, such as the 1000 closest points, or
only those with a distance less than some
limiting value, or only the nearest neighbor
to each point.
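
To illustrate the "nearest neighbor to each point" case without ever holding the full distance matrix, here is a plain-MATLAB sketch that works through the data in blocks. It is not ipdm and says nothing about ipdm's interface; the data and the block size blk are made up for illustration.

```matlab
% Made-up data: one observation per row
X   = [0 0; 1 0; 5 5; 5 6; 0 0.5];
n   = size(X, 1);
blk = 2;                 % hypothetical block size; use a larger one in practice

nnIdx  = zeros(n, 1);    % index of each point's nearest neighbour
nnDist = zeros(n, 1);    % distance to that neighbour
sq     = sum(X.^2, 2);   % squared norm of every row

for s = 1:blk:n
    r = s:min(s + blk - 1, n);             % rows handled in this block

    % Squared distances from rows r to all rows, using
    % |a - b|^2 = |a|^2 + |b|^2 - 2*a*b'
    D2 = bsxfun(@plus, sq(r), sq') - 2 * (X(r, :) * X');

    % Mask each point's distance to itself so it is never chosen
    D2(sub2ind(size(D2), 1:numel(r), r)) = Inf;

    [m, j]    = min(D2, [], 2);
    nnIdx(r)  = j;
    nnDist(r) = sqrt(max(m, 0));           % clamp tiny negative round-off
end
```

Only a blk-by-n slab of distances exists at any time, so memory use is set by the block size rather than by n squared.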

John

Subject: Re: Finding similar entries

From: Yi Cao

Date: 26 Feb, 2008 20:18:02

Message: 7 of 10

The main difficulty of the problem is that fully storing the
distances of all pairs requires more than 2 GB of memory:
(25000*24999)/2*8=2.4999e9 bytes. However, the original problem
is to find similarity among all pairs. Hence we can
calculate distances (even without using sqrt) for one
observation against all others, then judge which pairs are
similar according to some criterion, and store only the
indices of the similar ones. For example:

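X=randn(25000,8);   % random test data: 25000 observations of 8 variables
P=cell(1,25000);    % P{k} will hold the indices of points close to point k
tol=0.1;            % threshold on the SQUARED distance (no sqrt is taken)
tic
for k=1:25000
    % squared Euclidean distance from observation k to every observation
    x=sum((X(k+zeros(25000,1),:)-X).^2,2);
    P{k}=find(x<tol);   % note: this always includes k itself (distance 0)
end
toc
whos P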

Elapsed time is 250.986904 seconds

  Name      Size          Bytes      Class    Attributes

  P         1x25000       1700048    cell

In this example, P takes only about 1.7 MB of memory, but
includes indices for all similar pairs.
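
A variant of the same idea, as a sketch on made-up data: compare observation k only with the observations after it, so that each similar pair is stored once and a point is never matched with itself. The variable name pairs is my own, and tol is again a threshold on the squared distance.

```matlab
% Made-up data: one observation per row
X   = [0 0; 0.1 0; 3 3; 3 3.2; 10 10];
n   = size(X, 1);
tol = 0.1;                  % threshold on the squared distance

pairs = zeros(0, 2);        % each row will be one similar pair [k j], k < j
for k = 1:n-1
    % squared distances from observation k to observations k+1..n only
    d2 = sum(bsxfun(@minus, X(k+1:n, :), X(k, :)).^2, 2);

    % convert positions within k+1..n back to row indices of X
    j = k + find(d2 < tol);

    % append the new pairs (j(:) keeps the shape right when j is empty)
    pairs = [pairs; repmat(k, numel(j), 1), j(:)];
end
```

Growing pairs inside the loop is fine when few pairs pass the threshold. If many do, collecting the hits in a cell array and concatenating once at the end would likely be faster.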

hth

Yi

   

Subject: Re: Finding similar entries

From: Bruno Luong

Date: 26 Feb, 2008 20:33:02

Message: 8 of 10

"Yi Cao" <y.cao@cranfield.ac.uk> wrote in message
<fq1s5q$oot$1@fred.mathworks.com>...
> The main difficulty of the problem is that to fully store
> distances of all pairs requires more than 2 Gb memory:
> (25000*24999)/2*8=2.4999e9.

There is no need to compute the distance for all pairs;
Delaunay triangulation is the trick.

Bruno

Subject: Re: Finding similar entries

From: amit

Date: 21 May, 2008 17:15:03

Message: 9 of 10

"Bruno Luong" <b.luong@fogale.fr> wrote in message
<fq1t1t$9ql$1@fred.mathworks.com>...
> "Yi Cao" <y.cao@cranfield.ac.uk> wrote in message
> <fq1s5q$oot$1@fred.mathworks.com>...
> > The main difficulty of the problem is that to fully store
> > distances of all pairs requires more than 2 Gb memory:
> > (25000*24999)/2*8=2.4999e9.
>
> No need to compute distance for all pairs, Delaunay
> triangulation is the trick.
>
> Bruno


I want to find the similarity between all pairs. Please tell
me how to implement Delaunay triangulation in MATLAB.

Subject: Re: Finding similar entries

From: Bruno Luong

Date: 21 May, 2008 17:31:01

Message: 10 of 10

"amit " <amit_tilwankar@yahoo.com> wrote in message
<g11lan$2ps$1@fred.mathworks.com>...

>
> I want to find out similarity between all pairs; Please tell
> me how to implement Delaunay triangulation in Matlab?

There are ready-made functions in MATLAB:

help delaunay
help delaunay3
help delaunayn
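
A small sketch of how delaunayn could be used for this, on made-up 2-D points. The idea relies on the fact that each point's nearest neighbour is also one of its neighbours in the Delaunay triangulation, so only the triangulation edges need to be checked. The variable names are my own. Be aware that the number of simplices grows quickly with dimension, so for 25000 points in 8 dimensions the triangulation itself may be too expensive in time and memory; treat this as an illustration of the idea rather than a tested recipe for that case.

```matlab
% Made-up 2-D points: four corners of a square plus one interior point
X = [0 0; 1 0; 0 1; 1 1; 0.4 0.5];
n = size(X, 1);

% Each row of T lists the vertex indices of one simplex (a triangle in 2-D)
T = delaunayn(X);

% Every pair of vertices that share a simplex is an edge
c = nchoosek(1:size(T, 2), 2);
E = [];
for i = 1:size(c, 1)
    E = [E; T(:, c(i, :))];
end
E = unique(sort(E, 2), 'rows');     % undirected edges, each listed once

% Euclidean length of every edge
len = sqrt(sum((X(E(:, 1), :) - X(E(:, 2), :)).^2, 2));

% Nearest neighbour of each point, searching only its Delaunay edges
nn = zeros(n, 1);
for p = 1:n
    hit = find(E(:, 1) == p | E(:, 2) == p);
    [~, m] = min(len(hit));
    e = E(hit(m), :);
    nn(p) = e(e ~= p);
end
```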

Bruno
