<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553</link>
    <title>MATLAB Central Newsreader - Finding similar entries</title>
    <description>Feed for thread: Finding similar entries</description>
    <language>en-us</language>
    <copyright>© 1994-2008 by The MathWorks, Inc.</copyright>
    <webmaster>webmaster@mathworks.com</webmaster>
    <generator>MATLAB Central Newsreader</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>60</ttl>
    <image>
      <title>The MathWorks</title>
      <url>http://www.mathworks.com/images/membrane_icon.gif</url>
    </image>
    <item>
      <pubDate>Tue, 26 Feb 2008 10:29:02 -0500</pubDate>
      <title>Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417343</link>
      <author>Daniel </author>
      <description>I have a problem I can't seem to find the solution to. It's&lt;br&gt;
relatively easy.&lt;br&gt;
&lt;br&gt;
I have a collection of 25000 observations of 8 variables.&lt;br&gt;
&lt;br&gt;
I want to find entries which are similar to each other.&lt;br&gt;
There must be an easy way, can someone perhaps suggest&lt;br&gt;
something?&lt;br&gt;
&lt;br&gt;
i.e. each entry is a galaxy with 8 parameters, I want to&lt;br&gt;
find a galaxy which has similar properties to one I select.&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 26 Feb 2008 11:56:01 -0500</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417364</link>
      <author>carlos lopez</author>
      <description>If the 8 variables are real-valued (i.e. none of them is&lt;br&gt;
categorical) then you might want to consider a clustering&lt;br&gt;
algorithm.&lt;br&gt;
For example, look for the "cluster" keyword in the File&lt;br&gt;
Exchange.&lt;br&gt;
Regards&lt;br&gt;
Carlos&lt;br&gt;
&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 26 Feb 2008 12:01:03 -0500</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417367</link>
      <author>Simon Preston</author>
      <description>"Daniel " &amp;lt;daniel4738@hotmail.com&amp;gt; wrote in message&lt;br&gt;
&amp;lt;fq0ple$qos$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; I have a problem I can't seem to find the solution to. It's&lt;br&gt;
&amp;gt; relatively easy.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I have a collection of 25000 observations of 8 variables.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I want to find entries which are similar to each other.&lt;br&gt;
&amp;gt; There must be an easy way, can someone perhaps suggest&lt;br&gt;
&amp;gt; something?&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; i.e. each entry is a galaxy with 8 parameters, I want to&lt;br&gt;
&amp;gt; find a galaxy which has similar properties to one I select.&lt;br&gt;
&lt;br&gt;
It depends on what you mean by 'similar' - you need to&lt;br&gt;
decide an appropriate metric.  Similar = small Euclidean&lt;br&gt;
distance would probably be the simplest.  See&lt;br&gt;
doc pdist&lt;br&gt;
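&lt;br&gt;
As a rough sketch of what pdist gives you (this assumes the&lt;br&gt;
Statistics Toolbox is installed; Xs is a made-up random matrix&lt;br&gt;
standing in for a small subset of the real data):&lt;br&gt;
&lt;br&gt;
```matlab
% Hypothetical stand-in data: 500 observations of 8 variables.
Xs = randn(500, 8);

% pdist returns the n*(n-1)/2 pairwise Euclidean distances as a row vector.
D = pdist(Xs);

% squareform expands that vector into a symmetric n-by-n matrix.
Dsq = squareform(D);

% Mask the zero diagonal so a point is not reported as its own neighbour.
Dsq(logical(eye(size(Dsq)))) = Inf;

% Closest pair overall: smallest entry and its row/column indices.
[dmin, lin] = min(Dsq(:));
[i, j] = ind2sub(size(Dsq), lin);
fprintf('closest pair: rows %d and %d, distance %.4f\n', i, j, dmin);
```
&lt;br&gt;
If the 8 parameters are on very different scales,&lt;br&gt;
pdist(Xs,'seuclidean') scales each column by its standard&lt;br&gt;
deviation first, which may be closer to what 'similar' should mean.&lt;br&gt;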
&lt;br&gt;
Best wishes, S&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 26 Feb 2008 14:30:36 -0500</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417408</link>
      <author>John D'Errico</author>
      <description>"Daniel " &amp;lt;daniel4738@hotmail.com&amp;gt; wrote in message &lt;br&gt;
&amp;lt;fq0ple$qos$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; I have a problem I can't seem to find the solution to. It's&lt;br&gt;
&amp;gt; relatively easy.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I have a collection of 25000 observations of 8 variables.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I want to find entries which are similar to each other.&lt;br&gt;
&amp;gt; There must be an easy way, can someone perhaps suggest&lt;br&gt;
&amp;gt; something?&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; i.e. each entry is a galaxy with 8 parameters, I want to&lt;br&gt;
&amp;gt; find a galaxy which has similar properties to one I select.&lt;br&gt;
&lt;br&gt;
The simple solution is to compute an interpoint&lt;br&gt;
distance matrix. There are several such tools on&lt;br&gt;
the file exchange, or use pdist from the stats TB.&lt;br&gt;
But these will fail on a 25000 point set.&lt;br&gt;
&lt;br&gt;
I've written a code that allows you to find only&lt;br&gt;
those distances below some limit, or only the&lt;br&gt;
single nearest neighbor. I'd been planning on&lt;br&gt;
putting it on the file exchange when I got a&lt;br&gt;
round tuit. I'll do so today. E-mail me if you&lt;br&gt;
want it sooner.&lt;br&gt;
&lt;br&gt;
John&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 26 Feb 2008 18:00:22 -0500</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417478</link>
      <author>Pekka </author>
      <description>"John D'Errico" &amp;lt;woodchips@rochester.rr.com&amp;gt; wrote in &lt;br&gt;
message &amp;lt;fq17qb$2b2$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; "Daniel " &amp;lt;daniel4738@hotmail.com&amp;gt; wrote in message &lt;br&gt;
&amp;gt; &amp;lt;fq0ple$qos$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; &amp;gt; I have a problem I can't seem to find the solution to. It's&lt;br&gt;
&amp;gt; &amp;gt; relatively easy.&lt;br&gt;
&amp;gt; &amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; I have a collection of 25000 observations of 8 variables.&lt;br&gt;
&amp;gt; &amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; I want to find entries which are similar to each other.&lt;br&gt;
&amp;gt; &amp;gt; There must be an easy way, can someone perhaps suggest&lt;br&gt;
&amp;gt; &amp;gt; something?&lt;br&gt;
&amp;gt; &amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; i.e. each entry is a galaxy with 8 parameters, I want to&lt;br&gt;
&amp;gt; &amp;gt; find a galaxy which has similar properties to one I select.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; The simple solution is to compute an interpoint&lt;br&gt;
&amp;gt; distance matrix. There are several such tools on&lt;br&gt;
&amp;gt; the file exchange, or use pdist from the stats TB.&lt;br&gt;
&amp;gt; But these will fail on a 25000 point set.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I've written a code that allows you to find only&lt;br&gt;
&amp;gt; those distances below some limit, or only the&lt;br&gt;
&amp;gt; single nearest neighbor. I'd been planning on&lt;br&gt;
&amp;gt; putting it on the file exchange when I got a&lt;br&gt;
&amp;gt; round tuit. I'll do so today. E-mail me if you&lt;br&gt;
&amp;gt; want it sooner.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; John&lt;br&gt;
&lt;br&gt;
If you only need the similarity to the one you have &lt;br&gt;
selected, then you don't need the interpoint distance &lt;br&gt;
matrix. Distance to the selected one should be enough:&lt;br&gt;
If x is the 25000 by 8 data and myx is the selected (1 by 8)&lt;br&gt;
dist = bsxfun(@minus,x,myx); &lt;br&gt;
% euclidean distance for example&lt;br&gt;
Ed = sqrt(sum(dist.^2,2));&lt;br&gt;
Then it is up to you to decide what is close enough: &lt;br&gt;
[sEd,ind] = sort(Ed); and pick the small ones.&lt;br&gt;
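&lt;br&gt;
Put together, the steps might look like the sketch below. The random&lt;br&gt;
x, the selected row index sel and the count k are assumptions made&lt;br&gt;
for illustration; replace them with the real data and choices:&lt;br&gt;
&lt;br&gt;
```matlab
% Hypothetical stand-in data; replace x with the real 25000-by-8 table.
x   = randn(25000, 8);
sel = 42;              % index of the selected galaxy (assumed)
myx = x(sel, :);       % the selected observation, 1-by-8

% Difference of every row from the selected row, then Euclidean distance.
dist = bsxfun(@minus, x, myx);
Ed   = sqrt(sum(dist.^2, 2));

% Exclude the selected row itself, which is at distance zero.
Ed(sel) = Inf;

% Sort ascending and keep the k closest rows.
k = 10;
[sEd, ind] = sort(Ed);
nearest     = ind(1:k);      % row indices of the k most similar galaxies
nearestDist = sEd(1:k);      % their distances, smallest first
```
&lt;br&gt;
This needs only one 25000-by-1 vector of distances, so memory is not a concern.&lt;br&gt;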
&lt;br&gt;
Another solution is k-means clustering, which does not need &lt;br&gt;
the huge interpoint distance matrix. It is included in the Statistics &lt;br&gt;
Toolbox. If you don't have that, k-means is also available &lt;br&gt;
for free, at least in the SOM Toolbox: &lt;br&gt;
www.cis.hut.fi/projects/somtoolbox/ &lt;br&gt;
But even if two points are in the same cluster, they are not &lt;br&gt;
necessarily very similar. You would still need to calculate &lt;br&gt;
the similarity somehow, so I would go for a direct distance &lt;br&gt;
measure.&lt;br&gt;
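&lt;br&gt;
For comparison, a sketch of the k-means route, assuming the&lt;br&gt;
Statistics Toolbox kmeans function. The data, the number of&lt;br&gt;
clusters nc and the selected row sel are made up for illustration:&lt;br&gt;
&lt;br&gt;
```matlab
% Hypothetical stand-in data: 2000 observations of 8 variables.
X = randn(2000, 8);

% Partition into nc clusters; nc = 10 is an arbitrary illustrative choice.
nc = 10;
[idx, C] = kmeans(X, nc);    % idx: cluster label per row, C: centroids

% Members of the cluster that contains a selected row (sel is assumed).
sel     = 1;
members = find(idx == idx(sel));

% Rank those members by distance to the selected row, because cluster
% membership alone does not say how close two points are.
d = sqrt(sum(bsxfun(@minus, X(members, :), X(sel, :)).^2, 2));
[ds, order] = sort(d);
ranked = members(order);     % first entry is sel itself, at distance 0
```
&lt;br&gt;
Note that a close neighbour can still end up in a different cluster,&lt;br&gt;
which is another reason to prefer the direct distance.&lt;br&gt;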
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 26 Feb 2008 19:10:04 -0500</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417493</link>
      <author>John D'Errico</author>
      <description>"Pekka " &amp;lt;pekka.nospam.kumpulainen@tut.please.fi&amp;gt; wrote in message &lt;br&gt;
&amp;lt;fq1k3m$jl$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; "John D'Errico" &amp;lt;woodchips@rochester.rr.com&amp;gt; wrote in &lt;br&gt;
&amp;gt; message &amp;lt;fq17qb$2b2$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; &amp;gt; "Daniel " &amp;lt;daniel4738@hotmail.com&amp;gt; wrote in message &lt;br&gt;
&amp;gt; &amp;gt; &amp;lt;fq0ple$qos$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; I have a problem I can't seem to find the solution to. It's&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; relatively easy.&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; I have a collection of 25000 observations of 8 variables.&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; I want to find entries which are similar to each other.&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; There must be an easy way, can someone perhaps suggest&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; something?&lt;br&gt;
&lt;br&gt;
(Snip)&lt;br&gt;
&lt;br&gt;
&amp;gt; If you only need the similarity to the one you have &lt;br&gt;
&amp;gt; selected, then you don't need the interpoint distance &lt;br&gt;
&amp;gt; matrix.&lt;br&gt;
&lt;br&gt;
But that's the thing. The OP IS asking to find&lt;br&gt;
entries that are similar to each other, not similar&lt;br&gt;
to a given point.&lt;br&gt;
&lt;br&gt;
This does require more work. It's also why I&lt;br&gt;
posted ipdm on the file exchange today, as&lt;br&gt;
it can find only some restricted subset of&lt;br&gt;
points, such as the 1000 closest points, or&lt;br&gt;
only those with a distance less than some&lt;br&gt;
limiting value, or only the nearest neighbor&lt;br&gt;
to each point.&lt;br&gt;
&lt;br&gt;
John&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 26 Feb 2008 20:18:02 -0500</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417521</link>
      <author>Yi Cao</author>
      <description>The main difficulty of the problem is that fully storing the &lt;br&gt;
distances of all pairs requires more than 2 GB of memory: &lt;br&gt;
(25000*24999)/2*8=2.4999e9 bytes. However, the original problem &lt;br&gt;
is to find similarity among all pairs. Hence we can &lt;br&gt;
calculate distances (even without using sqrt) for one &lt;br&gt;
observation against all others, then judge which pairs are &lt;br&gt;
similar according to some criterion and store only the &lt;br&gt;
indices of those that are similar. For example:&lt;br&gt;
&lt;br&gt;
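X=randn(25000,8);   % example data&lt;br&gt;
P=cell(25000,1);    % P{k} will hold the indices of the rows similar to row k&lt;br&gt;
tol=0.1;            % threshold on the squared distance (no sqrt is taken)&lt;br&gt;
tic&lt;br&gt;
for k=1:25000&lt;br&gt;
% squared Euclidean distance from row k to every row&lt;br&gt;
x=sum((X(k+zeros(25000,1),:)-X).^2,2);&lt;br&gt;
P{k}=find(x&amp;lt;tol);   % includes k itself, since x(k) is 0&lt;br&gt;
end&lt;br&gt;
toc&lt;br&gt;
whos P&lt;br&gt;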
&lt;br&gt;
Elapsed time is 250.986904 seconds&lt;br&gt;
&lt;br&gt;
Name Size     Bytes    Class Attributes&lt;br&gt;
&lt;br&gt;
P    1x25000  1700048  cell&lt;br&gt;
&lt;br&gt;
In this example, P takes only about 1.7 MB of memory, but &lt;br&gt;
includes indices for all similar pairs.&lt;br&gt;
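&lt;br&gt;
A variation on the loop above, as a sketch only: if all that is&lt;br&gt;
needed is the single nearest neighbour of each observation, two&lt;br&gt;
n-by-1 vectors are enough. The names nnIdx and nnDist and the&lt;br&gt;
small n are assumptions made for illustration:&lt;br&gt;
&lt;br&gt;
```matlab
% Hypothetical stand-in data; n is kept small so the sketch runs quickly.
n = 2000;
X = randn(n, 8);

nnIdx  = zeros(n, 1);   % index of each row's nearest neighbour
nnDist = zeros(n, 1);   % Euclidean distance to that neighbour

for k = 1:n
    % Squared distances from row k to every row (no sqrt needed to rank).
    d2 = sum(bsxfun(@minus, X, X(k, :)).^2, 2);
    d2(k) = Inf;                    % ignore the row itself
    [d2min, nnIdx(k)] = min(d2);
    nnDist(k) = sqrt(d2min);        % one sqrt per row, on the minimum only
end
```
&lt;br&gt;
The work is still proportional to n^2, so only the storage is reduced, not the run time.&lt;br&gt;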
&lt;br&gt;
hth&lt;br&gt;
&lt;br&gt;
Yi&lt;br&gt;
&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 26 Feb 2008 20:33:02 -0500</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#417526</link>
      <author>Bruno Luong</author>
      <description>"Yi Cao" &amp;lt;y.cao@cranfield.ac.uk&amp;gt; wrote in message&lt;br&gt;
&amp;lt;fq1s5q$oot$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; The main difficulty of the problem is that to fully store &lt;br&gt;
&amp;gt; distances of all pairs requires more than 2 Gb memory: &lt;br&gt;
&amp;gt; (25000*24999)/2*8=2.4999e9. &lt;br&gt;
&lt;br&gt;
There is no need to compute the distance for all pairs; Delaunay&lt;br&gt;
triangulation is the trick.&lt;br&gt;
&lt;br&gt;
Bruno&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Wed, 21 May 2008 17:15:03 -0400</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#433373</link>
      <author>amit </author>
      <description>"Bruno Luong" &amp;lt;b.luong@fogale.fr&amp;gt; wrote in message&lt;br&gt;
&amp;lt;fq1t1t$9ql$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; "Yi Cao" &amp;lt;y.cao@cranfield.ac.uk&amp;gt; wrote in message&lt;br&gt;
&amp;gt; &amp;lt;fq1s5q$oot$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; &amp;gt; The main difficulty of the problem is that to fully store &lt;br&gt;
&amp;gt; &amp;gt; distances of all pairs requires more than 2 Gb memory: &lt;br&gt;
&amp;gt; &amp;gt; (25000*24999)/2*8=2.4999e9. &lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; No need to compute distance for all pairs, Delaunay&lt;br&gt;
&amp;gt; triangulation is the trick.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Bruno&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
I want to find the similarity between all pairs. Could you please tell&lt;br&gt;
me how to implement Delaunay triangulation in MATLAB?&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Wed, 21 May 2008 17:31:01 -0400</pubDate>
      <title>Re: Finding similar entries</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/164553#433377</link>
      <author>Bruno Luong</author>
      <description>"amit " &amp;lt;amit_tilwankar@yahoo.com&amp;gt; wrote in message&lt;br&gt;
&amp;lt;g11lan$2ps$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I want to find out similarity between all pairs; Please tell&lt;br&gt;
&amp;gt; me how to implement Delaunay triangulation in Matlab?&lt;br&gt;
&lt;br&gt;
There are ready-made functions in MATLAB:&lt;br&gt;
&lt;br&gt;
help delaunay&lt;br&gt;
help delaunay3&lt;br&gt;
help delaunayn&lt;br&gt;
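&lt;br&gt;
For what it is worth, a minimal sketch of how the triangulation&lt;br&gt;
can be used, on made-up 2-D points. It relies on the fact that each&lt;br&gt;
point's nearest neighbour is joined to it by a Delaunay edge, so only&lt;br&gt;
the edges need to be checked. The variable names are illustrative:&lt;br&gt;
&lt;br&gt;
```matlab
% Hypothetical 2-D stand-in data; the data in this thread has 8 columns.
pts = rand(200, 2);

% Each row of T lists the vertex indices of one triangle (simplex).
T = delaunayn(pts);

% Collect the three edges of every triangle and remove duplicates.
E = [T(:, [1 2]); T(:, [2 3]); T(:, [1 3])];
E = unique(sort(E, 2), 'rows');

% Length of each edge; short edges join points that are close together.
len = sqrt(sum((pts(E(:, 1), :) - pts(E(:, 2), :)).^2, 2));
[lenSorted, order] = sort(len);
closestPair = E(order(1), :);   % row indices of the two closest points
```
&lt;br&gt;
The number of simplices grows quickly with the dimension, so running&lt;br&gt;
delaunayn on 25000 points with 8 columns may well be impractical;&lt;br&gt;
it is worth trying a small subset first.&lt;br&gt;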
&lt;br&gt;
Bruno&lt;br&gt;
</description>
    </item>
  </channel>
</rss>
