Code covered by the BSD License  

4.3 | 7 ratings | 33 Downloads (last 30 days) | File Size: 3.43 KB | File ID: #35014

clusterData

by Brett Shoelson

27 Jun 2012 (Updated 10 Jun 2013)

Clusters an MxN array of data into an unspecified number (P) of bins.


File Information
Description

No a priori knowledge of the number of bins, or the distance between bins, is required. This approach relies on the relative difference between (sorted) elements of the data, and works well when the difference between clusters is bigger than the difference between elements within a cluster.
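
As a rough illustration of the idea described above, the following sketch splits a single column wherever the gap between sorted neighbours is unusually large. It is not the clusterData source: the sample data, the variable names, and the gapFactor threshold (gaps larger than gapFactor times the median gap count as breaks) are all assumptions made for this example.

```matlab
% Illustrative sketch only; NOT the clusterData implementation.
x = [1.0 1.2 0.9 5.1 5.3 4.8 9.9 10.2]';    % made-up sample data (column vector)
gapFactor = 3;                               % hypothetical threshold parameter

[xs, order] = sort(x);                       % sort so neighbours are adjacent
gaps = diff(xs);                             % differences between sorted elements
isBreak = gaps > gapFactor * median(gaps);   % unusually large gaps separate clusters

labelsSorted = [1; 1 + cumsum(isBreak)];     % cluster label for each sorted element
labels = zeros(size(x));
labels(order) = labelsSorted;                % map labels back to the original order

nClusters = max(labels);                     % number of clusters found (P)
clusters = accumarray(labels, x, [nClusters 1], @(v) {v});   % Px1 cell array
```

With this sample data the gaps between 1.2 and 4.8 and between 5.3 and 9.9 dwarf the gaps inside each group, so three clusters come out without the number of bins being specified in advance.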
 
SYNTAX:
CLUSTERS = clusterData(DATA);

Operates column-by-column. An optional input allows you to specify the sensitivity of each columnwise clustering. Additional outputs specify the index of the cluster to which each row of data belongs, and the bounds used to separate the clusters.

Each column may have a different interpretation. For instance, an Mx4 array of data may represent x-data in the first column, y- in the second, z- in the third, and t- in the fourth. Returns a Px1 cell array, CLUSTERS, specifying the data points in each of the P clusters detected.

The final clustering utilizes all columns.
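
The description does not say how the columnwise results are combined. One plausible way, shown here purely as an assumption, is to treat each distinct combination of per-column labels as one final cluster. The data and the colLabels array below are made up for the example.

```matlab
% Illustrative sketch only; the combination rule is an assumption.
data = [0.1 10; 0.2 11; 0.3 50; 5.0 51; 5.1 52; 5.2 12];   % made-up Mx2 data
colLabels = [1 1; 1 1; 1 2; 2 2; 2 2; 2 1];   % hypothetical per-column cluster labels

% Each distinct row of labels defines one final cluster.
[combos, ~, rowCluster] = unique(colLabels, 'rows');
P = size(combos, 1);                          % number of final clusters

% Collect the rows of data belonging to each final cluster (Px1 cell array).
% accumarray does not guarantee the order of indices it passes, hence sort.
rowIdx = (1:size(data, 1))';
clusters = accumarray(rowCluster, rowIdx, [P 1], @(r) {data(sort(r), :)});
```

Here rowCluster plays the role of a per-row cluster index, and clusters holds the data points of each of the P clusters.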

NOTE: This submission incorporates, expands, and replaces my earlier submission ezCluster.

MATLAB release: MATLAB 7.13 (R2011b)
Other requirements: Should be Toolbox and platform independent.
Comments and Ratings (16)
22 Oct 2013 tsan toso

Ah, thanks for the catch, Brett. I just got around to running the code.

21 Oct 2013 Brett Shoelson

Hi Tsan,

I didn’t spend a lot of time trying to understand your data, but I did manage to cluster them in less than 1 second, using clusterData. I noticed that your column 2 isn’t fully filled out. I think that’s why you’re seeing the long delay when you include column 2. If you were to exclude the pairs with missing values, it would process a lot faster. (In fact, I’m not sure how I treated missing variables. Maybe as NaNs.)

Let me know if the clustering you get with

[clusters,clusterInds,clusterBounds] = clusterData(Binningbydensity(1:3216,:));

works for you. (Those are the rows without missing column-two values.)

Cheers,
Brett
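
The reply above suggests excluding rows with missing values before clustering. A minimal sketch of that step follows, assuming the missing entries are stored as NaN (the reply itself is unsure how they are treated). The sample array is made up.

```matlab
% Illustrative sketch; assumes missing values are represented as NaN.
data = [1 2; 3 NaN; 5 6; NaN 8];       % made-up example with missing values

hasMissing = any(isnan(data), 2);      % true for rows containing any NaN
cleanData = data(~hasMissing, :);      % keep only the complete rows
keptRows = find(~hasMissing);          % original row numbers of the kept rows
```

cleanData could then be passed to clusterData, and keptRows maps the resulting cluster indices back to rows of the original array.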

20 Oct 2013 tsan toso

Hi Brett,

If I use your suggested method would it just group data together based on densities and not consider the relative distance of the data between each other? For example let’s just say the data ranges from 1 to 10. The observations of 1 are the same as 10. Observations in between are markedly different, would your function then just put 1 & 10 in the same bin?

For my purposes, I would just want to group adjacent bins of the same or similar density together.

I also included a web link for my data just to give you an idea of what kind of data I am dealing with. I provided 2 cols, each is a different random variable.

https://docs.google.com/spreadsheet/ccc?key=0Anv9v54gTjMedGtiRW5fanFRUFBOcW4xUTJ4NHFWbFE&usp=drive_web#gid=0

Another question: for the dataset in the 2nd column, it seems to run for a particularly long time. The data are just integers centering around 1 with dispersion as far as 7. Is there any workaround for that?

Thanks.

19 Oct 2013 Brett Shoelson

@tsan: Hi Tsan,
It's difficult to comment without seeing your data, but it sounds like you could just create and analyze a vector of densities. clusterData will spit out the indices for the groupings. (You may need to tweak the sensitivity.)
Cheers,
Brett

19 Oct 2013 tsan toso

Hi Brett,

Great code, I got a question on how I could use the code for my purposes:

How would you recommend using the code to bin a sample of data based on its density (number of occurrences / length of edge)? The length of each edge is determined by whether adjacent data groups have similar density. (Groups of similar density are binned together, but if the neighboring bin is 40% more or less dense, it requires another bin.)

It seems like what your code is doing is grouping data based on how close they are to each other.

Thanks.
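
The adjacency rule described in this question can be sketched directly, independent of clusterData. The example assumes that each bin is compared with the bin immediately before it, and that a relative change in density above 40% starts a new group. The counts and widths are made up.

```matlab
% Illustrative sketch; the comparison rule is an assumption, the data are made up.
counts = [10 11 9 30 28 5];            % occurrences in each initial bin
widths = [1 1 1 1 1 1];                % length of each bin (edge length)
density = counts ./ widths;            % density = occurrences / length

% Relative change in density from each bin to the next.
relChange = abs(diff(density)) ./ density(1:end-1);

% Start a new group at the first bin and wherever the change exceeds 40%.
newGroup = [true, relChange > 0.40];
groupId = cumsum(newGroup);            % group index for each initial bin
```

Because only neighbouring bins are compared, bins at opposite ends of the range never share a group unless every bin between them is similar as well.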

23 Jul 2013 Hoi Wong

Haha. My data set is supposed to give me an array of numbers, but sometimes I got a singleton. That's how I found out. By the way, excellent submission!

23 Jul 2013 Hoi Wong  
20 Jul 2013 Xiong

thank you for your submission!

12 Jul 2013 Brett Shoelson

@Hoi,
Hmmm. Well, that's clearly a "bug" in the sense that I could have dealt with that case more gracefully, but then--well, let's just say that I never anticipated that anyone would try to cluster a single scalar. :)

11 Jul 2013 Hoi Wong

It seems like the program gets stuck (running forever) when I try to cluster a singleton, say clusterData(3).
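
Until the function handles this case itself, a caller could guard against it. The following is a hypothetical wrapper fragment, not part of the submission. It assumes that returning the lone value as its own cluster is the desired behaviour.

```matlab
% Hypothetical caller-side guard; not part of clusterData itself.
data = 3;                              % the singleton case from the report above

if isscalar(data)
    clusters = {data};                 % assumption: a lone value is its own cluster
else
    clusters = clusterData(data);      % requires clusterData.m on the MATLAB path
end
```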

25 Jun 2013 Brett Shoelson

Han, did you find some problem with the submission that led you to rate this so poorly? Do you have any comments to share that might help me understand why it merits a two-star rating?
Thanks,
Brett

25 Jun 2013 Han  
13 May 2013 Deanna  
11 May 2013 Joel

Excellent submission

18 Sep 2012 Venkat R

Very cool submission. I was searching for different options to find 'k' automatically in k-means. This submission does it nicely.

10 Aug 2012 Brett Shoelson

PLEASE NOTE that this code uses tildes for argument placeholders. As such, it will not work without modification on releases prior to R2009b. Feel free to edit the code, or upgrade to a newer MATLAB!!!
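
For anyone editing the code for an older release, the change amounts to replacing each tilde with a throwaway variable. The example below uses sort for illustration; the name unused is arbitrary, and the actual calls inside clusterData may differ.

```matlab
x = [3 1 2];

% R2009b and later: the tilde discards the first output.
[~, idx] = sort(x);

% Earlier releases: capture the output in a dummy variable instead.
[unused, idx2] = sort(x);
clear unused                           % optional: remove the dummy variable
```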

Updates
10 Jun 2013

Modified the help to correct a doc bug. Higher sensitivity results in fewer clusters, not more. (No code change.)
