Gap statistics

Rating: 4.0 (4 ratings) | 13 downloads (last 30 days) | File size: 6.93 KB | File ID: #37905 | Version: 1.4

24 Aug 2012 (Updated )

Algorithm for cluster validity index, R. Tibshirani et al. 2001


File Information

The gap statistic is a method for estimating the most likely number of clusters in a partitional clustering such as k-means (though a more robust clustering method is worth considering). The measure was introduced by Trevor Hastie, Robert Tibshirani, and Guenther Walther, all of Stanford University.
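For reference, the quantity being estimated can be written as follows. This follows the definition in the 2001 paper: C_r is the r-th cluster, n_r its size, d_{ii'} the distance between points i and i', and E_n^* the expectation under the reference distribution (approximated in practice by averaging over reference data sets).

```latex
D_r = \sum_{i,\, i' \in C_r} d_{ii'}, \qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r, \qquad
\mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k
```

A large gap at a given k means the data are clustered much more tightly at that k than reference data without cluster structure would be.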

I am posting it here because I could not find any other gap statistic implementation to validate my code against, so feel free to report bugs and suggest improvements.

Required Products: Statistics and Machine Learning Toolbox
MATLAB release: MATLAB 7.14 (R2012a)
Comments and Ratings (8)
03 Nov 2013 Franck Dernoncourt

Note that the Statistics Toolbox implements the gap statistic as a class in the package clustering.evaluation since R2013b:
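A minimal sketch of how the built-in evaluation might be called, assuming R2013b or later with the toolbox installed. The test data, the seed, and the candidate range 1:5 are made up for illustration and are not taken from the submission.

```matlab
% Sketch: built-in gap statistic through evalclusters (R2013b or later).
% The data, the seed and the candidate range below are illustrative assumptions.
rng(1);                                   % fix the seed so the run is repeatable
X = [randn(100, 2); randn(100, 2) + 8];   % two well-separated blobs

% Evaluate k = 1..5 with k-means as the clustering algorithm.
eva = evalclusters(X, 'kmeans', 'gap', 'KList', 1:5);

disp(eva.OptimalK);                       % estimated number of clusters
```

The returned object belongs to the clustering.evaluation package mentioned above, so its result can be compared with the output of this submission on the same data.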

Comment only
09 May 2013 kenerlunix Shen

When I run the test.m file, it takes so long that I cannot tell whether something is wrong.

02 May 2013 Legato community

Hi, I used this method to compute the optimal number of clusters. When I run the program once I get, for example, 4 clusters, but when I run it again on the same signal, without any changes, I get 3 clusters. How is that possible? My signal is 10 seconds (5000 samples) of ECG, from which only the QRS complexes (usually 50 samples each) are cut out.

10 Mar 2013 Alessandro Crimi

Thank you for both your comments.
Regarding the xpos and ypos bug, you were right and I have fixed it.
Regarding the distribution: according to the paper the reference distribution is "uniform". I do not know how much difference this makes here, but the code now follows the paper.

Comment only
12 Jan 2013 Rui

Hi, thank you for sharing the code.
My only comment is that the reference distribution you use is a multivariate normal distribution, not the uniform distribution described in the paper. I am not sure whether this makes any difference to the result.
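To make the two choices concrete, here is a small sketch of both reference generators. It is not the submission's code: the variable names (ref_uniform, ref_normal) and the test data are hypothetical, and the uniform version samples over the bounding box of the data, which is the simpler of the two options in the paper.

```matlab
% Sketch: two ways to draw a reference data set of the same size as X.
% All names and data here are illustrative assumptions.
rng(0);
X = [randn(50, 2); randn(50, 2) + 6];     % hypothetical n-by-d data
[n, d] = size(X);

% Uniform reference: each feature drawn uniformly over its observed range.
lo = min(X, [], 1);
hi = max(X, [], 1);
ref_uniform = bsxfun(@plus, lo, bsxfun(@times, rand(n, d), hi - lo));

% Normal reference: same mean and covariance as the data.
ref_normal = mvnrnd(mean(X, 1), cov(X), n);
```

The uniform sample always stays inside the bounding box of the data, while the normal sample concentrates around the overall mean and can fall outside the box, so the two can give different reference dispersions.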

04 Jan 2013 Jonathan Harris

Out of curiosity, what is the following code for? I ran into the issue of xpos = [] and noticed this line while going through the code. Am I correct in assuming that it is supposed to be xpos instead of ypos?

ypos = max(num_clusters);

My original thought in cases like this was to set xpos = NaN and then take the mode of the opt_index matrix after 10+ iterations.
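A sketch of that idea, under stated assumptions: the way xpos is computed here (find on the maximum gap) is a guess at the kind of lookup involved rather than the submission's actual line, and the gap and opt_index values are made up.

```matlab
% Sketch: guard against an empty xpos, then vote across repeated runs.
% The gap values and opt_index entries are made-up illustrations.
num_clusters = 1:5;                   % candidate numbers of clusters
gap = [NaN NaN NaN NaN NaN];          % a run in which every gap value is NaN

xpos = find(gap == max(gap));         % empty, because NaN == NaN is false
if isempty(xpos)
    opt_k = NaN;                      % mark the run as failed instead of erroring
else
    opt_k = num_clusters(xpos(1));
end

% Results of 10 hypothetical runs, two of which failed.
opt_index = [3 NaN 3 2 3 NaN 3 4 3 3];
valid = opt_index(~isnan(opt_index)); % drop the failed runs
consensus_k = mode(valid);            % most frequent estimate
```

Taking the mode rather than the mean keeps the answer an integer and is not pulled around by an occasional outlying run.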

Comment only
11 Sep 2012 Alessandro Crimi

Thank you for your comments.
These are two different problems:
1. The estimation of the dispersion relies (in my code) on the clustering produced by k-means, which is not the most robust method in the world. So the problem you see when you reduce the data to 10 samples per cluster comes from the clustering step; you should replace kmeans with something more robust (e.g. kmedoids or spectral clustering).
2. From what I have seen, the gap statistic works in about 80% of cases, again because it relies on kmeans and on other randomization. That is why I run 10 iterations and then show the mean. If you have any ideas on how to improve these practical aspects (random seed and kmeans), I would welcome them, but with respect to the Hastie paper, I implemented it exactly as they describe.
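On the practical side (random seed and kmeans), one possible sketch is to fix the seed before each run and let kmeans restart several times. The seed value 42, the data, and the choice of 10 replicates are arbitrary assumptions, not settings from the submission.

```matlab
% Sketch: make a k-means run repeatable and less sensitive to initialization.
% Seed, data and replicate count are illustrative assumptions.
rng(42);                                  % fixed seed: same random stream each run
X = [randn(100, 2); randn(100, 2) + 8];   % hypothetical data with two blobs

% 'Replicates' reruns k-means from different starts and keeps the best result.
idx1 = kmeans(X, 2, 'Replicates', 10);

rng(42);                                  % reset the stream to the same state
idx2 = kmeans(X, 2, 'Replicates', 10);

disp(isequal(idx1, idx2));                % identical labels for identical seeds
```

Fixing the seed only makes a given run reproducible; it does not make the estimate more accurate, which is what the replicates and a more robust clustering method are for.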

Comment only
06 Sep 2012 Lee ZY

Hi Crimi, it took me some time to work through the code,
so I simply changed the test data to

test_datas = rand(10,2)*1;
test_data2 = rand(10,2)*1+10;
test_datas = cat(1,test_datas, test_data2);

and it turned out:
Warning: There is a NaN or Inf among the results
??? Attempted to access xpos(1); index out of bounds because numel(xpos)=0.

Error in ==> gap_statistics at 54
opt_k = num_clusters(xpos(1));

Error in ==> test_gap_statistics at 38
[ opt_index(ii), max_gap(ii)] = gap_statistics(test_datas, num_clusters, num_reference_bootstraps, compactness_as_dispersion);

So I tried again with the original number of test data points, which is 200, and I got:

"The max gap is for 3 cluster/s.."

Does this mean that the estimated number of clusters is 3? Shouldn't it be 2, given the test data? Please correct me if I'm wrong.

Thank you very much for your valuable time! Very much appreciated!
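One thing that may matter for a case like this: the 2001 paper does not pick the k with the largest gap, but the smallest k for which Gap(k) >= Gap(k+1) - s(k+1), where s is the standard error term of the reference dispersions. The sketch below compares the two rules on made-up numbers. The gap and s values are invented for illustration and are not output from this submission.

```matlab
% Sketch: "maximum gap" versus the one-standard-error rule from the paper.
% gap and s are made-up values, not results from the submitted code.
num_clusters = 1:5;
gap = [0.20 0.95 1.00 0.70 0.60];     % hypothetical gap values for k = 1..5
s   = [0.05 0.06 0.08 0.07 0.07];     % hypothetical standard error terms

% Rule 1: k with the largest gap.
[~, imax] = max(gap);
k_max = num_clusters(imax);

% Rule 2: smallest k with gap(k) >= gap(k+1) - s(k+1).
ok = gap(1:end-1) >= gap(2:end) - s(2:end);
i1 = find(ok, 1, 'first');
if isempty(i1)
    k_rule = num_clusters(end);       % no k qualifies: fall back to the largest k
else
    k_rule = num_clusters(i1);
end

fprintf('max gap: k = %d, paper rule: k = %d\n', k_max, k_rule);
```

With these numbers the largest gap is at k = 3, but k = 2 already satisfies the paper's rule, so the two criteria can disagree when the gap curve rises only slightly after the true number of clusters.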

24 Aug 2012 1.1

bug fixed

27 Aug 2012 1.2

better code and communication of some validation

11 Mar 2013 1.4

bug fixed
