Flexible mixture models for automatic clustering

version 0.40 (61.2 KB) by Statovic
This is a MATLAB implementation of clustering (i.e., finite mixture models, unsupervised classification).

Updated 01 Mar 2021

SNOB is a MATLAB implementation of finite mixture models. SNOB uses the minimum message length (MML) criterion to estimate the structure of the mixture model (i.e., the number of sub-populations and which sample belongs to which sub-population) and to estimate all mixture model parameters. The user may specify the desired number of sub-populations; if this is not specified, SNOB will attempt to discover it automatically. Currently, SNOB supports mixtures of the following distributions:

-Beta distribution
-Exponential distribution
-Univariate gamma distribution
-Geometric distribution
-Inverse Gaussian distribution
-Univariate Laplace distribution
-Gaussian linear regression
-Logistic regression
-Multinomial distribution
-Multivariate Gaussian distribution
-Negative binomial distribution
-Univariate normal distribution
-Poisson distribution
-Multivariate normal distribution (single factor analysis)
-von Mises-Fisher distribution
-Weibull distribution

The program is easy to use and supports missing data; all missing values should be coded as NaN. Examples of how to use the program are provided in data/mm_example?.m, and a minimal usage sketch follows below.
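
As a quick illustration, here is a minimal usage sketch. The exact calling syntax of snob and the argument list of mm_PlotModel1d are assumptions based on the examples and the comments below, so the bundled data/mm_example?.m scripts remain the authoritative reference:

x = randn(200, 1);            % toy data; any missing values would be coded as NaN
mm = snob(x, {'norm', 1});    % assumed syntax: univariate normal model for column 1, number of classes chosen automatically
mm_PlotModel1d(mm, x, 1);     % plot the fitted 1D mixture density (argument list assumed)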

UPDATE VERSION 0.4.0 (01/03/2021):
Latest updates:
-added two new distributions (von Mises-Fisher and beta; improved numerical accuracy for high-dimensional VMF mixtures will come in a later update)
-added two examples of using SNOB
-improved numerical accuracy overall
-improved documentation

Cite As

Wallace, C. S. & Dowe, D. L. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, 2000, 10, pp. 73-83

Wallace, C. S. Intrinsic Classification of Spatially Correlated Data. The Computer Journal, 1998, 41, pp. 602-611

Wallace, C. S. Statistical and Inductive Inference by Minimum Message Length. Springer, 2005

Schmidt, D. F. & Makalic, E. Minimum Message Length Inference and Mixture Modelling of Inverse Gaussian Distributions. AI 2012: Advances in Artificial Intelligence, Springer Berlin Heidelberg, 2012, 7691, pp. 672-682

Edwards, R. T. & Dowe, D. L. Single factor analysis in MML mixture modelling. Research and Development in Knowledge Discovery and Data Mining, Second Pacific-Asia Conference (PAKDD-98), 1998, 1394

Comments and Ratings (13)

Statovic

Hi Matthew, You can contact me via email at "emakalic@gmail.com". Happy to talk more about your research project. All the best, Enes

Matthew Moore

Hi Enes, thanks for your response; unfortunately, with the modifications you have suggested, the problem still persists. I am trying to fit a range of mixtures to data and compare each fit for various values of K components for a research paper. Is there a better way to contact you to discuss my issue, or perhaps we could have a quick chat with my research supervisor and me? Thanks

Statovic

Hi Matthew,
There are two reasons why some classes can collapse even if 'fixedstructure' is set to true: (1) the initial assignment may not be ideal, and (2) the stochastic re-assignment step during the search process. As for the second reason, we can disable re-assignment of samples to classes by adding the following code to "mm_Search.m" at line 34 (right after the call to mm_Reassign):

if (mm.opts.fixedstructure && (mm_r.nClasses < mm.nClasses))
    msglen_r = inf;   % make sure we don't select this model
end

This ensures that if a re-assignment results in a structure change, we reject the resulting model. The first issue is a little trickier and will require a more substantial change - we will look into this and get back to you.

I hope this helps.
Enes

Matthew Moore

Hi, is there any workaround so I can specify a certain number of classes without receiving the "removing classes due to insufficient membership" message? I am trying to explore the fits for various mixture distributions ranging from 1-4 components, but for some of my data I am limited in the number of components I can fit. Are there any modifications to the code I can make to free me of these limitations and fit as many components to the data as I want (even if the data is overfitted)? Thanks

Statovic

Hi Matthew,
We will add a nicer function for plotting the CDF of a 1D mixture model in the next update. In the meantime, the following should work. Assuming you have a mixture model called "mm" and the first column of your data is modelled by a univariate Gaussian, the following should plot the CDF:

mm_example2;                        % fit an example mixture model (creates mm)
clf;
N = 1e5;                            % number of samples to simulate from the mixture
wClass = mnrnd(1, mm.a, N);         % sample class memberships from the mixing proportions
x = zeros(N, 1);
for i = 1:N
    K = find(wClass(i,:));                      % class of the i-th simulated sample
    theta = mm.class{K}.model{1}.theta;         % [mean; variance] of attribute 1 in class K
    x(i) = normrnd(theta(1), sqrt(theta(2)));   % draw from that class's Gaussian
end
cdfplot(x)                          % empirical CDF of the simulated mixture
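
For a smooth curve without simulation noise, the exact mixture CDF can also be evaluated directly as a weighted sum of the component Gaussian CDFs. This is only a sketch, assuming (as in the snippet above) that mm.a holds the mixing proportions and each class stores theta = [mean; variance] for the first attribute:

t = linspace(min(x), max(x), 500);                 % evaluation grid over the data range
F = zeros(size(t));
for k = 1:mm.nClasses
    theta = mm.class{k}.model{1}.theta;            % [mean; variance] of attribute 1 in class k
    F = F + mm.a(k) * normcdf(t, theta(1), sqrt(theta(2)));
end
plot(t, F); xlabel('x'); ylabel('F(x)');           % exact CDF of the fitted mixture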

I hope this helps.
Cheers,
Enes

Statovic

Hi Matthew,
Unfortunately there are no functions to plot a CDF at the moment. There is, however, a function "mm_PlotModel1d(.)" that plots the PDF, if that is of interest.
Cheers,
Enes

Matthew Moore

Hi, is there any method or function to plot the CDF of a 1D Gaussian mixture model? I've used snob with 'norm' to calculate the component properties, but I am unable to plot a CDF. Thanks

Statovic

Hi Maryam,
Sorry about the late reply. The package uses the Optimization Toolbox for clustering with some distributions (e.g., the Weibull distribution). If you do not have the Optimization Toolbox, you might still be able to use the package with the other distributions (e.g., Gaussian and multinomial). You could do this by, for example, setting the variable that triggers the error to an empty array. I hope this helps.
Cheers,
Enes

Maryam Ghahramani

Hi,

I keep getting the following error, even while running the examples:

Error using optimoptions (line 124)
Empty keys are not allowed in this container.

Statovic

Thanks Aishwarya.
By default, the program will try to determine a suitable number of classes automatically from the data. The option ['k', <int>] allows the user to specify the starting model structure for the search. If you wish to find the best 3-class model, for example, and do not want the number of classes to be searched automatically, use the option ['fixedstructure', true] together with ['k', 3], as in the sketch below.

Lorenzo Puppo

Bryce Grier

Aishwarya Venkatesh

Hello.
Thanks a lot for the great work!! I am using this script for my application, which involves solving the optimization problem for a marginal distribution. It seems to be working well.

I just have a small doubt: in one of the SNOB examples, there is a demo that uses '5' and '3' as the name-value pair to determine the optimal clusters. How was this determined, and is it possible to change the name-value pair based on the dataset, or should it always be '5' and '3'? I suppose, since it is not hard-coded, one can change the name-value pair, but what is the criterion for doing so?

MATLAB Release Compatibility
Created with R2019a
Compatible with any release
Platform Compatibility
Windows macOS Linux
