Gaussian Mixture Models

Introduction

Gaussian mixture models are formed by combining multivariate normal density components. For information on individual multivariate normal densities, see Multivariate Normal Distribution and related distribution functions listed under Multivariate Distributions.

In Statistics Toolbox software, mixture models of the @gmdistribution class are fit to data using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation.
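For orientation, the textbook form of one EM iteration for a K-component Gaussian mixture can be written as follows. The symbols are assumptions introduced here, not toolbox names: \pi_k is the mixing proportion of component k, \mu_k and \Sigma_k are its mean and covariance, and \gamma_{ik} is the posterior probability of component k for observation x_i. The toolbox implementation may differ in details such as initialization and regularization.

```latex
% E-step: posterior probability of component k for observation x_i
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

% M-step: re-estimate the parameters from the posterior-weighted data
N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad
\pi_k = \frac{N_k}{n}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, x_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\mathsf{T}}
```

The two steps alternate until the log-likelihood stops improving, which is why the result is a local rather than a global optimum.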

Gaussian mixture models are often used for data clustering. Each observation is assigned to the component that maximizes its posterior probability. Like k-means clustering, Gaussian mixture modeling uses an iterative algorithm that converges to a local optimum. Gaussian mixture modeling may be more appropriate than k-means clustering when clusters differ in size or when the variables within a cluster are correlated.

Creation of Gaussian mixture models is described in the Gaussian Mixture Models section of Probability Distributions. This section describes their application in cluster analysis.

Clustering with Gaussian Mixtures

Use the cluster method of the @gmdistribution class to cluster data with Gaussian mixture models. The method takes as input a gmdistribution object obj and a data matrix X. The method assigns a cluster to each observation in X by choosing the component of obj with the largest posterior probability, weighted by the component probability.

The following example illustrates this procedure.

First, generate data from a mixture of two bivariate Gaussian distributions using the mvnrnd function:

MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];

scatter(X(:,1),X(:,2),10,'.')
hold on

Next, fit a two-component Gaussian mixture model:

options = statset('Display','final');
obj = gmdistribution.fit(X,2,'Options',options);
% Displayed output (values vary with the random sample):
% 10 iterations, log-likelihood = -7046.78

h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
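As a quick sanity check, the estimated parameters can be compared with the values used to generate the data. The sketch below assumes obj from the fit above is still in the workspace; the variable names estMu, estSigma, and estP are introduced here for illustration only. The order of the fitted components is arbitrary, so component 1 of obj does not necessarily correspond to MU1 and SIGMA1.

```matlab
% Assumes obj from the fit above is in the workspace.
estMu = obj.mu              % one estimated mean per row (2-by-2 here)
estSigma = obj.Sigma        % one covariance matrix per page (2-by-2-by-2 here)
estP = obj.PComponents      % estimated mixing proportions (1-by-2 here)
```

With 1000 points drawn from each component, the estimated mixing proportions should both be close to 0.5.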

Finally, use the fit to cluster the data:

idx = cluster(obj,X);
cluster1 = X(idx == 1,:);
cluster2 = X(idx == 2,:);

delete(h)
h1 = scatter(cluster1(:,1),cluster1(:,2),10,'r.');
h2 = scatter(cluster2(:,1),cluster2(:,2),10,'g.');
legend([h1 h2],'Cluster 1','Cluster 2','Location','NW')

The posterior method of the @gmdistribution class returns the posterior probability of each component for each observation. These are the probabilities that the cluster method uses to assign clusters:

P = posterior(obj,X);

figure
scatter(X(:,1),X(:,2),10,P(:,1),'.')
hb = colorbar;
ylabel(hb,'Component 1 Probability')
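The link between posterior and cluster can be checked directly: taking the most probable component in each row of P should reproduce the cluster assignments. This is a minimal sketch that assumes obj, X, idx, and P from the steps above are in the workspace; maxP, idxFromP, and agree are names introduced here.

```matlab
% Assumes obj, X, idx, and P from the steps above are in the workspace.
[maxP,idxFromP] = max(P,[],2);   % most probable component for each observation
agree = isequal(idx,idxFromP)    % expected to be true unless posteriors tie
```

Rows of P in which maxP is close to 0.5 mark observations that lie near the boundary between the two clusters.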

The mahal method of the @gmdistribution class computes the squared Mahalanobis distance from each observation to the mean of each component:

D = mahal(obj,X);

figure
scatter(X(:,1),X(:,2),10,D(:,1),'.')
hb = colorbar;
ylabel(hb,'Mahalanobis Distance to Component 1')
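To make the definition concrete, the squared distance to one component can be recomputed from the fitted mean and covariance and compared with the output of mahal. This sketch assumes obj and X from the example above are in the workspace and that the model was fit with the default full, unshared covariance, so that obj.Sigma holds one matrix per component. The names k, n, Xc, Dmanual, and maxDiff are introduced here.

```matlab
% Assumes obj and X from the example above are in the workspace.
k = 1;                                   % component to check
n = size(X,1);
Xc = X - repmat(obj.mu(k,:),n,1);        % center the data on the component mean
% Squared Mahalanobis distance: (x - mu) * inv(Sigma) * (x - mu)' for each row
Dmanual = sum((Xc / obj.Sigma(:,:,k)) .* Xc, 2);
D = mahal(obj,X);
maxDiff = max(abs(Dmanual - D(:,k)))     % expected to be near zero
```

Differences on the order of floating-point round-off are expected, because the two computations may factor the covariance matrix differently.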

  


© 1984-2008 The MathWorks, Inc.