
gmdistribution.fit

Class: gmdistribution

Gaussian mixture parameter estimates

    Note:   fit will be removed in a future release. Use fitgmdist instead.

Syntax

obj = gmdistribution.fit(X,k)
obj = gmdistribution.fit(...,param1,val1,param2,val2,...)

Description

obj = gmdistribution.fit(X,k) uses an Expectation Maximization (EM) algorithm to construct an object obj of the gmdistribution class containing maximum likelihood estimates of the parameters in a Gaussian mixture model with k components for data in the n-by-d matrix X, where n is the number of observations and d is the dimension of the data.
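
For example, a minimal sketch with simulated data (the seed, sample sizes, and cluster offset here are illustrative assumptions):

rng(1)                                % fix the random seed for reproducibility
X = [randn(500,2); randn(500,2)+4];   % 1000-by-2 simulated two-cluster data
obj = gmdistribution.fit(X,2);        % fit a two-component Gaussian mixture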

gmdistribution treats NaN values as missing data. Rows of X with NaN values are excluded from the fit.
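
As a quick sketch, you can count how many rows of X actually enter the fit after this exclusion:

nUsed = sum(all(~isnan(X),2))   % rows of X that contain no NaN values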

obj = gmdistribution.fit(...,param1,val1,param2,val2,...) provides control over the iterative EM algorithm. Parameters and values are listed below.

'Start'

Method used to choose initial component parameters. One of the following:

  • 'randSample' — Select k observations from X at random as the initial component means. The initial mixing proportions are uniform. The initial covariance matrices for all components are diagonal, where element j on the diagonal is the variance of X(:,j). This is the default.

  • 'plus' — Select k observations from X using the k-means++ algorithm as the initial component means. The initial mixing proportions are uniform. The initial covariance matrices for all components are diagonal, where element j on the diagonal is the variance of X(:,j).

  • S — A structure array with fields mu, Sigma, and PComponents. See gmdistribution for descriptions of the values, and the sketch after this parameter list for an example.

  • s — A vector of length n containing an initial guess of the component index for each point.

'Replicates'

A positive integer giving the number of times to repeat the EM algorithm, each time with a new set of parameters. The solution with the largest likelihood is returned. A value larger than 1 requires the 'randSample' start method. The default is 1.

'CovType'

'diagonal' if the covariance matrices are restricted to be diagonal; 'full' otherwise. The default is 'full'.

'SharedCov'

Logical true if all the covariance matrices are restricted to be the same (pooled estimate); logical false otherwise. The default is false.

'Regularize'

A nonnegative regularization number added to the diagonal of covariance matrices to make them positive-definite. The default is 0.

'Options'

Options structure for the iterative EM algorithm, as created by statset. gmdistribution.fit uses the parameters 'Display' with a default value of 'off', 'MaxIter' with a default value of 100, and 'TolFun' with a default value of 1e-6.
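
The following sketch shows both a structure start and several other name/value pairs; the structure S, the variable names obj1 and obj2, and the specific values (tolerance, regularization, replicate count) are illustrative assumptions, not recommendations:

% Initial parameter guess supplied through the 'Start' structure
S.mu = [0 0; 3 3];                % k-by-d initial component means (assumed values)
S.Sigma = cat(3,eye(2),eye(2));   % d-by-d-by-k initial covariance matrices
S.PComponents = [0.5 0.5];        % 1-by-k initial mixing proportions
obj1 = gmdistribution.fit(X,2,'Start',S);

% Several other name/value pairs combined in one call
options = statset('Display','final','MaxIter',500,'TolFun',1e-5);
obj2 = gmdistribution.fit(X,2, ...
    'Replicates',5, ...           % repeat EM from five random starts
    'CovType','diagonal', ...     % restrict each covariance to diagonal
    'SharedCov',true, ...         % pool one covariance across all components
    'Regularize',1e-5, ...        % small ridge on each covariance diagonal
    'Options',options);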

In some cases, gmdistribution may converge to a solution where one or more of the components has an ill-conditioned or singular covariance matrix.

The following issues may result in an ill-conditioned covariance matrix:

  • The number of dimensions of your data is relatively high, and there are not enough observations.

  • Some of the features (variables) of your data are highly correlated.

  • Some or all of the features are discrete.

  • You tried to fit the data to too many components.

In general, you can avoid getting ill-conditioned covariance matrices by using one of the following precautions:

  • Pre-process your data to remove correlated features.

  • Set 'SharedCov' to true to use an equal covariance matrix for every component.

  • Set 'CovType' to 'diagonal'.

  • Use 'Regularize' to add a very small positive number to the diagonal of every covariance matrix.

  • Try another set of initial values.

In other cases, gmdistribution may pass through an intermediate step where one or more of the components has an ill-conditioned covariance matrix. Trying another set of initial values may avoid this issue without altering your data or model.
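
For instance, a minimal sketch combining two of the precautions above (the regularization value and replicate count are illustrative assumptions):

obj = gmdistribution.fit(X,2, ...
    'Regularize',1e-5, ...   % small positive number added to each covariance diagonal
    'Replicates',10);        % several random initializations; keep the best likelihood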

Examples

Generate data from a mixture of two bivariate Gaussian distributions using the mvnrnd function:

MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];

scatter(X(:,1),X(:,2),10,'.')
hold on

Next, fit a two-component Gaussian mixture model:

options = statset('Display','final');
obj = gmdistribution.fit(X,2,'Options',options);
10 iterations, log-likelihood = -7046.78

h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);

Among the properties of the fit are the parameter estimates:

ComponentMeans = obj.mu
ComponentMeans =
    0.9391    2.0322
   -2.9823   -4.9737

ComponentCovariances = obj.Sigma
ComponentCovariances(:,:,1) =
    1.7786   -0.0528
   -0.0528    0.5312
ComponentCovariances(:,:,2) =
    1.0491   -0.0150
   -0.0150    0.9816

MixtureProportions = obj.PComponents
MixtureProportions =
    0.5000    0.5000

The Akaike information criterion (AIC) is minimized by the two-component model:

AIC = zeros(1,4);
obj = cell(1,4);
for k = 1:4
    obj{k} = gmdistribution.fit(X,k);
    AIC(k)= obj{k}.AIC;
end

[minAIC,numComponents] = min(AIC);
numComponents
numComponents =
     2

model = obj{2}
model = 
Gaussian mixture distribution
with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.500000
Mean:     0.9391    2.0322
Component 2:
Mixing proportion: 0.500000
Mean:    -2.9823   -4.9737

Both the Akaike and Bayes information criteria are negative log-likelihoods of the data with penalty terms for the number of estimated parameters. They are often used to determine an appropriate number of components for a model when that number is unspecified.
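
As a sketch, the loop used for AIC above works the same way for the BIC property of each fitted model:

BIC = zeros(1,4);
for k = 1:4
    BIC(k) = obj{k}.BIC;     % BIC property of the k-component fit
end
[minBIC,numComponentsBIC] = min(BIC)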

