Code covered by the BSD License  

Highlights from
EM algorithm for Gaussian mixture model

4.63158

4.6 | 20 ratings Rate this file 489 Downloads (last 30 days) File Size: 19.48 KB File ID: #26184
image thumbnail

EM algorithm for Gaussian mixture model

by Mo Chen

 

23 Dec 2009 (Updated 29 Feb 2012)

EM algorithm for Gaussian mixture. Works on arbitray dimensions with high speed and precision.

| Watch this File

File Information
Description

This is a function tries to obtain the maximum likelihood estimation of Gaussian mixture model by expectation maximization (EM) algorithm.

It works on data set of arbitrary dimensions. Several techniques are applied to avoid the float number underflow problems that often occurs on computing probability of high dimensional data. Also the code is carefully tuned to be efficient by utilizing vertorization and matrix factorization.

This is a widely used algorithm. The detail of this algorithm can be found in many textbooks or tutorials online. Just google EM Gaussian Mixture or you can read the wiki page:
http://en.wikipedia.org/wiki/Expectation-maximization_algorithm

This function is robust and efficient yet the code structure is organized so that it is easy to read.

example:
load data;
label = emgm(x,3);
spread(x,label);

Besides using EM to fit GMM, I highly recommend you to try another submission of mine: Variational Bayesian Inference for Gaussian Mixture Model
(http://www.mathworks.com/matlabcentral/fileexchange/35362-variational-bayesian-inference-for-gaussian-mixture-model) which perform Bayesian inference on GMM. It has the advantage that the number of mixture components can be automatically identified by the algorithm.

For all the question regarding to use the code for image segmentation, you have to orgnize the image into a matrix, where each column is the feature vector of one pixel of the image. For example, if RGB value is used, for a 10x10 image the data matrix is a 3x100 matrix where each column is a vector of RGB value of a pixel.

Acknowledgements

The author wishes to acknowledge the following in the creation of this submission:
Variational Bayesian Inference for Gaussian Mixture Model
This submission has inspired the following:
EM algorithm for Gaussian mixture model with background noise

MATLAB release MATLAB 7.9 (2009b)
Tags for This File  
Everyone's Tags
Tags I've Applied
Add New Tags Please login to tag files.
Comments and Ratings (56)
09 Feb 2010 Gayathri

Produces the following error with the above steps.

label = emgmm(x,3);
??? Error: File: emgmm.m Line: 21 Column: 7
Expression or statement is incorrect--possibly unbalanced (, {, or [.

09 Feb 2010 Mo Chen

Before you give any bad rating, you should really notice that this function require MATLAB 7.9 (2009b).
It use a new feature of matlab.
upgrade your matlab, or you can modify all
[~,x]=fun();
to
[dum,x]=fun();

16 Feb 2010 Daniil Kocharov

Sorry for asking such a silly question:
I got an error trying to use 1d data.
Error using ==> loggausspdf at 7 ERROR: sigma is not SPD.
Error in ==> emgmm>expectation at 68
    R(i,:) = loggausspdf(X,mu(:,i),Sigma(:,:,i));
I think that's because Sigma is:
Sigma(:,:,1) = NaN
Sigma(:,:,2) = NaN
Sigma(:,:,3) = 7.0826e-005
But why is it NaN I cannot understand, or is there anything else wrong?
Thanks.

16 Feb 2010 Mo Chen

Can you send me your data via email?

24 Jun 2010 Tianfan XUE  
22 Sep 2010 Patrick

Apologize for the following simple question. What exactly does the sigma data mean from the example given? The first Sigma (1,1) is the sigma for the first estimated cluster and the second sigma (2,2) is for the first estimated cluster on the second row ??

Could you please clarify? Thanks.

>> model.Sigma

ans(:,:,1) =

    0.7227 0.8439
    0.8439 1.8252

ans(:,:,2) =

    0.2629 -0.1116
   -0.1116 0.2411

ans(:,:,3) =

    0.4209 -0.0600
   -0.0600 0.0967

19 Oct 2010 Nofil Barlas

Quick question. I ran the code and what the error:

>> load data;
label = emgm(x,3);
scatterd(x,label);
EM for Gaussian mixture: running ... ??? Error using ==> randsample at 117
K must be less than or equal to N for sampling without replacement.

Error in ==> emgm>initialization at 36
    idx = randsample(n,k);

Error in ==> emgm at 9
R = initialization(X,init);

Thanks.

27 Oct 2010 dattatray

Can You please tell me..about initialization which you have made.
generating random values is fine..
but i havent understood

use of maxVal=bsxfun(@minus,m'*X, sum(m.^2,1)'/2 )
[dum,label] = max(bsxfun(@minus,m'*X,sum(m.^2,1)'/2));
can you please tell me that...?
code is really wonderful,
but if i could get any theory material regarding functions you have written,especially for expectation,maximization,and log gaussian pdf..
please mail me on d.dattatray@gmail.com
thank you in advance...

27 Oct 2010 Mo Chen

Hi, Patrick
Sigma(:,:,1) is the covariance matrix of the first gaussian mixture component.

27 Oct 2010 Mo Chen

Hi dattatray,
Take a look at the comment in the code of
http://www.mathworks.com/matlabcentral/fileexchange/24616-kmeans-clustering
You may get the idea.

27 Oct 2010 Mo Chen

Hi, Nofil Barlas,
Maybe you forget clear your memory before load the data.

28 Oct 2010 dattatray

Thank you Very much sir..!!

02 Nov 2010 Giang Le

Can you please let me know how to fix the ERROR: sigma is not SPD?

03 Nov 2010 Mo Chen

Giang Le,
How does it happen? The function can hardly produce a non positive definite sigma.
However, if it does, you may try to change the sigma0 in line 76 to be a larger value.

03 Nov 2010 Giang Le

Hi Michael,
Thanks for a quick reply. Here is the problem, I am try to clustering 11208 samples to 128 with dim is 14.
> x = rand(14,11208);
>[est_label,model] = emgm(x,128);
EM for Gaussian mixture: running ... ??? Error using ==> loggausspdf at 7
ERROR: sigma is not SPD.

Error in ==> emgm>expectation at 65
    R(:,i) = loggausspdf(X,mu(:,i),Sigma(:,:,i));

Error in ==> emgm at 16
    [R, llh(t)] = expectation(X,model);
I have tried to increase the sigma0 but the problem is still there.

03 Nov 2010 Mo Chen

Not happened here. which version of matlab are your using?

04 Nov 2010 Giang Le

i see now, I have tried with 2009a and earlier version and it gave me error when i increased number of clusters. Work fine with 2009b although it is not converge.
I am very thankful for your reply.

23 Dec 2010 Nevine

Michael,
I am getting the error:
??? Error using ==> loggausspdf at 10
ERROR: sigma is not SPD.
I am using matlab Release R2010a.
The input data X is 24x57600 with 2 clusters.

labels = emgm(X, 2);

I will send you the data via email.

Thanks.

23 Dec 2010 Nevine

Michael,
The email address in the file bounced. Please send me your address so that I can email you the data file.
Thanks.

09 Feb 2011 Daniel Zoran  
04 Mar 2011 Michael Davis

A couple of minor bugs:

1. I came across the same problem as Nofil Barlas above when the size of the input vector is [ N 1 ]. Reshape to [ 1 N ] and it works.

2. If you tell it to find only 1 mixture, it keeps going until it runs out of memory. The code should either disallow an init parameter of 1, or else have a short function to handle this trivial case.

Otherwise, great, very useful! Thanks.

07 Mar 2011 Silvina

Dear Michael,
I'm trying to use your code on images (using reshape to give them a vector structure) and I'm getting the following error message:
Error using ==> loggausspdf at 10
ERROR: sigma is not SPD.

Interestingly, after calling the command many times, the function eventually works.

Any feedback on this issue will be greatly appreciated!

13 Mar 2011 Mo Chen

Silvina,
Hi, I will upload a new version. Please try it and tell me if it still happens.

27 Apr 2011 Enrique Martí  
27 Apr 2011 Neha

k = 1;
X = imread('image.png');
label = emgm(X,3);
spread(X,label);

Please tell me how to fix the errors listed below:

EM for Gaussian mixture: running ...
??? Error using ==> mtimes
Integer data types are not fully supported for this operation.
At least one operand must be a scalar.

Error in ==> emgm>initialization at 46
    [dum,label] = max(bsxfun(@minus,m'*X,sum(m.^2,1)'/2),[],1);

Error in ==> emgm at 8
R = initialization(X,init);

Error in ==> Untitled at 2
label = emgm(X,3);
  

04 May 2011 Brian

Found this pointless piece of code in the initialization:

while k ~= unique(label)
        idx = randsample(n,k);
        m = X(:,idx);
        [~,label] = max(bsxfun(@minus,m'*X,sum(m.^2,1)'/2),[],1);
end

Unless I am missing something, I'm assuming you were trying to make sure at least one point is assigned to each cluster? Well, this just checks if at least 1 point is assigned to the kth cluster. E.g. try:
k=5;
if k ~= unique([5;5;5;5])
disp('Bad!');
else
disp('OK');
end
it will say that label assignment is OK.

Furthermore since you draw the centers from the points themselves, there will always be at least 1 point in each cluster, making even the intended code pointless.

You may want to use another strategy to ensure centers are chosen that take more than a single point for instance.

Another common initialization strategy is to partition the points randomly into k clusters.

15 May 2011 Anathea Pepperl  
20 May 2011 Venkat R

Dear Chen,
Very good and fast implementation.
I guess the normal distribution you are using is exp( -(x-mu)^2/2*sigma^2 )/sqrt(2*pi*sigma^2)

In that case, if I were to slightly modify the sigma by w*sigma(or mu by w*mu), where 'w' is another design parameter, Can you help me which functions I need to change to utilize your code.

Thanking you very much.

10 Jun 2011 Mo Chen

To Brian,
"Furthermore since you draw the centers from the points themselves, there will always be at least 1 point in each cluster, making even the intended code pointless."
If the initialized k centers are very close, after one iteration of the EM, there will be only one cluster there.
This piece of code simply prevent this from happening. It ensure that there is no more than two initialized centers belong to one cluster.

10 Jun 2011 Mo Chen

To Venkat R,

This code uses general form ofthe multivariable Gaussian distribution, not the one in your comment, which is simply the 1d special case.

You cannot arbitrayly add a parameter there. You have to ensure the density function is actually a valid density function (means it has to integrate to 1). Otherwise, EM does not work.

12 Jun 2011 minni sharma

To chen
can u send me a code for image fusion using EM algorithm please.

thanking you

23 Jun 2011 dattatray

I have a small Question,
suppose i have modeled a data.
it has 100 vectors each having 36 features.
so my input is 36 x 100
now i have given 5 clusters,model is trained now.
now i have set of 5 mu(36 x 1) and 5 sigma(36 x 36 x 5).
lets us say i have a unknown vector x of size(36 x 1)
now how to find out,in which cluster this particular vector fits..?
for each cluster we have only mu,and sigma,and for a single vector matlab gives sigle value of mean,and cov()
can you help me in this..?
if it would have been k-means,its easy to calculate euclidean distance with cluster centroid,which ever is minimum,that is answer..

29 Jun 2011 Ting

Dear Chen,
When I using EM to analysis my data, the result is always changing, and converge step is changing too, is there any way to make it stable?
Thank you

05 Jul 2011 HONGJING  
25 Jul 2011 Amish  
25 Jul 2011 Amish

Same question as Ting:
"converging steps are changing for the same data"
Must have to do with the latest Matlab release. I am using R2011b.
Michael, can you confirm?

27 Jul 2011 Amish

Fixed seed for random generator and got the same plot. This is a very useful utility. Many Thanks.

06 Aug 2011 Michael

Very easy to use and fast, but like some of the above posters, I am getting different results every time I run it on the same data.

23 Dec 2011 Nicolas

Hi, I try with 1 D array, and I have this problem

>> label = emgm(a,1);
EM for Gaussian mixture: running ...
Converged in 2 steps.
>> spread(a,label);
??? Error using ==> spread at 33
ERROR: only support data of 2D or 3D.

How I can solve it

thanks

13 Jan 2012 Mark

For people getting different results on each run, this is due to the use of psuedorandom number generators in initialization. Try setting the psuedorandom number seed:
http://www.mathworks.com/help/techdoc/ref/rng.html

23 Jan 2012 AISSA BOULMERKA

Thank You for this Excellent Work,
is there any paper that may help to understand the program?
thanks.

07 Feb 2012 Mo Chen

For any one having question about changing result:

Please read wiki page. EM is only finding local minimal, which means the result depend on initialization.

07 Feb 2012 Mo Chen

Nicolas:
I dont see problem

08 Feb 2012 Andreas

Hi Michael,

A small question: the randsample function called at line 44 seemingly does not exist, as I get an error. I am running R2011b. Are other, non-native, files required to run emgm? Thanks.

08 Feb 2012 Mo Chen

Hi Andreas, that function is in statistics toolbox. It random sample k integers between 1 to n.

14 Feb 2012 Prasath

we doing project on statistical pixel intensity segmentation of clsm images..
we need coding for gaussian mixture, normal distribution, poisson...
plss mail coding to tis mail id vinodhinybtech@gmail.com

20 Feb 2012 Adili neila

Hi Michael,
how can I use your code for images?
thanks

22 Feb 2012 Mo Chen

For all the questions about how to use is for image segmentation:
You have to organize the image into a matrix where each column is the feature of a pixel(say rgb)

14 Mar 2012 zheng zhou

Hi Michael,
I have a 65*100matrix,can I use this code to get the two-dimension GMM,in which the mu sigma and weight are two dimension.(z=f(x,y),f is the function for GMM)

14 Mar 2012 shamla  
14 Mar 2012 shamla

i need to apply gmm to iris dataset and obtain 3 clusters.i need to display the (index of datapoints)datapoints in each cluster.please help me.

18 Mar 2012 keerthi

hello sir, we are using em algorithm for detecting resampling (tampering of images). for this we need to get the fourier transform of the probability map. how can we modify this code for the above purpose. kindly help.
my mail id : keerthi2412@gmail.com

16 Apr 2012 fateme

hello,I want to apply emgm on adult dataset,which it's attributes are both categorical and numerical,I tried to apply clustering on data saved in dataset and in cell array,but this data types are not defined for emgm. can emgm be used for string array?pleas help me.tnx

16 Apr 2012 Cong

Excellent Work! Thank you !

30 Apr 2012 Bogdan Dzyubak

Simple to use, fast, and doesn't crash.

18 May 2012 marouane ayech

Hi Michael,
I want to know wheither there is a theoretical proof for the technique you have used in logsumexp to avoid numerical underflow ?

Please login to add a comment or rating.
Updates
25 Dec 2009

fix missing files

28 Dec 2009

add missing files

24 Jan 2010

update description

27 Jan 2010

Fixed missing file

04 Mar 2010

fix bug for 1d data

04 Mar 2010

fix bug for 1d data

30 Sep 2010

reorganize and clean the code a bit

03 Nov 2010

update loggausspdf due to api change of matlab

13 Mar 2011

Fix several minor bugs and reorganize the code structure a bit.

04 Feb 2012

tuning

29 Feb 2012

Update description

Tag Activity for this File
Tag Applied By Date/Time
em Mo Chen 23 Dec 2009 16:06:54
gmm Mo Chen 23 Dec 2009 16:06:54
mle Mo Chen 23 Dec 2009 16:06:54
gaussian Mo Chen 25 Dec 2009 07:22:31
mixture Mo Chen 25 Dec 2009 07:22:39
clustering Mo Chen 28 Dec 2009 13:38:56
mixture jose avila 11 May 2010 22:45:45
em jose avila 11 May 2010 23:23:02
em dattatray 27 Oct 2010 06:34:53
clustering li li 09 Nov 2010 02:15:58
gmm li li 09 Nov 2010 02:31:39
em li li 09 Nov 2010 02:31:48
gaussian xin sun 21 Apr 2011 04:13:12
clustering yuki 25 Jun 2011 10:46:04
clustering uunnaa 10 Aug 2011 01:43:57
em uunnaa 10 Aug 2011 01:44:00
em satyasis 29 Aug 2011 00:44:00
em Przemyslaw 20 Sep 2011 08:08:23
em z cn 07 Dec 2011 01:09:22
gmm kamil kamil 16 May 2012 10:34:35
mle Boguslaw Obara 30 May 2012 08:31:53

Contact us at files@mathworks.com