EMPCA calculates principal components using an expectation-maximization (EM) algorithm, finding each component in the residual matrix that remains after subtracting the previously converged principal components.
EMPCA_W accepts a weight matrix to use in the weighted EM algorithm.
EMPCA_NAN accepts a data matrix containing NaNs to use in the missing-data EM algorithm.
An informative message reports the number of EM iterations computed for each component, indicating whether convergence was reached within the given tolerance or the iterations were stopped at the maximum count.
This implementation is especially useful for handling large matrices, and runs fast on gpuArray matrices.
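The procedure described above can be illustrated with a minimal sketch. It is not the code of this submission: the variable names, the synthetic data, and the convergence test on the change in direction are all assumptions made for illustration, and it covers only the unweighted case without missing values.

```matlab
% Sketch of EM-based PCA with deflation. All names are illustrative
% and do not come from the EMPCA submission.
n = 200; p = 10; ncomps = 3;
emtol = 1e-10; maxiters = 500;

% Synthetic low-rank data with well-separated singular values plus noise.
[Q1, ~] = qr(randn(n, ncomps), 0);
[Q2, ~] = qr(randn(p, ncomps), 0);
X = Q1 * diag([10 5 2]) * Q2' + 0.01 * randn(n, p);
X = X - mean(X, 1);                  % center each variable (column)

R = X;                               % residual matrix, deflated per component
coeff  = zeros(p, ncomps);           % principal directions (unit columns)
scores = zeros(n, ncomps);           % projections of the observations
for k = 1:ncomps
    v = randn(p, 1); v = v / norm(v);    % random starting direction
    for it = 1:maxiters
        c = R * v;                       % E-step: coefficients given v (unit norm)
        vnew = R' * c / (c' * c);        % M-step: direction given c
        vnew = vnew / norm(vnew);
        delta = norm(vnew - v);          % change in direction this iteration
        v = vnew;
        if delta < emtol, break; end
    end
    if delta < emtol
        fprintf('component %d: converged after %d iterations\n', k, it);
    else
        fprintf('component %d: stopped at %d iterations\n', k, maxiters);
    end
    c = R * v;
    coeff(:, k)  = v;
    scores(:, k) = c;
    R = R - c * v';                      % subtract the converged component
end
```

Because each converged component is removed from the residual before the next one is sought, the columns of `coeff` come out mutually orthogonal, and only matrix-vector products are needed inside the loop.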
The algorithm is described in
Bailey, Stephen. "Principal Component Analysis with Noisy and/or Missing Data." Publications of the Astronomical Society of the Pacific 124.919 (2012): 1015-1023.
http://arxiv.org/pdf/1208.4122v2.pdf
Vicente Parot (2020). EMPCA (https://www.mathworks.com/matlabcentral/fileexchange/45353-empca), MATLAB Central File Exchange. Retrieved .
Hello!
I am trying to compare the speedup achieved using the GPU vs. the CPU. When I use MATLAB's timeit function to measure the time, it gives an error. And according to the MATLAB documentation, tic and toc cannot be used with gpuArray. Please advise how to measure the speedup. Thanks.
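One way to approach this kind of comparison is gputimeit from the Parallel Computing Toolbox, which is intended for timing functions that run on the GPU; tic and toc can also be used provided wait(gpuDevice) is called before toc so the GPU has finished. The sketch below times a stand-in workload (a matrix product, named `work` here as a placeholder) rather than EMPCA itself, since the exact call being benchmarked is not shown; substitute the real call in the function handle.

```matlab
% Stand-in workload; replace with the call being benchmarked.
A = rand(2000, 'single');
work = @(M) M * M';                      % hypothetical placeholder workload

tCpu = timeit(@() work(A));              % CPU timing
fprintf('CPU: %.4f s\n', tCpu);

% The GPU part needs Parallel Computing Toolbox and a supported device.
if exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0
    G = gpuArray(A);
    tGpu = gputimeit(@() work(G));       % waits for the GPU to finish
    fprintf('GPU: %.4f s, speedup %.1fx\n', tGpu, tCpu / tGpu);

    % Alternative with tic/toc: synchronize before reading the clock.
    dev = gpuDevice;
    B = work(G); wait(dev);              % warm-up run
    tic; B = work(G); wait(dev); tGpu2 = toc;
    fprintf('GPU (tic/toc): %.4f s\n', tGpu2);
end
```

A single tic/toc measurement is noisier than gputimeit, which runs the function several times, so the two GPU figures will generally not agree exactly.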
I have a problem running this. Could I request the full product?
Please help me.
Many thanks to Vicente Parot! Very useful code. I used the empca_nan.m function and compared it with the standard pca.m from MATLAB:
(1) the traditional PCA is: Data = mean(Data,1) + score*Coeff', with [Coeff,score,latent,tsquared,explained,mu] = pca(Data);
(2) the empca_nan.m version is: Data = score1*s*Coeff' + A, with [score1, s, coeff, a] = empca_nan(a, ncomps, emtol, maxiters);
The score from MATLAB's pca.m corresponds to score = score1*s.
It is much faster than pca(Data,'algorithm','als').
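The relation score = score1*s is what one would expect if the three factors play the roles of U, S, and V in a singular value decomposition of the centered data. The sketch below checks that relation using only svd, so it does not call pca or empca_nan; that empca_nan's outputs map onto U, S, and V in this way is an assumption taken from the comment above, not something verified against the submission.

```matlab
% Centered data factorizes as Xc = U*S*V'. PCA scores are U*S and the
% coefficients are V (each column defined only up to sign).
Data = randn(50, 6) * randn(6, 6);
n  = size(Data, 1);
mu = mean(Data, 1);                  % column means
Xc = Data - mu;                      % centered data

[U, S, V] = svd(Xc, 'econ');
score  = U * S;                      % analogue of score = score1*s
coeff  = V;
latent = diag(S).^2 / (n - 1);       % variance along each component

recon = mu + score * coeff';         % Data = mean + score*coeff'
err = norm(recon - Data, 'fro');
fprintf('reconstruction error: %.2e\n', err);
```

Keeping the singular values separate from unit-norm scores, as a U, S, V factorization does, is a matter of convention; multiplying them back together recovers the scores that pca reports.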
Dear Lowell,
The updated submission adds implementation of the weighted and missing data versions of the algorithm described in the reference.
This makes it possible to compute PCA with missing values.
Let me be more specific. Each observation in the data array (described in my initial comment) is comprised of several interrelated subsets of data (of varying lengths), which cumulatively account for the 20,000 variables.
The missing values describe an entire subset of variables in a given observation. That is, if there are 5 sets of variables in each observation, the missing subset accounts for some of the 20,000 variables.
I have a data array that is about 50x20,000 (50 observations of 20,000 variables). Some of the variable values in one of the observations are missing and I would like to estimate these missing values. If it is possible to do that with this code, please advise. Thanks!
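As an illustration of how missing entries can be estimated from a low-rank model, here is a minimal sketch of iterative SVD imputation. This is a simple relative of the missing-data EM approach and is not the empca_nan code; the matrix size is reduced from 50x20,000 for speed, and the rank `k`, the tolerance, and the synthetic data are assumptions chosen for the example.

```matlab
% Iterative low-rank imputation: fill the gaps, fit a rank-k model,
% refill the gaps from the model, and repeat until the fill stops changing.
n = 50; p = 200; k = 3;              % small stand-in for a 50 x 20,000 array
truth = randn(n, k) * randn(k, p);   % synthetic rank-k data
Data = truth;
Data(7, 41:80) = NaN;                % one observation lacks a block of variables
miss = isnan(Data);

% Start by filling each gap with the mean of the observed values in its column.
Xf = Data; Xf(miss) = 0;
mu = sum(Xf, 1) ./ sum(~miss, 1);
muMat = repmat(mu, n, 1);
Xf(miss) = muMat(miss);

tol = 1e-8; maxiters = 200;
for it = 1:maxiters
    mu = mean(Xf, 1);
    [U, S, V] = svd(Xf - mu, 'econ');
    low = mu + U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % rank-k reconstruction
    change = norm(Xf(miss) - low(miss)) / max(norm(low(miss)), eps);
    Xf(miss) = low(miss);            % update only the missing entries
    if change < tol, break; end
end

estimate = Xf(7, 41:80);             % estimated values for the missing block
relerr = norm(estimate - truth(7, 41:80)) / norm(truth(7, 41:80));
fprintf('stopped after %d iterations, relative error %.2e\n', it, relerr);
```

The estimate can only be as good as the low-rank assumption: the missing block is recovered from the other observations only to the extent that the variables in it are correlated with variables observed in the same row.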