This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English verison of the page.

Note: This page has been translated by MathWorks. Please click here
To view all translated materals including this page, select Japan from the country navigator on the bottom of this page.


Probabilistic principal component analysis


  • [coeff,score,pcvar] = ppca(Y,K)
  • [coeff,score,pcvar] = ppca(Y,K,Name,Value)
  • [coeff,score,pcvar,mu] = ppca(___)
  • [coeff,score,pcvar,mu,v,S] = ppca(___)



[coeff,score,pcvar] = ppca(Y,K) returns the principal component coefficients for the n-by-p data matrix Y based on a probabilistic principal component analysis (PPCA). It also returns the principal component scores, which are the representations of Y in the principal component space, and the principal component variances, which are the eigenvalues of the covariance matrix of Y, in pcvar.

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. Rows of score correspond to observations, and columns correspond to components. Rows of Y correspond to observations and columns correspond to variables.

Probabilistic principal component analysis might be preferable to other algorithms that handle missing data, such as the alternating least squares algorithm when any data vector has one or more missing values. It assumes that the values are missing at random through the data set. An expectation-maximization algorithm is used for both complete and missing data.


[coeff,score,pcvar] = ppca(Y,K,Name,Value) returns the principal component coefficients, scores, and variances using additional options for computation and handling of special data types, specified by one or more Name,Value pair arguments.

For example, you can introduce initial values for the residual variance, v, or change the termination criteria.


[coeff,score,pcvar,mu] = ppca(___) also returns the estimated mean of each variable in Y. You can use any of the input arguments in the previous syntaxes.


[coeff,score,pcvar,mu,v,S] = ppca(___) also returns the isotropic residual variance in v and the final results at convergence in structure S.


collapse all

Load the sample data.

load fisheriris

The double matrix meas consists of four types of measurements on the flowers, which, respectively, are the length and width of sepals and petals.

Introduce missing values randomly.

y = meas;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.20;
y(ix) = NaN;

Now, approximately 20% of the data is missing, indicated by NaN.

Perform probabilistic principal component analysis and request the component coefficients and variances.

[coeff,score,pcvar,mu] = ppca(y,3);
coeff =

    0.3562    0.6709   -0.5518
   -0.0765    0.7121    0.6332
    0.8592   -0.1596    0.0596
    0.3592   -0.1318    0.5395
pcvar =


Perform principal component analysis using the alternating least squares algorithm and request the component coefficients and variances.

[coeff2,score2,pcvar2,mu2] = pca(y,'algorithm','als',...
coeff2 =

    0.3376    0.4955    0.7404
   -0.0731    0.8607   -0.4479
    0.8657   -0.1169   -0.1231
    0.3623   -0.0087   -0.4859
pcvar2 =


The coefficients and the variances of the first two principal components are similar.

Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors.

ans =


The angle between the two spaces is pretty small. This indicates that these two results are close to each other.

Load the sample data set.

load imports-85

Data matrix X has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg. The variables bore and stroke are missing four values in rows 56 to 59, and the variables horsepower and peak-rpm are missing two values in rows 131 and 132.

Perform probabilistic principal component analysis and display the first three principal components.

[coeff,score,pcvar] = ppca(X(:,3:15),3);
Warning: Maximum number of iterations 1000 reached'. 
> In ppca at 249 

Change the termination tolerance for the cost function to 0.01.

opt = statset('ppca');
opt.TolFun = 0.01;

Perform probabilistic principal component analysis.

[coeff,score,pcvar] = ppca(X(:,3:15),3,'Options',opt);

ppca now terminates before the maximum number of iterations is reached because it meets the tolerance for the cost function.

Load the sample data.

load hald
y = ingredients;

The ingredients data has 13 observations for 4 variables.

Introduce missing values to the data.

y(2:16:end) = NaN;

Every 16th value is NaN. This corresponds to 7.69% of the data.

Find the first three principal components of data using PPCA and display the reconstructed observations.

[coeff,score,pcvar,mu,v,S] = ppca(y,3);
ans =

    6.8533   25.8675    5.8388   59.8755
    1.0431   28.9690   14.9652   51.9758
   11.5770   56.5080    8.6352   20.5062
   11.0833   31.0707    8.0920   47.0764
    7.0684   52.2539    6.0753   33.0597
   11.0486   55.0442    9.0534   22.0410
    2.8494   70.8719   16.8338    5.8624
    1.0331   31.0267   19.6906   44.0321
    2.0401   54.0364   18.0440   22.0337
   20.7823   46.8096    3.7603   25.8075
    0.9540   39.9590   22.9495   31.1540
   10.8251   65.8498    8.8072   11.8420
    9.9174   67.9309    7.9087   11.9230

You can also reconstruct the observations using the principal components and the estimated mean.

t = score*coeff' + repmat(mu,13,1);

Load the sample data.

load hald

Here, ingredients is a real-valued matrix of predictor variables.

Perform the probabilistic principal components analysis and display coefficients.

[coeff,score,pcvariance,mu,v,S] = ppca(ingredients,3);
coeff =

   -0.0693   -0.6459    0.5673
   -0.6786   -0.0184   -0.5440
    0.0308    0.7552    0.4036
    0.7306   -0.1102   -0.4684

Display the algorithm results at convergence of the PPCA.

S = 

           W: [4x3 double]
        Xexp: [13x3 double]
       Recon: [13x4 double]
           v: 0.2372
     NumIter: 1000
    RMSResid: 0.2340
      nloglk: 149.3388

Display the matrix W.

ans =

    0.5624    2.0279    5.4075
    4.8320  -10.3894    5.9202
   -3.7521   -3.0555   -4.1552
   -1.5144   11.7122   -7.2564

Orthogonalizing W recovers the coefficients.

ans =

   -0.0693    0.6459    0.5673
   -0.6786    0.0184   -0.5440
    0.0308   -0.7552    0.4036
    0.7306    0.1102   -0.4684

Input Arguments

collapse all

Input data for which to compute the principal components, specified as an n-by-p matrix. Rows of Y correspond to observations and columns correspond to variables.

Data Types: single | double

Number of principal components to return, specified as an integer value less than the rank of data. The maximum possible rank is min(n,p), where n is the number of observations and p is the number of variables. However, if the data is correlated, the rank might be smaller than min(n,p).

ppca orders the components based on their variance.

If K is min(n,p), ppca sets K equal to min(n,p) – 1, and 'W0' is truncated to min(p,n) – 1 columns if you specify a p-by-p W0 matrix.

For example, you can request only the first three components, based on the component variance as follows.

Example: coeff = ppca(Y,3)

Data Types: single | double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'W0',init,'Options',opt specifies that the initial values for 'W0' are in matrix init and ppca uses the options defined by opt.

collapse all

Initial value of W in the probabilistic principal component analysis algorithm, specified as a comma-separated pair consisting of 'W0' and a p-by-k matrix.

Data Types: single | double

Initial value of residual variance, specified as the comma-separated pair consisting of 'v0' and a positive scalar value.

Data Types: single | double

Options for the iterations, specified as a comma-separated pair 'Options' and a structure created by the statset function. ppca uses the following fields in the options structure.

'Display'Level of display output. Choices are 'off', 'final', and 'iter'.
'MaxIter'Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching the MaxIter value is regarded as convergence.
'TolFun'Positive integer stating the termination tolerance for the cost function. The default is 1e-6.
'TolX'Positive integer stating the convergence threshold for the relative change in the elements of W. The default is 1e-6.

You can change the values of these fields and specify the new structure in ppca using the 'Options' name-value pair argument.

Example: opt = statset('ppca'); opt.MaxIter = 2000; coeff = ppca(Y,3,'Options',opt);

Data Types: struct

Output Arguments

collapse all

Principal component coefficients, returned as a p-by-k matrix. Each column of coeff contains coefficients for one principal component. The columns are in the order of descending component variance, pcvar.

Principal component scores, returned as an n-by-k matrix. Rows of score correspond to observations, and columns correspond to components.

Principal component variances, which are the eigenvalues of the covariance matrix of Y, returned as a column vector.

Estimated mean of each variable in Y, returned as a row vector.

Isotropic residual variance, returned as a scalar value.

Final results at convergence, returned as a structure containing the following fields.

WW at convergence.
XexpConditional expectation of the estimated latent variable x.
ReconReconstructed observations using k principal components. This is a low dimension approximation of the input data Y, and is equal to mu + score*coeff'.
vResidual variance.
RMSResidRoot mean square of residuals.
NumIterNumber of iteration counts.
nloglkNegative loglikelihood function value.

More About

collapse all

Probabilistic Principal Component Analysis

Probabilistic principal component analysis (PPCA) is a method to estimate the principal axes when any data vector has one or more missing values.

PPCA is based on an isotropic error model. It seeks to relate a p-dimensional observation vector y to a corresponding k-dimensional vector of latent (or unobserved) variable x, which is normal with mean zero and covariance I(k). The relationship is


where y is the row vector of observed variable, x is the row vector of latent variables, and ε is the isotropic error term. ε is Gaussian with mean zero and covariance of v*I(k), where v is the residual variance. Here, k needs to be smaller than the rank for the residual variance to be greater than 0 (v>0). Standard principal component analysis, where the residual variance is zero, is the limiting case of PPCA. The observed variables, y, are conditionally independent given the values of the latent variables, x. So, the latent variables explain the correlations between the observation variables and the error explains the variability unique to a particular yi. The p-by-k matrix W relates the latent and observation variables, and the vector μ permits the model to have a nonzero mean. PPCA assumes that the values are missing at random through the data set. This means that whether a data value is missing or not does not depend on the latent variable given the observed data values.

Under this model,


There is no closed-form analytical solution for W and v, so their estimates are determined by iterative maximization of the corresponding loglikelihood using an expectation-maximization (EM) algorithm. This EM algorithm handles missing values by treating them as additional latent variables. At convergence, the columns of W spans the subspace, but they are not orthonormal. ppca obtains the orthonormal coefficients, coeff, for the components by orthogonalization of W.


[1] Tipping, M. E., and C. M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society. Series B (Statistical Methodology), Vol. 61, No.3, 1999, pp. 611–622.

[2] Roweis, S. "EM Algorithms for PCA and SPCA." In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems. Vol.10 (NIPS 1997), Cambridge, MA, USA: MIT Press, 1998, pp. 626–632.

[3] Ilin, A., and T. Raiko. "Practical Approaches to Principal Component Analysis in the Presence of Missing Values." J. Mach. Learn. Res.. Vol. 11, August, 2010, pp. 1957–2000.

Introduced in R2013a

Was this topic helpful?