ppca

Probabilistic principal component analysis

Syntax

  • [coeff,score,pcvar] = ppca(Y,K) example
  • [coeff,score,pcvar] = ppca(Y,K,Name,Value) example
  • [coeff,score,pcvar,mu] = ppca(___) example
  • [coeff,score,pcvar,mu,v,S] = ppca(___) example

Description

example

[coeff,score,pcvar] = ppca(Y,K) returns the principal component coefficients for the n-by-p data matrix Y based on a probabilistic principal component analysis (PPCA). It also returns the principal component scores, which are the representations of Y in the principal component space, and the principal component variances, which are the eigenvalues of the covariance matrix of Y, in pcvar.

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. Rows of score correspond to observations, and columns correspond to components. Rows of Y correspond to observations and columns correspond to variables.

Probabilistic principal component analysis might be preferable to other algorithms that handle missing data, such as the alternating least squares algorithm when any data vector has one or more missing values. It assumes that the values are missing at random through the data set. An expectation-maximization algorithm is used for both complete and missing data.

example

[coeff,score,pcvar] = ppca(Y,K,Name,Value) returns the principal component coefficients, scores, and variances using additional options for computation and handling of special data types, specified by one or more Name,Value pair arguments.

For example, you can introduce initial values for the residual variance, v, or change the termination criteria.

example

[coeff,score,pcvar,mu] = ppca(___) also returns the estimated mean of each variable in Y. You can use any of the input arguments in the previous syntaxes.

example

[coeff,score,pcvar,mu,v,S] = ppca(___) also returns the isotropic residual variance in v and the final results at convergence in structure S.

Examples

expand all

Perform Probabilistic Principal Component Analysis

Load the sample data.

load fisheriris

The double matrix meas consists of four types of measurements on the flowers, which, respectively, are the length and width of sepals and petals.

Introduce missing values randomly.

y = meas;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.20;
y(ix) = NaN;

Now, approximately 20% of the data is missing, indicated by NaN.

Perform probabilistic principal component analysis and request the component coefficients and variances.

[coeff,score,pcvar,mu] = ppca(y,3);
coeff
coeff =

    0.3562    0.6709   -0.5518
   -0.0765    0.7121    0.6332
    0.8592   -0.1596    0.0596
    0.3592   -0.1318    0.5395
pcvar
pcvar =

   4.0912
   0.2126
   0.0617

Perform principal component analysis using the alternating least squares algorithm and request the component coefficients and variances.

[coeff2,score2,pcvar2,mu2] = pca(y,'algorithm','als',...
'NumComponents',3);
coeff2
coeff2 =

    0.3376    0.4955    0.7404
   -0.0731    0.8607   -0.4479
    0.8657   -0.1169   -0.1231
    0.3623   -0.0087   -0.4859
pcvar2
pcvar2 =

    4.0734
    0.2651
    0.1221

The coefficients and the variances of the first two principal components are similar.

Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors.

subspace(coeff,coeff2)
ans =

     0.0881

The angle between the two spaces is pretty small. This indicates that these two results are close to each other.

Change the Termination Criteria for Probabilistic Principal Component Analysis

Load the sample data set.

load imports-85

Data matrix X has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg. The variables bore and stroke are missing four values in rows 56 to 59, and the variables horsepower and peak-rpm are missing two values in rows 131 and 132.

Perform probabilistic principal component analysis and display the first three principal components.

[coeff,score,pcvar] = ppca(X(:,3:15),3);
Warning: Maximum number of iterations 1000 reached'. 
> In ppca at 249 

Change the termination tolerance for the cost function to 0.01.

opt = statset('ppca');
opt.TolFun = 0.01;

Perform probabilistic principal component analysis.

[coeff,score,pcvar] = ppca(X(:,3:15),3,'Options',opt);

ppca now terminates before the maximum number of iterations is reached because it meets the tolerance for the cost function.

Reconstruct Observations

Load the sample data.

load hald
y = ingredients;

The ingredients data has 13 observations for 4 variables.

Introduce missing values to the data.

y(2:16:end) = NaN;

Every 16th value is NaN. This corresponds to 7.69% of the data.

Find the first three principal components of data using PPCA and display the reconstructed observations.

[coeff,score,pcvar,mu,v,S] = ppca(y,3);
S.Recon
ans =

    6.8533   25.8675    5.8388   59.8755
    1.0431   28.9690   14.9652   51.9758
   11.5770   56.5080    8.6352   20.5062
   11.0833   31.0707    8.0920   47.0764
    7.0684   52.2539    6.0753   33.0597
   11.0486   55.0442    9.0534   22.0410
    2.8494   70.8719   16.8338    5.8624
    1.0331   31.0267   19.6906   44.0321
    2.0401   54.0364   18.0440   22.0337
   20.7823   46.8096    3.7603   25.8075
    0.9540   39.9590   22.9495   31.1540
   10.8251   65.8498    8.8072   11.8420
    9.9174   67.9309    7.9087   11.9230

You can also reconstruct the observations using the principal components and the estimated mean.

t = score*coeff' + repmat(mu,13,1);

Results at Convergence

Load the sample data.

load hald

Here, ingredients is a real-valued matrix of predictor variables.

Perform the probabilistic principal components analysis and display coefficients.

[coeff,score,pcvariance,mu,v,S] = ppca(ingredients,3);
coeff
coeff =

   -0.0693   -0.6459    0.5673
   -0.6786   -0.0184   -0.5440
    0.0308    0.7552    0.4036
    0.7306   -0.1102   -0.4684

Display the algorithm results at convergence of the PPCA.

S
S = 

           W: [4x3 double]
        Xexp: [13x3 double]
       Recon: [13x4 double]
           v: 0.2372
     NumIter: 1000
    RMSResid: 0.2340
      nloglk: 149.3388

Display the matrix W.

S.W
ans =

    0.5624    2.0279    5.4075
    4.8320  -10.3894    5.9202
   -3.7521   -3.0555   -4.1552
   -1.5144   11.7122   -7.2564

Orthogonalizing W recovers the coefficients.

orth(S.W)
ans =

   -0.0693    0.6459    0.5673
   -0.6786    0.0184   -0.5440
    0.0308   -0.7552    0.4036
    0.7306    0.1102   -0.4684

Input Arguments

expand all

Y — Input datan-by-p matrix

Input data for which to compute the principal components, specified as an n-by-p matrix. Rows of Y correspond to observations and columns correspond to variables.

Data Types: single | double

K — Number of principal componentspositive integer value less than rank

Number of principal components to return, specified as an integer value less than the rank of data. The maximum possible rank is min(n,p), where n is the number of observations and p is the number of variables. However, if the data is correlated, the rank might be smaller than min(n,p).

ppca orders the components based on their variance.

If K is min(n,p), ppca sets K equal to min(n,p) – 1, and 'W0' is truncated to min(p,n) – 1 columns if you specify a p-by-p W0 matrix.

For example, you can request only the first three components, based on the component variance as follows.

Example: coeff = ppca(Y,3)

Data Types: single | double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'W0',init,'Options',opt specifies that the initial values for 'W0' are in matrix init and ppca uses the options defined by opt.

'W0' — Initial value of Wmatrix of random values (default) | p-by-k matrix

Initial value of W in the probabilistic principal component analysis algorithm, specified as a comma-separated pair consisting of 'W0' and a p-by-k matrix.

Data Types: single | double

'v0' — Initial value of residual variancerandom number (default) | positive scalar value

Initial value of residual variance, specified as the comma-separated pair consisting of 'v0' and a positive scalar value.

Data Types: single | double

'Options' — Options for iterationsstructure

Options for the iterations, specified as a comma-separated pair 'Options' and a structure created by the statset function. ppca uses the following fields in the options structure.

'Display'Level of display output. Choices are 'off', 'final', and 'iter'.
'MaxIter'Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching the MaxIter value is regarded as convergence.
'TolFun'Positive integer stating the termination tolerance for the cost function. The default is 1e-6.
'TolX'Positive integer stating the convergence threshold for the relative change in the elements of W. The default is 1e-6.

You can change the values of these fields and specify the new structure in ppca using the 'Options' name-value pair argument.

Example: opt = statset('ppca'); opt.MaxIter = 2000; coeff = ppca(Y,3,'Options',opt);

Data Types: struct

Output Arguments

expand all

coeff — Principal component coefficientsp-by-k matrix

Principal component coefficients, returned as a p-by-k matrix. Each column of coeff contains coefficients for one principal component. The columns are in the order of descending component variance, pcvar.

score — Principal component scoresn-by-k matrix

Principal component scores, returned as an n-by-k matrix. Rows of score correspond to observations, and columns correspond to components.

pcvar — Principal component variancescolumn vector

Principal component variances, which are the eigenvalues of the covariance matrix of Y, returned as a column vector.

mu — Estimated meanrow vector

Estimated mean of each variable in Y, returned as a row vector.

v — Isotropic residual variancescalar value

Isotropic residual variance, returned as a scalar value.

S — Final results at convergencestructure

Final results at convergence, returned as a structure containing the following fields.

WW at convergence.
XexpConditional expectation of the estimated latent variable x.
ReconReconstructed observations using k principal components. This is a low dimension approximation of the input data Y, and is equal to mu + score*coeff'.
vResidual variance.
RMSResidRoot mean square of residuals.
NumIterNumber of iteration counts.
nloglkNegative loglikelihood function value.

More About

expand all

Probabilistic Principal Component Analysis

Probabilistic principal component analysis (PPCA) is a method to estimate the principal axes when any data vector has one or more missing values.

PPCA is based on an isotropic error model. It seeks to relate a p-dimensional observation vector y to a corresponding k-dimensional vector of latent (or unobserved) variable x, which is normal with mean zero and covariance I(k). The relationship is

yT=WxT+μ+ε,

where y is the row vector of observed variable, x is the row vector of latent variables, and ε is the isotropic error term. ε is Gaussian with mean zero and covariance of v*I(k), where v is the residual variance. Here, k needs to be smaller than the rank for the residual variance to be greater than 0 (v>0). Standard principal component analysis, where the residual variance is zero, is the limiting case of PPCA. The observed variables, y, are conditionally independent given the values of the latent variables, x. So, the latent variables explain the correlations between the observation variables and the error explains the variability unique to a particular yi. The p-by-k matrix W relates the latent and observation variables, and the vector μ permits the model to have a nonzero mean. PPCA assumes that the values are missing at random through the data set. This means that whether a data value is missing or not does not depend on the latent variable given the observed data values.

Under this model,

y~N(μ,W*WT+v*I(k)).

There is no closed-form analytical solution for W and v, so their estimates are determined by iterative maximization of the corresponding loglikelihood using an expectation-maximization (EM) algorithm. This EM algorithm handles missing values by treating them as additional latent variables. At convergence, the columns of W spans the subspace, but they are not orthonormal. ppca obtains the orthonormal coefficients, coeff, for the components by orthogonalization of W.

References

[1] Tipping, M. E., and C. M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society. Series B (Statistical Methodology), Vol. 61, No.3, 1999, pp. 611–622.

[2] Roweis, S. "EM Algorithms for PCA and SPCA." In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems. Vol.10 (NIPS 1997), Cambridge, MA, USA: MIT Press, 1998, pp. 626–632.

[3] Ilin, A., and T. Raiko. "Practical Approaches to Principal Component Analysis in the Presence of Missing Values." J. Mach. Learn. Res.. Vol. 11, August, 2010, pp. 1957–2000.

Was this topic helpful?