Principal component analysis (PCA) on data

`princomp` will be removed in a future release. Use `pca` instead.
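As a rough migration sketch, the call below swaps a `princomp` call for `pca`. It assumes the Statistics and Machine Learning Toolbox is installed and reuses the `hald` sample data from the example on this page. The `'Economy',false` option is set on the assumption that the full *p*-by-*p* output of `princomp(X)` is wanted, because `pca` returns the economy-size result by default. The signs of individual components can differ between implementations, so compare results up to a sign flip per column.

```matlab
% Migration sketch: princomp -> pca (requires Statistics and Machine Learning Toolbox)
load hald                                  % provides the 13-by-4 matrix "ingredients"

% Old call:  [COEFF,SCORE,latent,tsquare] = princomp(ingredients);
[coeff,score,latent,tsquared] = pca(ingredients,'Economy',false);

% Same layout as before: one row of score per observation,
% one column of coeff per principal component.
disp(size(coeff))     % p-by-p
disp(size(score))     % n-by-p
```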

`[COEFF,SCORE] = princomp(X)`

`[COEFF,SCORE,latent] = princomp(X)`

`[COEFF,SCORE,latent,tsquare] = princomp(X)`

`[...] = princomp(X,'econ')`

`COEFF = princomp(X)` performs principal components analysis (PCA) on the *n*-by-*p* data matrix `X` and returns the principal component coefficients, also known as loadings. Rows of `X` correspond to observations, columns to variables. `COEFF` is a *p*-by-*p* matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance.

`princomp` centers `X` by subtracting off column means, but does not rescale the columns of `X`. To perform principal components analysis with standardized variables, that is, based on correlations, use `princomp(zscore(X))`. To perform principal components analysis directly on a covariance or correlation matrix, use `pcacov`.
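The difference between covariance-based and correlation-based analysis can be sketched with base MATLAB functions only. This illustrates the underlying computation under the stated centering convention; it is not the toolbox implementation, and the data matrix `X` and all variable names are made up for the example.

```matlab
% Illustrative data: 5 observations (rows), 3 variables (columns)
X = [2 4 1; 3 7 0; 5 8 2; 6 9 5; 9 15 4];
n = size(X,1);

% Center each column, as princomp does before the decomposition
Xc = X - repmat(mean(X,1), n, 1);

% Covariance-based PCA: eigenvalues are the component variances
[~,S,V] = svd(Xc,'econ');
coeffCov  = V;                       % p-by-p coefficients
latentCov = diag(S).^2 / (n-1);      % variances in decreasing order

% Correlation-based PCA: also rescale each column to unit standard deviation,
% which is the effect of passing zscore(X) instead of X
Z = Xc ./ repmat(std(X,0,1), n, 1);
[~,Sz,Vz] = svd(Z,'econ');
coeffCor  = Vz;
latentCor = diag(Sz).^2 / (n-1);     % sums to p, the trace of a correlation matrix
```

Because the standardized variables all have unit variance, no single variable dominates the first component simply by being measured on a larger scale.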

`[COEFF,SCORE] = princomp(X)` returns `SCORE`, the principal component scores; that is, the representation of `X` in the principal component space. Rows of `SCORE` correspond to observations, columns to components.
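A minimal sketch of how scores relate to the data, again using base functions and a made-up matrix `X`: the scores are the centered data projected onto the coefficient columns, and because those columns are orthonormal the projection can be inverted.

```matlab
% Illustrative data: 5 observations, 3 variables
X  = [2 4 1; 3 7 0; 5 8 2; 6 9 5; 9 15 4];
n  = size(X,1);
mu = mean(X,1);
Xc = X - repmat(mu, n, 1);

[~,~,coeff] = svd(Xc,'econ');        % columns are principal directions
score = Xc * coeff;                  % n-by-p scores

% coeff is orthonormal, so the original data are recovered up to round-off
Xrec   = score * coeff' + repmat(mu, n, 1);
maxErr = max(abs(Xrec(:) - X(:)));

% Keeping only the first k components gives a rank-k approximation of Xc
k = 2;
Xapprox = score(:,1:k) * coeff(:,1:k)' + repmat(mu, n, 1);
```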

`[COEFF,SCORE,latent] = princomp(X)` returns `latent`, a vector containing the eigenvalues of the covariance matrix of `X`.

`[COEFF,SCORE,latent,tsquare] = princomp(X)` returns `tsquare`, which contains Hotelling's T² statistic for each data point.

The scores are the data formed by transforming the original data into the space of the principal components. The values of the vector `latent` are the variances of the columns of `SCORE`. Hotelling's T² is a measure of the multivariate distance of each observation from the center of the data set.
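These two relationships can be checked numerically. The sketch below uses base functions and a made-up full-rank matrix `X`; it computes T² as the sum of squared scores scaled by their variances, which for full-rank data equals the squared Mahalanobis distance from the column means.

```matlab
% Illustrative data: 5 observations, 3 variables
X  = [2 4 1; 3 7 0; 5 8 2; 6 9 5; 9 15 4];
n  = size(X,1);
Xc = X - repmat(mean(X,1), n, 1);

[~,S,coeff] = svd(Xc,'econ');
score  = Xc * coeff;
latent = diag(S).^2 / (n-1);

% The variance of each score column equals the matching eigenvalue
varScore = var(score,0,1)';          % column vector, same shape as latent

% Hotelling's T^2: squared scores, each scaled by its component variance
tsquare = sum((score.^2) ./ repmat(latent', n, 1), 2);

% Same quantity written as a squared Mahalanobis distance from the mean
% (valid here because cov(X) has full rank)
d2 = sum((Xc / cov(X)) .* Xc, 2);
```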

When `n <= p`, `SCORE(:,n:p)` and `latent(n:p)` are necessarily zero, and the columns of `COEFF(:,n:p)` define directions that are orthogonal to `X`.
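The `n <= p` case can be illustrated with a small wide matrix. This is a sketch with base functions and made-up data; it forms the full *p*-by-*p* eigendecomposition of the sample covariance matrix and checks that the trailing variances and scores vanish up to round-off. Here the orthogonality is with respect to the centered data.

```matlab
% Illustrative wide data: n = 3 observations, p = 5 variables
X = [1 2 0 4 3; 2 1 5 0 1; 4 3 1 2 6];
[n,p] = size(X);
Xc = X - repmat(mean(X,1), n, 1);

% Full p-by-p eigendecomposition of the sample covariance matrix
C = (Xc' * Xc) / (n-1);
C = (C + C') / 2;                         % enforce exact symmetry
[V,D] = eig(C);
[latent,order] = sort(diag(D),'descend');
coeff = V(:,order);
score = Xc * coeff;

% Centering removes one degree of freedom, so at most n-1 variances are nonzero
tailLatent = max(abs(latent(n:p)));       % round-off sized
tailScore  = max(max(abs(score(:,n:p)))); % round-off sized
```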

`[...] = princomp(X,'econ')` returns only the elements of `latent` that are not necessarily zero, and the corresponding columns of `COEFF` and `SCORE`; that is, when `n <= p`, only the first `n-1`. This can be significantly faster when `p` is much larger than `n`.
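One way to obtain outputs of the same shape with base functions is an economy-size SVD of the centered data followed by truncation to `n-1` columns. This is an illustrative sketch with made-up data, not necessarily how `princomp` implements the option.

```matlab
% Illustrative wide data: n = 3 observations, p = 5 variables
X = [1 2 0 4 3; 2 1 5 0 1; 4 3 1 2 6];
[n,p] = size(X);
Xc = X - repmat(mean(X,1), n, 1);

% Economy-size SVD never forms the full p-by-p matrix of right singular vectors
[~,S,V] = svd(Xc,'econ');                 % V is p-by-n here

% Drop the last column: centering forces that singular value to (numerically) zero
keep   = 1:(n-1);
coeff  = V(:,keep);                       % p-by-(n-1)
score  = Xc * coeff;                      % n-by-(n-1)
latent = diag(S(keep,keep)).^2 / (n-1);   % (n-1)-by-1
```

The saving comes from working with a *p*-by-*n* factor instead of a *p*-by-*p* one, which matters when `p` is in the thousands and `n` is small.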

Compute principal components for the `ingredients` data in the Hald data set, and the variance accounted for by each component.

    load hald;
    [pc,score,latent,tsquare] = princomp(ingredients);
    pc,latent

    pc =
       -0.0678   -0.6460    0.5673    0.5062
       -0.6785   -0.0200   -0.5440    0.4933
        0.0290    0.7553    0.4036    0.5156
        0.7309   -0.1085   -0.4684    0.4844

    latent =
      517.7969
       67.4964
       12.4054
        0.2372

The following command and plot show that two components account for 98% of the variance:

    cumsum(latent)./sum(latent)

    ans =
        0.86597
        0.97886
         0.9996
              1

    biplot(pc(:,1:2),'Scores',score(:,1:2),'VarLabels',...
        {'X1' 'X2' 'X3' 'X4'})
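The same cumulative ratio can be used to pick the number of components programmatically. The sketch below reuses the `latent` values printed above; the 95% threshold is a hypothetical target chosen only for illustration.

```matlab
% Eigenvalues reported for the Hald ingredients data in the example above
latent = [517.7969; 67.4964; 12.4054; 0.2372];

explained = cumsum(latent) / sum(latent);   % cumulative fraction of variance
threshold = 0.95;                           % hypothetical target, chosen for illustration
k = find(explained >= threshold, 1);        % smallest number of components reaching it
```

With these values `k` is 2, which agrees with the statement that two components account for about 98% of the variance.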

For a more detailed example and explanation of this analysis method, see Principal Component Analysis (PCA).

