Principal component analysis of raw data
coeff = pca(X) returns the principal component coefficients, also known as loadings, for the n-by-p data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is p-by-p. Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. By default, pca centers the data and uses the singular value decomposition (SVD) algorithm.
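As a rough illustration of the default behavior described above (centering followed by SVD), the following base-MATLAB sketch computes coefficients without calling pca. The hald ingredients values are typed in by hand, the names Xc, S, V, and compVar are illustrative, and the sign of each column may differ from what pca returns.

```matlab
% hald ingredients data (13 observations, 4 variables), entered by hand
X = [ 7 26  6 60;  1 29 15 52; 11 56  8 20; 11 31  8 47;  7 52  6 33; ...
     11 55  9 22;  3 71 17  6;  1 31 22 44;  2 54 18 22; 21 47  4 26; ...
      1 40 23 34; 11 66  9 12; 10 68  8 12];

Xc = X - mean(X, 1);             % center each column (implicit expansion)
[~, S, V] = svd(Xc, 'econ');     % economy-size SVD: Xc = U*S*V'
coeff = V;                       % each column is one principal direction
compVar = diag(S).^2 / (size(X, 1) - 1);   % variance along each direction
```

Because svd returns singular values in descending order, the columns of coeff are already ordered by descending component variance.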
coeff = pca(X,Name,Value) returns any of the output arguments in the previous syntaxes using additional options for computation and handling of special data types, specified by one or more Name,Value pair arguments.
For example, you can specify the number of principal components pca returns or an algorithm other than SVD to use.
[coeff,score,latent] = pca(___) also returns the principal component scores in score and the principal component variances in latent. You can use any of the input arguments in the previous syntaxes.
Principal component scores are the representations of X in the principal component space. Rows of score correspond to observations, and columns correspond to components.
The principal component variances are the eigenvalues of the covariance matrix of X.
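The relationship between scores, variances, and the covariance eigenvalues can be checked with base MATLAB functions. This is a minimal sketch, assuming the hald ingredients values entered by hand and centered data; the names Xc, U, S, V, and eigvals are illustrative and it does not call pca.

```matlab
% hald ingredients data (13 observations, 4 variables), entered by hand
X = [ 7 26  6 60;  1 29 15 52; 11 56  8 20; 11 31  8 47;  7 52  6 33; ...
     11 55  9 22;  3 71 17  6;  1 31 22 44;  2 54 18 22; 21 47  4 26; ...
      1 40 23 34; 11 66  9 12; 10 68  8 12];
n = size(X, 1);

Xc = X - mean(X, 1);                 % centered data
[U, S, V] = svd(Xc, 'econ');
score  = U * S;                      % same as Xc * V: data in component space
latent = diag(S).^2 / (n - 1);       % component variances

% Eigenvalues of the sample covariance matrix, largest first
eigvals = sort(eig(cov(X)), 'descend');
disp([latent eigvals])               % the two columns should agree
```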
Load the sample data set.
load hald
The ingredients data has 13 observations for 4 variables.
Find the principal components for the ingredients data.
coeff = pca(ingredients)
coeff =
   -0.0678   -0.6460    0.5673    0.5062
   -0.6785   -0.0200   -0.5440    0.4933
    0.0290    0.7553    0.4036    0.5156
    0.7309   -0.1085   -0.4684    0.4844
The rows of coeff contain the coefficients for the four ingredient variables, and its columns correspond to four principal components.
Find the principal component coefficients when there are missing values in a data set.
Load the sample data set.
load imports85
Data matrix X has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg. The variables bore and stroke are missing four values in rows 56 to 59, and the variables horsepower and peak-rpm are missing two values in rows 131 and 132.
Perform principal component analysis.
coeff = pca(X(:,3:15));
By default, pca performs the action specified by the 'Rows','complete' name-value pair argument. This option removes the observations with NaN values before calculation. Rows of NaNs are reinserted into score and tsquared at the corresponding locations, namely rows 56 to 59, 131, and 132.
Use 'pairwise' to perform the principal component analysis.
coeff = pca(X(:,3:15),'Rows','pairwise');
In this case, pca computes the (i,j) element of the covariance matrix using the rows with no NaN values in the columns i or j of X.
Note that the resulting covariance matrix might not be positive definite. This option applies when the algorithm pca uses is eigenvalue decomposition. When you don't specify the algorithm, as in this example, pca sets it to 'eig'. If you require 'svd' as the algorithm, with the 'pairwise' option, then pca returns a warning message, sets the algorithm to 'eig', and continues.
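The pairwise rule can be sketched in base MATLAB. The example below is an illustration only: it uses a small matrix with NaN entries (the hald ingredients data with some values removed, an assumption made here to keep the block self-contained), and centering each pair with the means of its own usable rows is also an assumption, not a statement about how pca is implemented.

```matlab
% Small data set with missing values (hald ingredients with NaN entries)
y = [  7  26   6 NaN;   1  29  15  52; NaN NaN   8  20;  11  31 NaN  47; ...
       7  52   6  33; NaN  55 NaN NaN; NaN  71 NaN   6;   1  31 NaN  44; ...
       2 NaN NaN  22;  21  47   4  26; NaN  40  23  34;  11  66   9 NaN; ...
      10  68   8  12];
p = size(y, 2);

C = zeros(p);
for i = 1:p
    for j = i:p
        ok = ~isnan(y(:, i)) & ~isnan(y(:, j));   % rows usable for this pair
        xi = y(ok, i) - mean(y(ok, i));
        xj = y(ok, j) - mean(y(ok, j));
        C(i, j) = (xi' * xj) / (nnz(ok) - 1);
        C(j, i) = C(i, j);                        % keep C symmetric
    end
end

minEig = min(eig(C));   % a negative value means C is not positive semidefinite
```

Because each element of C can come from a different subset of rows, nothing forces the assembled matrix to be a valid covariance matrix, which is why its smallest eigenvalue is worth inspecting.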
If you use the 'Rows','all' name-value pair argument, pca terminates because this option assumes there are no missing values in the data set.
coeff = pca(X(:,3:15),'Rows','all');
Error using pca (line 180)
Raw data contains NaN missing value while 'Rows' option is set to 'all'. Consider using 'complete' or pairwise' option instead.
Use the inverse variable variances as weights while performing the principal components analysis.
Load the sample data set.
load hald
Perform the principal component analysis using the inverse of variances of the ingredients as variable weights.
[wcoeff,~,latent,~,explained] = pca(ingredients,...
    'VariableWeights','variance')
wcoeff =
   -2.7998    2.9940   -3.9736    1.4180
   -8.7743   -6.4411    4.8927    9.9863
    2.5240   -3.8749   -4.0845    1.7196
    9.1714    7.5529    3.2710   11.3273
latent =
    2.2357
    1.5761
    0.1866
    0.0016
explained =
   55.8926
   39.4017
    4.6652
    0.0406
Note that the coefficient matrix, wcoeff, is not orthonormal.
Calculate the orthonormal coefficient matrix.
coefforth = inv(diag(std(ingredients)))*wcoeff
coefforth =
   -0.4760    0.5090   -0.6755    0.2411
   -0.5639   -0.4139    0.3144    0.6418
    0.3941   -0.6050   -0.6377    0.2685
    0.5479    0.4512    0.1954    0.6767
Check orthonormality of the new coefficient matrix, coefforth.
coefforth*coefforth'
ans =
    1.0000    0.0000    0.0000    0.0000
    0.0000    1.0000    0.0000    0.0000
    0.0000    0.0000    1.0000         0
    0.0000    0.0000         0    1.0000
Find the principal components using the alternating least squares (ALS) algorithm when there are missing values in the data.
Load the sample data.
load hald
The ingredients data has 13 observations for 4 variables.
Perform principal component analysis on the complete data, using the default options, and display the component coefficients.
[coeff,score,latent,tsquared,explained] = pca(ingredients);
coeff
coeff =
   -0.0678   -0.6460    0.5673    0.5062
   -0.6785   -0.0200   -0.5440    0.4933
    0.0290    0.7553    0.4036    0.5156
    0.7309   -0.1085   -0.4684    0.4844
Introduce missing values randomly.
y = ingredients;
rng('default'); % for reproducibility
ix = random('unif',0,1,size(y))<0.30;
y(ix) = NaN
y =
     7    26     6   NaN
     1    29    15    52
   NaN   NaN     8    20
    11    31   NaN    47
     7    52     6    33
   NaN    55   NaN   NaN
   NaN    71   NaN     6
     1    31   NaN    44
     2   NaN   NaN    22
    21    47     4    26
   NaN    40    23    34
    11    66     9   NaN
    10    68     8    12
Approximately 30% of the data has missing values now, indicated by NaN.
Perform principal component analysis using the ALS algorithm and display the component coefficients.
[coeff1,score1,latent,tsquared,explained,mu1] = pca(y,...
    'algorithm','als');
coeff1
coeff1 =
    0.0362    0.8215    0.5252    0.2190
    0.6831    0.0998    0.1828    0.6999
    0.0169    0.5575    0.8215    0.1185
    0.7292    0.0657    0.1261    0.6694
Display the estimated mean.
mu1
mu1 = 8.9956 47.9088 9.0451 28.5515
Reconstruct the observed data.
t = score1*coeff1' + repmat(mu1,13,1)
t =
    7.0000   26.0000    6.0000   51.5250
    1.0000   29.0000   15.0000   52.0000
   10.7819   53.0230    8.0000   20.0000
   11.0000   31.0000   13.5500   47.0000
    7.0000   52.0000    6.0000   33.0000
   10.4818   55.0000    7.8328   17.9362
    3.0982   71.0000   11.9491    6.0000
    1.0000   31.0000    0.5161   44.0000
    2.0000   53.7914    5.7710   22.0000
   21.0000   47.0000    4.0000   26.0000
   21.5809   40.0000   23.0000   34.0000
   11.0000   66.0000    9.0000    5.7078
   10.0000   68.0000    8.0000   12.0000
The ALS algorithm estimates the missing values in the data.
Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors. Find the angle between the coefficients found for complete data and data with missing values using ALS.
subspace(coeff,coeff1)
ans = 2.2925e-16
This is a small value. It indicates that the results from pca with the 'Rows','complete' name-value pair argument when there is no missing data, and from pca with the 'algorithm','als' name-value pair argument when there is missing data, are close to each other.
Perform the principal component analysis using the 'Rows','complete' name-value pair argument and display the component coefficients.
[coeff2,score2,latent,tsquared,explained,mu2] = pca(y,...
    'Rows','complete');
coeff2
coeff2 =
    0.2054    0.8587    0.0492
    0.6694    0.3720    0.5510
    0.1474    0.3513    0.5187
    0.6986    0.0298    0.6518
In this case, pca removes the rows with missing values, and y has only four rows with no missing values. pca returns only three principal components. You cannot use the 'Rows','pairwise' option because the covariance matrix is not positive semidefinite and pca returns an error message.
Find the angle between the coefficients found for complete data and data with missing values using listwise deletion (when 'Rows','complete').
subspace(coeff(:,1:3),coeff2)
ans = 0.3576
The angle between the two spaces is substantially larger. This indicates that these two results are different.
Display the estimated mean.
mu2
mu2 = 7.8889 46.9091 9.8750 29.6000
In this case, the mean is just the sample mean of y.
Reconstruct the observed data.
score2*coeff2'
ans =
       NaN       NaN       NaN       NaN
    7.5162   18.3545    4.0968   22.0056
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
    0.5644    5.3213    3.3432    3.6040
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
   12.8315    0.1076    6.3333    3.7758
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
    1.4680   20.6342    2.9292   18.0043
This shows that deleting rows containing NaN values does not work as well as the ALS algorithm here: only the four complete rows are reconstructed. ALS is the better choice when many observations have missing values.
Find the coefficients, scores, and variances of the principal components.
Load the sample data set.
load hald
The ingredients data has 13 observations for 4 variables.
Find the principal component coefficients, scores, and variances of the components for the ingredients data.
[coeff,score,latent] = pca(ingredients)
coeff =
   -0.0678   -0.6460    0.5673    0.5062
   -0.6785   -0.0200   -0.5440    0.4933
    0.0290    0.7553    0.4036    0.5156
    0.7309   -0.1085   -0.4684    0.4844
score =
   36.8218   -6.8709   -4.5909    0.3967
   29.6073    4.6109   -2.2476   -0.3958
  -12.9818   -4.2049    0.9022   -1.1261
   23.7147   -6.6341    1.8547   -0.3786
   -0.5532   -4.4617   -6.0874    0.1424
  -10.8125   -3.6466    0.9130   -0.1350
  -32.5882    8.9798   -1.6063    0.0818
   22.6064   10.7259    3.2365    0.3243
   -9.2626    8.9854   -0.0169   -0.5437
   -3.2840  -14.1573    7.0465    0.3405
    9.2200   12.3861    3.4283    0.4352
  -25.5849   -2.7817   -0.3867    0.4468
  -26.9032   -2.9310   -2.4455    0.4116
latent =
  517.7969
   67.4964
   12.4054
    0.2372
Each column of score corresponds to one principal component. The vector, latent, stores the variances of the four principal components.
Reconstruct the centered ingredients data.
Xcentered = score*coeff'
Xcentered =
   -0.4615  -22.1538   -5.7692   30.0000
   -6.4615  -19.1538    3.2308   22.0000
    3.5385    7.8462   -3.7692  -10.0000
    3.5385  -17.1538   -3.7692   17.0000
   -0.4615    3.8462   -5.7692    3.0000
    3.5385    6.8462   -2.7692   -8.0000
   -4.4615   22.8462    5.2308  -24.0000
   -6.4615  -17.1538   10.2308   14.0000
   -5.4615    5.8462    6.2308   -8.0000
   13.5385   -1.1538   -7.7692   -4.0000
   -6.4615   -8.1538   11.2308    4.0000
    3.5385   17.8462   -2.7692  -18.0000
    2.5385   19.8462   -3.7692  -18.0000
The new data in Xcentered is the original ingredients data centered by subtracting the column means from corresponding columns.
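The reconstruction can also be checked without pca. The sketch below is an illustration under stated assumptions: the hald ingredients values are entered by hand, coefficients and scores come from a plain SVD of the centered data, and the rank-2 approximation with k = 2 is an extra step added here to show what keeping fewer components does.

```matlab
% hald ingredients data (13 observations, 4 variables), entered by hand
X = [ 7 26  6 60;  1 29 15 52; 11 56  8 20; 11 31  8 47;  7 52  6 33; ...
     11 55  9 22;  3 71 17  6;  1 31 22 44;  2 54 18 22; 21 47  4 26; ...
      1 40 23 34; 11 66  9 12; 10 68  8 12];

mu = mean(X, 1);
Xc = X - mu;
[U, S, V] = svd(Xc, 'econ');
coeff = V;
score = U * S;

Xcentered = score * coeff';        % all components: recovers the centered data
Xrebuilt  = Xcentered + mu;        % add the means back to recover X

k = 2;                             % illustrative number of components to keep
Xapprox = score(:, 1:k) * coeff(:, 1:k)' + mu;   % rank-k approximation of X
```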
Find the Hotelling's T-squared statistic values.
Load the sample data set.
load hald
The ingredients data has 13 observations for 4 variables.
Perform the principal component analysis and request the T-squared values.
[coeff,score,latent,tsquared] = pca(ingredients); tsquared
tsquared = 5.6803 3.0758 6.0002 2.6198 3.3681 0.5668 3.4818 3.9794 2.6086 7.4818 4.1830 2.2327 2.7216
Request only the first two principal components and compute the T-squared values in the reduced space of requested principal components.
[coeff,score,latent,tsquared] = pca(ingredients,'NumComponents',2);
tsquared
tsquared = 5.6803 3.0758 6.0002 2.6198 3.3681 0.5668 3.4818 3.9794 2.6086 7.4818 4.1830 2.2327 2.7216
Note that even when you specify a reduced component space, pca computes the T-squared values in the full space, using all four components.
The T-squared value in the reduced space corresponds to the Mahalanobis distance in the reduced space.
tsqreduced = mahal(score,score)
tsqreduced = 3.3179 2.0079 0.5874 1.7382 0.2955 0.4228 3.2457 2.6914 1.3619 2.9903 2.4371 1.3788 1.5251
Calculate the T-squared values in the discarded space by taking the difference of the T-squared values in the full space and Mahalanobis distance in the reduced space.
tsqdiscarded = tsquared - tsqreduced
tsqdiscarded = 2.3624 1.0679 5.4128 0.8816 3.0726 0.1440 0.2362 1.2880 1.2467 4.4915 1.7459 0.8539 1.1965
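The three T-squared quantities above can be reproduced from scores and variances alone. This is a minimal sketch, assuming the hald ingredients values entered by hand and an SVD-based decomposition in place of pca; it replaces mahal(score,score) with an explicit sum of squared standardized scores, which gives the same value here because the scores have zero mean and are uncorrelated.

```matlab
% hald ingredients data (13 observations, 4 variables), entered by hand
X = [ 7 26  6 60;  1 29 15 52; 11 56  8 20; 11 31  8 47;  7 52  6 33; ...
     11 55  9 22;  3 71 17  6;  1 31 22 44;  2 54 18 22; 21 47  4 26; ...
      1 40 23 34; 11 66  9 12; 10 68  8 12];
n = size(X, 1);

Xc = X - mean(X, 1);
[U, S, ~] = svd(Xc, 'econ');
score  = U * S;
latent = diag(S).^2 / (n - 1);

z = score ./ sqrt(latent');            % standardized scores
tsquared = sum(z.^2, 2);               % full space, all four components

k = 2;                                 % number of retained components
tsqreduced   = sum(z(:, 1:k).^2, 2);   % reduced space
tsqdiscarded = tsquared - tsqreduced;  % discarded space
```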
Find the percent variability explained by the principal components.
Load the sample data set.
load imports85
Data matrix X has 13 continuous variables in columns 3 to 15: wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg.
Find the percent variability explained by principal components of these variables.
[coeff,score,latent,tsquared,explained] = pca(X(:,3:15)); explained
explained = 64.3429 35.4484 0.1550 0.0379 0.0078 0.0048 0.0013 0.0011 0.0005 0.0002 0.0002 0.0000 0.0000
The first two components explain 99.79% of all variability.
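The 99.79% figure is the running total of the first two entries of explained. A short sketch, using the explained values printed above as a hand-entered vector; the 99% threshold and the name nKeep are illustrative choices, not part of pca.

```matlab
% Percent variance explained, copied from the output above
explained = [64.3429; 35.4484; 0.1550; 0.0379; 0.0078; 0.0048; 0.0013; ...
              0.0011;  0.0005; 0.0002; 0.0002; 0.0000; 0.0000];

cumExplained = cumsum(explained);        % cumulative percent explained
nKeep = find(cumExplained >= 99, 1);     % fewest components reaching 99%
```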
To skip any of the outputs, you can use ~ instead in the corresponding element. For example, if you don't want to get the T-squared values, specify
[coeff,score,latent,~,explained] = pca(X(:,3:15));
X — Input data
matrix
Input data for which to compute the principal components, specified as an n-by-p matrix. Rows of X correspond to observations and columns to variables.
Data Types: single | double
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Algorithm','eig','Centered',false,'Rows','all','NumComponents',3 specifies that pca uses the eigenvalue decomposition algorithm, does not center the data, uses all of the observations, and returns only the first three principal components.
'Algorithm' — Principal component algorithm
'svd' (default) | 'eig' | 'als'
Principal component algorithm that pca uses to perform the principal component analysis, specified as the comma-separated pair consisting of 'Algorithm' and one of the following.
'svd': Default. Singular value decomposition (SVD) of X.
'eig': Eigenvalue decomposition (EIG) of the covariance matrix. The EIG algorithm is faster than SVD when the number of observations, n, exceeds the number of variables, p, but is less accurate because the condition number of the covariance is the square of the condition number of X.
'als': Alternating least squares (ALS) algorithm. This algorithm finds the best rank-k approximation by factoring X into an n-by-k left factor matrix, L, and a p-by-k right factor matrix, R, where k is the number of principal components. The factorization uses an iterative method starting with random initial values. ALS is designed to better handle missing values. It is preferable to pairwise deletion ('Rows','pairwise').
Example: 'Algorithm','eig'
Data Types: char
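The alternating idea behind 'als' can be sketched in a few lines of base MATLAB. This is not the pca implementation: it assumes a fixed column mean taken over the observed values, a deterministic starting point instead of random values, k = 2, a fixed number of iterations, and pinv for each small least-squares solve. All names (Y, obs, L, R, err, Yfilled) are illustrative.

```matlab
% hald ingredients with missing entries (NaN), entered by hand
Y = [  7  26   6 NaN;   1  29  15  52; NaN NaN   8  20;  11  31 NaN  47; ...
       7  52   6  33; NaN  55 NaN NaN; NaN  71 NaN   6;   1  31 NaN  44; ...
       2 NaN NaN  22;  21  47   4  26; NaN  40  23  34;  11  66   9 NaN; ...
      10  68   8  12];
[n, p] = size(Y);
k = 2;                          % number of components (illustrative choice)
obs = ~isnan(Y);                % mask of observed entries

% Simplification: center with the mean of the observed values per column
mu = zeros(1, p);
for j = 1:p
    mu(j) = mean(Y(obs(:, j), j));
end
Yc = Y - mu;                    % NaN entries stay NaN

L = zeros(n, k);                % left factor, n-by-k
R = eye(p, k);                  % right factor, p-by-k, deterministic start
nIter = 50;
err = zeros(nIter, 1);
for it = 1:nIter
    for i = 1:n                 % fix R, least squares for each row of L
        o = obs(i, :);
        L(i, :) = (pinv(R(o, :)) * Yc(i, o)')';
    end
    for j = 1:p                 % fix L, least squares for each row of R
        o = obs(:, j);
        R(j, :) = (pinv(L(o, :)) * Yc(o, j))';
    end
    resid = Yc - L * R';
    err(it) = sum(resid(obs).^2);   % squared error on observed entries only
end
Yfilled = L * R' + mu;          % rank-k estimate, including the missing cells
```

Each half-step solves an ordinary least-squares problem that uses only the observed entries, so the error in err should not increase from one iteration to the next.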
'Centered' — Indicator for centering columns
true (default) | false
Indicator for centering the columns, specified as the comma-separated pair consisting of 'Centered' and one of these logical expressions.
true: Default. pca centers the data.
false: In this case pca does not center the data.
Example: 'Centered',false
Data Types: logical
'Economy' — Indicator for economy size output
true (default) | false
Indicator for the economy size output when the degrees of freedom, d, is smaller than the number of variables, p, specified as the comma-separated pair consisting of 'Economy' and one of these logical expressions.
true: Default. pca returns the economy size output. This option can be significantly faster when the number of variables p is much larger than d.
false: pca returns the full size output.
Note that when d < p, score(:,d+1:p) and latent(d+1:p) are necessarily zero, and the columns of coeff(:,d+1:p) define directions that are orthogonal to X.
Example: 'Economy',false
Data Types: logical
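The statement that variances beyond the first d are zero when d < p can be seen on a tiny example. This sketch uses a made-up 3-by-4 matrix and a plain SVD rather than pca; the relative tolerance 1e-10 is an arbitrary choice for deciding which variances count as zero.

```matlab
X = [1 2 3 4; 2 1 0 5; 4 4 1 1];    % n = 3 observations, p = 4 variables
[n, p] = size(X);
d = n - 1;                           % degrees of freedom for centered data

Xc = X - mean(X, 1);                 % centering removes one degree of freedom
s = svd(Xc);                         % min(n,p) singular values

latentAll = zeros(p, 1);             % full-size variance vector, padded with zeros
latentAll(1:numel(s)) = s.^2 / (n - 1);

% Count the variances that are nonzero up to rounding error
nonzero = nnz(latentAll > 1e-10 * latentAll(1));
```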
'NumComponents' — Number of components requested
number of variables (default) | scalar integer
Number of components requested, specified as the comma-separated pair consisting of 'NumComponents' and a scalar integer k satisfying 0 < k ≤ p, where p is the number of original variables in X. When specified, pca returns the first k columns of coeff and score.
Example: 'NumComponents',3
Data Types: single | double
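'Rows' — Action to take for NaN values
'complete' (default) | 'pairwise' | 'all'
Action to take for NaN values in the data matrix X, specified as the comma-separated pair consisting of 'Rows' and one of the following.
'complete': Default. Observations with NaN values are removed before calculation. Rows of NaNs are reinserted into score and tsquared at the corresponding locations.
'pairwise': This option only applies when the algorithm is 'eig'. If you don't specify the algorithm along with 'pairwise', then pca sets it to 'eig'. If you specify 'svd' as the algorithm, then pca returns a warning message, sets the algorithm to 'eig', and continues. When you specify the 'Rows','pairwise' option, pca computes the (i,j) element of the covariance matrix using the rows with no NaN values in the columns i or j of X. Note that the resulting covariance matrix might not be positive definite. In that case, pca returns an error message.
'all': This option assumes there are no missing values in the data set. pca terminates with an error if X contains NaN values.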
Example: 'Rows','pairwise'
Data Types: char
'Weights' — Observation weights
ones (default) | row vector
Observation weights, specified as the comma-separated pair consisting of 'Weights' and a vector of length n containing all positive elements.
Data Types: single | double
'VariableWeights' — Variable weights
row vector | 'variance'
Variable weights, specified as the comma-separated pair consisting of 'VariableWeights' and one of the following.
Vector of length p containing all positive elements.
The string 'variance'. The variable weights are the inverse of the sample variances.
Example: 'VariableWeights','variance'
Data Types: single | double | char
'Coeff0' — Initial value for coefficients
matrix of random values (default) | p-by-k matrix
Initial value for the coefficient matrix coeff, specified as the comma-separated pair consisting of 'Coeff0' and a p-by-k matrix, where p is the number of variables, and k is the number of principal components requested.
Note: You can use this name-value pair only when 'Algorithm' is 'als'.
Data Types: single | double
'Score0' — Initial value for scores
matrix of random values (default) | n-by-k matrix
Initial value for scores matrix score, specified as the comma-separated pair consisting of 'Score0' and an n-by-k matrix, where n is the number of observations and k is the number of principal components requested.
Note: You can use this name-value pair only when 'Algorithm' is 'als'.
Data Types: single | double
'Options' — Options for iterations
structure
Options for the iterations, specified as the comma-separated pair consisting of 'Options' and a structure created by the statset function. pca uses the following fields in the options structure.
'Display': Level of display output. Choices are 'off', 'final', and 'iter'.
'MaxIter': Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching the MaxIter value is regarded as convergence.
'TolFun': Positive number giving the termination tolerance for the cost function. The default is 1e-6.
'TolX': Positive number giving the convergence threshold for the relative change in the elements of the left and right factor matrices, L and R, in the ALS algorithm. The default is 1e-6.
Note: You can use this name-value pair only when 'Algorithm' is 'als'.
You can change the values of these fields and specify the new structure in pca using the 'Options' name-value pair argument.
Example: opt = statset('pca'); opt.MaxIter = 2000; coeff = pca(X,'Options',opt);
Data Types: struct
coeff — Principal component coefficients
matrix
Principal component coefficients, returned as a p-by-p matrix. Each column of coeff contains coefficients for one principal component. The columns are in the order of descending component variance, latent.
score — Principal component scores
matrix
Principal component scores, returned as a matrix. Rows of score correspond to observations, and columns to components.
latent — Principal component variances
column vector
Principal component variances, that is, the eigenvalues of the covariance matrix of X, returned as a column vector.
tsquared — Hotelling's T-squared statistic
column vector
Hotelling's T-squared statistic, which is the sum of squares of the standardized scores for each observation, returned as a column vector.
explained — Percentage of total variance explained
column vector
Percentage of the total variance explained by each principal component, returned as a column vector.
mu — Estimated means
row vector
Estimated means, returned as a row vector.
Hotelling's T-squared statistic is a statistical measure of the multivariate distance of each observation from the center of the data set.
Even when you request fewer components than the number of variables, pca uses all principal components to compute the T-squared statistic (computes it in the full space). If you want the T-squared statistic in the reduced or the discarded space, do one of the following:
For the T-squared statistic in the reduced space, use mahal(score,score).
For the T-squared statistic in the discarded space, first compute the T-squared statistic using [coeff,score,latent,tsquared] = pca(X,'NumComponents',k,...), compute the T-squared statistic in the reduced space using tsqreduced = mahal(score,score), and then take the difference: tsquared - tsqreduced.
The degrees of freedom, d, is equal to n - 1 if data is centered and n otherwise, where:
n is the number of rows without any NaNs if you use 'Rows','complete'.
n is the number of rows without any NaNs in the column pair that has the maximum number of rows without NaNs if you use 'Rows','pairwise'.
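Both counts can be computed directly from the NaN pattern. This is a minimal sketch, assuming the hald ingredients data with missing entries typed in by hand; treating a "column pair" as two distinct columns (ignoring i == j) is an assumption made for illustration, and the names ok, pairCounts, nComplete, and nPairwise are hypothetical.

```matlab
% hald ingredients with missing entries (NaN), entered by hand
y = [  7  26   6 NaN;   1  29  15  52; NaN NaN   8  20;  11  31 NaN  47; ...
       7  52   6  33; NaN  55 NaN NaN; NaN  71 NaN   6;   1  31 NaN  44; ...
       2 NaN NaN  22;  21  47   4  26; NaN  40  23  34;  11  66   9 NaN; ...
      10  68   8  12];
centered = true;                 % data is centered by default
ok = ~isnan(y);

% 'Rows','complete': rows with no NaN in any column
nComplete = nnz(all(ok, 2));

% 'Rows','pairwise': entry (i,j) counts rows observed in both columns i and j
pairCounts = double(ok)' * double(ok);
pairCounts(logical(eye(size(y, 2)))) = 0;   % assumption: ignore i == j
nPairwise = max(pairCounts(:));

dComplete = nComplete - centered;   % n - 1 when centered, n otherwise
dPairwise = nPairwise - centered;
```

For this matrix only four rows are complete, which gives three degrees of freedom under 'Rows','complete'.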
Note that when variable weights are used, the coefficient matrix is not orthonormal. Suppose the variable weights vector you used is called varwei, and the principal component coefficients pca returned are in wcoeff. You can then calculate the orthonormal coefficients using the transformation diag(sqrt(varwei))*wcoeff.
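The transformation can be tried out without pca. The sketch below is illustrative only: it assumes that weighted coefficients arise from an SVD of the centered data scaled by sqrt(varwei), followed by rescaling with 1./sqrt(varwei), which is an assumption about the convention and not a description of the pca source code. The hald ingredients values are entered by hand and varwei is taken as the inverse sample variances.

```matlab
% hald ingredients data (13 observations, 4 variables), entered by hand
X = [ 7 26  6 60;  1 29 15 52; 11 56  8 20; 11 31  8 47;  7 52  6 33; ...
     11 55  9 22;  3 71 17  6;  1 31 22 44;  2 54 18 22; 21 47  4 26; ...
      1 40 23 34; 11 66  9 12; 10 68  8 12];

varwei = 1 ./ var(X);                    % inverse sample variances, 1-by-p
Xc = X - mean(X, 1);
Xw = Xc .* sqrt(varwei);                 % scale each column by sqrt of its weight
[~, S, V] = svd(Xw, 'econ');
latent = diag(S).^2 / (size(X, 1) - 1);

wcoeff = diag(1 ./ sqrt(varwei)) * V;    % assumed form of the weighted coefficients
coefforth = diag(sqrt(varwei)) * wcoeff; % transformation from the text

orthoErr = norm(coefforth' * coefforth - eye(4));   % close to zero if orthonormal
```

With varwei equal to the inverse variances, sqrt(varwei) is 1./std(X), so this transformation matches inv(diag(std(ingredients)))*wcoeff from the example.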