Feature transformation is a group of methods that create new features (predictor variables). The methods are useful for dimension reduction when the transformed features have a descriptive power that is more easily ordered than the original features. In this case, less descriptive features can be dropped from consideration when building models.
Feature transformation methods are contrasted with the methods presented in Feature Selection, where dimension reduction is achieved by computing an optimal subset of predictive features measured in the original data.
The methods presented in this section share some common methodology. Their goals, however, are essentially different:
Nonnegative matrix factorization is used when model terms must represent nonnegative quantities, such as physical quantities.
Principal component analysis is used to summarize data in fewer dimensions, for example, to visualize it.
Factor analysis is used to build explanatory models of data correlations.
Nonnegative matrix factorization (NMF) is a dimension-reduction technique based on a low-rank approximation of the feature space. Besides providing a reduction in the number of features, NMF guarantees that the features are nonnegative, producing additive models that respect, for example, the nonnegativity of physical quantities.
Given a nonnegative m-by-n matrix X and a positive integer k < min(m,n), NMF finds nonnegative m-by-k and k-by-n matrices W and H, respectively, that minimize the norm of the difference X – WH. W and H are thus approximate nonnegative factors of X.
The k columns of W represent transformations of the variables in X; the k rows of H represent the coefficients of the linear combinations of the original n variables in X that produce the transformed variables in W. Since k is generally smaller than the rank of X, the product WH provides a compressed approximation of the data in X. A range of possible values for k is often suggested by the modeling context.
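The low-rank approximation idea can be sketched outside MATLAB as well. The following NumPy example is illustrative only: the function name nmf_mult, the toy data, and the iteration count are invented for this sketch, and it is not the nnmf implementation. It applies the classic multiplicative-update rule, which keeps both factors nonnegative throughout.

```python
import numpy as np

def nmf_mult(X, k, n_iter=200, seed=0):
    """Approximate nonnegative X (m x n) as W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        # Multiplicative updates: entries are scaled by nonnegative ratios,
        # so W and H can never become negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((20, 5))          # toy nonnegative data matrix
W, H = nmf_mult(X, 2)            # rank-two factorization
err = np.linalg.norm(X - W @ H)  # residual norm being minimized
```

Because each update multiplies by a nonnegative ratio, the factors stay nonnegative without any explicit constraint, which is what makes the resulting model additive.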
The Statistics and Machine Learning Toolbox™ function nnmf carries out nonnegative matrix factorization. nnmf uses one of two iterative algorithms that begin with random initial values for W and H. Because the norm of the residual X – WH may have local minima, repeated calls to nnmf may yield different factorizations. Sometimes the algorithm converges to a solution of lower rank than k, which may indicate that the result is not optimal.

For example, consider the five predictors of biochemical oxygen demand in the data set moore.mat.
load moore
X = moore(:,1:5);
rng('default'); % For reproducibility
The following uses nnmf to compute a rank-two approximation of X with a multiplicative update algorithm that begins from five random initial values for W and H.
opt = statset('MaxIter',10,'Display','final');
[W0,H0] = nnmf(X,2,'replicates',5,'options',opt,'algorithm','mult');
    rep    iteration    rms resid    |delta x|
      1           10      358.296    0.00190554
      2           10      78.3556    0.000351747
      3           10      230.962    0.0172839
      4           10      326.347    0.00739552
      5           10      361.547    0.00705539
Final root mean square residual = 78.3556
The 'mult' algorithm is sensitive to initial values, which makes it a good choice when using 'replicates' to find W and H from multiple random starting values.
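The replicates strategy is easy to mimic in any language: run the same iteration from several random starting points and keep the factorization with the smallest residual norm. A hedged NumPy sketch (nmf_once and the toy matrix are invented for illustration, not taken from nnmf):

```python
import numpy as np

def nmf_once(X, k, seed, n_iter=100):
    """One multiplicative-update run from a seeded random start."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H, np.linalg.norm(X - W @ H)

X = np.abs(np.random.default_rng(0).normal(size=(30, 5)))  # toy nonnegative data
runs = [nmf_once(X, 2, seed) for seed in range(5)]         # five random starts
W, H, best = min(runs, key=lambda r: r[2])                 # keep smallest residual
```

Different seeds generally reach different local minima, which is exactly why the best of several replicates is reported.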
Now perform the factorization using an alternating least-squares algorithm, which converges faster and more consistently. Run 100 times more iterations, beginning from the initial W0 and H0 identified above.
opt = statset('MaxIter',1000,'Display','final');
[W,H] = nnmf(X,2,'w0',W0,'h0',H0,'options',opt,'algorithm','als');
    rep    iteration    rms resid    |delta x|
      1            2      77.5315    0.000830334
Final root mean square residual = 77.5315
The two columns of W are the transformed predictors. The two rows of H give the relative contributions of each of the five predictors in X to the predictors in W.
H
H =
    0.0835    0.0190    0.1782    0.0072    0.9802
    0.0559    0.0250    0.9969    0.0085    0.0497
The fifth predictor in X (weight 0.9802) strongly influences the first predictor in W. The third predictor in X (weight 0.9969) strongly influences the second predictor in W.
Visualize the relative contributions of the predictors in X with a biplot, showing the data and original variables in the column space of W.
biplot(H','scores',W,'varlabels',{'','','v3','','v5'});
axis([0 1.1 0 1.1])
xlabel('Column 1')
ylabel('Column 2')
One of the difficulties inherent in multivariate statistics is the problem of visualizing data that has many variables. The MATLAB® function plot displays a graph of the relationship between two variables. The plot3 and surf commands display different three-dimensional views. But when there are more than three variables, it is more difficult to visualize their relationships.
Fortunately, in data sets with many variables, groups of variables often move together. One reason for this is that more than one variable might be measuring the same driving principle governing the behavior of the system. In many systems there are only a few such driving forces. But an abundance of instrumentation enables you to measure dozens of system variables. When this happens, you can take advantage of this redundancy of information. You can simplify the problem by replacing a group of variables with a single new variable.
Principal component analysis is a quantitatively rigorous method for achieving this simplification. The method generates a new set of variables, called principal components. Each principal component is a linear combination of the original variables. All the principal components are orthogonal to each other, so there is no redundant information. The principal components as a whole form an orthogonal basis for the space of the data.
There are an infinite number of ways to construct an orthogonal basis for several columns of data. What is so special about the principal component basis?
The first principal component is a single axis in space. When you project each observation on that axis, the resulting values form a new variable. And the variance of this variable is the maximum among all possible choices of the first axis.
The second principal component is another axis in space, perpendicular to the first. Projecting the observations on this axis generates another new variable. The variance of this variable is the maximum among all possible choices of this second axis.
The full set of principal components is as large as the original set of variables. But it is commonplace for the sum of the variances of the first few principal components to exceed 80% of the total variance of the original data. By examining plots of these few new variables, researchers often develop a deeper understanding of the driving forces that generated the original data.
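The construction described above can be made concrete with a short NumPy sketch (illustrative, not Statistics and Machine Learning Toolbox code): the principal component axes are the right singular vectors of the centered data, and the two properties just stated, decreasing variances and mutually orthogonal axes, can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with correlated columns (the mixing matrix is arbitrary).
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

Xc = X - X.mean(axis=0)                      # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                           # coordinates in the PC basis
var = scores.var(axis=0, ddof=1)             # variance along each component
# var is nonincreasing, and the rows of Vt form an orthonormal basis.
```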
You can use the function pca to find the principal components. To use pca, you need to have the actual measured data you want to analyze. However, if you lack the actual data but have the sample covariance or correlation matrix for the data, you can still use the function pcacov to perform a principal components analysis. See the reference page for pcacov for a description of its inputs and outputs.
This example shows how to perform a weighted principal components analysis and interpret the results.
Load sample data.
Load the sample data. The data includes ratings for 9 different indicators of the quality of life in 329 U.S. cities. These are climate, housing, health, crime, transportation, education, arts, recreation, and economics. For each category, a higher rating is better. For example, a higher rating for crime means a lower crime rate.
Display the categories variable.
load cities
categories
categories =
climate
housing
health
crime
transportation
education
arts
recreation
economics
In total, the cities data set contains three variables:

categories, a string matrix containing the names of the indices

names, a string matrix containing the 329 city names

ratings, the data matrix with 329 rows and 9 columns
Plot data.
Make a boxplot to look at the distribution of the ratings data.
figure()
boxplot(ratings,'orientation','horizontal','labels',categories)
There is more variability in the ratings of the arts and housing than in the ratings of crime and climate.
Check pairwise correlation.
Check the pairwise correlation between the variables.
C = corr(ratings,ratings);
The correlation among some variables is as high as 0.85. Principal components analysis constructs uncorrelated new variables that are linear combinations of the original variables.
Compute principal components.
When all variables are in the same unit, it is appropriate to compute principal components for raw data. When the variables are in different units or the difference in the variance of different columns is substantial (as in this case), scaling of the data or use of weights is often preferable.
Perform the principal component analysis by using the inverse variances of the ratings as weights.
w = 1./var(ratings);
[wcoeff,score,latent,tsquared,explained] = pca(ratings,...
    'VariableWeights',w);
Or equivalently:
[wcoeff,score,latent,tsquared,explained] = pca(ratings,...
    'VariableWeights','variance');
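The equivalence these two calls rely on, that inverse-variance weighting amounts to running PCA on standardized (z-scored) data, can be sketched in NumPy. The toy data below is invented for illustration; the point is only that after z-scoring, the component variances sum to the number of variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three variables on wildly different scales, like the ratings columns.
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

# z-scoring is equivalent to weighting by inverse variances.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
latent = s**2 / (len(X) - 1)          # variance of each principal component
explained = 100 * latent / latent.sum()  # percent of total variance
```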
The following sections explain the five outputs of pca.
Component coefficients.
The first output, wcoeff, contains the coefficients of the principal components. The first three principal component coefficient vectors are:
c3 = wcoeff(:,1:3)
c3 =
   1.0e+03 *
    0.0249   -0.0263   -0.0834
    0.8504   -0.5978   -0.4965
    0.4616    0.3004   -0.0073
    0.1005   -0.1269    0.0661
    0.5096    0.2606    0.2124
    0.0883    0.1551    0.0737
    2.1496    0.9043   -0.1229
    0.2649   -0.3106   -0.0411
    0.1469   -0.5111    0.6586
These coefficients are weighted, hence the coefficient matrix is not orthonormal.
Transform coefficients.
Transform the coefficients so that they are orthonormal.
coefforth = inv(diag(std(ratings)))*wcoeff;
Note that if you use a weights vector, w, while conducting the pca, then
coefforth = diag(sqrt(w))*wcoeff;
Check coefficients are orthonormal.
The transformed coefficients are now orthonormal.
c3 = coefforth(:,1:3);
I = c3'*c3
I =
    1.0000   -0.0000   -0.0000
   -0.0000    1.0000   -0.0000
   -0.0000   -0.0000    1.0000
Component scores.
The second output, score, contains the coordinates of the original data in the new coordinate system defined by the principal components. The score matrix is the same size as the input data matrix. You can also obtain the component scores using the orthonormal coefficients and the standardized ratings as follows.
cscores = zscore(ratings)*coefforth;
cscores and score are identical matrices.
Plot component scores.
Create a plot of the first two columns of score.
figure()
plot(score(:,1),score(:,2),'+')
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
This plot shows the centered and scaled ratings data projected onto the first two principal components. pca computes the scores to have mean zero.
Explore plot interactively.
Note the outlying points in the right half of the plot. You can graphically identify these points as follows.
gname
Move your cursor over the plot and click once near the rightmost seven points. This labels the points by their row numbers as in the following figure.
After labeling points, press Return.
Extract observation names.
Create an index variable containing the row numbers of all the cities you chose and get the names of the cities.
metro = [43 65 179 213 234 270 314];
names(metro,:)
ans =
Boston, MA
Chicago, IL
Los Angeles, Long Beach, CA
New York, NY
Philadelphia, PA-NJ
San Francisco, CA
Washington, DC-MD-VA
These labeled cities are some of the biggest population centers in the United States and they appear more extreme than the remainder of the data.
Component variances.
The third output, latent, is a vector containing the variance explained by the corresponding principal component. Each column of score has a sample variance equal to the corresponding element of latent.
latent
latent =
    3.4083
    1.2140
    1.1415
    0.9209
    0.7533
    0.6306
    0.4930
    0.3180
    0.1204
Percent variance explained.
The fifth output, explained, is a vector containing the percent variance explained by the corresponding principal component.
explained
explained =
   37.8699
   13.4886
   12.6831
   10.2324
    8.3698
    7.0062
    5.4783
    3.5338
    1.3378
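The relationship between latent and explained is simple arithmetic: each percentage is the corresponding variance divided by the total. Reproducing it with the values shown above (a NumPy check, for illustration):

```python
import numpy as np

# Variances of the nine principal components, copied from the latent output.
latent = np.array([3.4083, 1.2140, 1.1415, 0.9209, 0.7533,
                   0.6306, 0.4930, 0.3180, 0.1204])
explained = 100 * latent / latent.sum()  # percent variance explained
# With inverse-variance weights the total variance equals the number
# of variables, 9, so the first entry is 100 * 3.4083 / 9.
```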
Create scree plot.
Make a scree plot of the percent variability explained by each principal component.
figure()
pareto(explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')
This scree plot only shows the first seven (instead of the total nine) components that explain 95% of the total variance. The only clear break in the amount of variance accounted for by each component is between the first and second components. However, the first component by itself explains less than 40% of the variance, so more components might be needed. You can see that the first three principal components explain roughly two-thirds of the total variability in the standardized ratings, so that might be a reasonable way to reduce the dimensions.
Hotelling's T-squared statistic.
The last output from pca is tsquared, which is Hotelling's T², a statistical measure of the multivariate distance of each observation from the center of the data set. This is an analytical way to find the most extreme points in the data.
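The T² statistic can be reconstructed from the component scores: for each observation it is the sum over components of the squared score divided by that component's variance (a squared Mahalanobis distance from the center). A NumPy sketch on invented data, not the cities ratings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                 # toy data set
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                           # principal component scores
latent = s**2 / (len(X) - 1)                 # component variances

t2 = (scores**2 / latent).sum(axis=1)        # one T-squared value per observation
most_extreme = int(np.argmax(t2))            # analog of sorting tsquared
```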
[st2,index] = sort(tsquared,'descend'); % sort in descending order
extreme = index(1);
names(extreme,:)
ans =
New York, NY
The ratings for New York are the furthest from the average U.S. city.
Visualize the results.
Visualize both the orthonormal principal component coefficients for each variable and the principal component scores for each observation in a single plot.
biplot(coefforth(:,1:2),'scores',score(:,1:2),'varlabels',categories);
axis([-.26 0.6 -.51 .51]);
All nine variables are represented in this bi-plot by a vector, and the direction and length of the vector indicate how each variable contributes to the two principal components in the plot. For example, the first principal component, on the horizontal axis, has positive coefficients for all nine variables. That is why the nine vectors are directed into the right half of the plot. The largest coefficients in the first principal component are the third and seventh elements, corresponding to the variables health and arts.
The second principal component, on the vertical axis, has positive coefficients for the variables education, health, arts, and transportation, and negative coefficients for the remaining five variables. This indicates that the second component distinguishes among cities that have high values for the first set of variables and low for the second, and cities that have the opposite.
The variable labels in this figure are somewhat crowded. You can either exclude the VarLabels parameter when making the plot, or select and drag some of the labels to better positions using the Edit Plot tool from the figure window toolbar.
This 2-D bi-plot also includes a point for each of the 329 observations, with coordinates indicating the score of each observation for the two principal components in the plot. For example, points near the left edge of this plot have the lowest scores for the first principal component. The points are scaled with respect to the maximum score value and maximum coefficient length, so only their relative locations can be determined from the plot.
You can identify items in the plot by selecting Tools > Data Cursor from the figure window. By clicking a variable (vector), you can read that variable's coefficients for each principal component. By clicking an observation (point), you can read that observation's scores for each principal component.
Create a three-dimensional bi-plot.
You can also make a bi-plot in three dimensions.
figure()
biplot(coefforth(:,1:3),'scores',score(:,1:3),'obslabels',names);
axis([-.26 0.8 -.51 .51 -.61 .81]);
view([30 40]);
This graph is useful if the first two principal coordinates do not explain enough of the variance in your data. You can also rotate the figure to see it from different angles by selecting Tools > Rotate 3D.
Multivariate data often includes a large number of measured variables, and sometimes those variables overlap, in the sense that groups of them might be dependent. For example, in a decathlon, each athlete competes in 10 events, but several of them can be thought of as speed events, while others can be thought of as strength events, etc. Thus, you can think of a competitor's 10 event scores as largely dependent on a smaller set of three or four types of athletic ability.
Factor analysis is a way to fit a model to multivariate data to estimate just this sort of interdependence. In a factor analysis model, the measured variables depend on a smaller number of unobserved (latent) factors. Because each factor might affect several variables in common, they are known as common factors. Each variable is assumed to be dependent on a linear combination of the common factors, and the coefficients are known as loadings. Each measured variable also includes a component due to independent random variability, known as specific variance because it is specific to one variable.
Specifically, factor analysis assumes that the covariance matrix of your data is of the form
$$\Sigma_{x} = \Lambda \Lambda^{\mathrm{T}} + \Psi$$

where Λ is the matrix of loadings, and the elements of the diagonal matrix Ψ are the specific variances.
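The covariance structure in this equation can be checked by simulation: if data are generated as X = FΛ^T + E with independent standard-normal factors F and specific errors E with covariance Ψ, the sample covariance of X should approach ΛΛ^T + Ψ. A NumPy sketch with made-up loadings (not a factoran fit):

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.normal(size=(6, 2))                  # loadings: 6 variables, 2 factors
Psi = np.diag(rng.uniform(0.1, 0.5, size=6))   # diagonal specific variances

F = rng.normal(size=(20000, 2))                # common factor scores
E = rng.normal(size=(20000, 6)) @ np.sqrt(Psi) # specific (unique) variation
X = F @ Lam.T + E                              # data from the factor model

Sigma = Lam @ Lam.T + Psi                      # model-implied covariance
S = np.cov(X, rowvar=False)                    # sample covariance
err = np.abs(S - Sigma).max()                  # should shrink as n grows
```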
The function factoran fits the factor analysis model using maximum likelihood.
Factor Loadings. Over the course of 100 weeks, the percent change in stock prices for ten companies has been recorded. Of the ten companies, the first four can be classified as primarily technology, the next three as financial, and the last three as retail. It seems reasonable that the stock prices for companies that are in the same sector might vary together as economic conditions change. Factor Analysis can provide quantitative evidence that companies within each sector do experience similar week-to-week changes in stock price.
In this example, you first load the data, and then call factoran, specifying a model fit with three common factors. By default, factoran computes rotated estimates of the loadings to try to make their interpretation simpler. But in this example, you specify an unrotated solution.
load stockreturns
[Loadings,specificVar,T,stats] = factoran(stocks,3,'rotate','none');
The first two factoran return arguments are the estimated loadings and the estimated specific variances. Each row of the loadings matrix represents one of the ten stocks, and each column corresponds to a common factor. With unrotated estimates, interpretation of the factors in this fit is difficult because most of the stocks contain fairly large coefficients for two or more factors.
Loadings
Loadings =
    0.8885    0.2367   -0.2354
    0.7126    0.3862    0.0034
    0.3351    0.2784   -0.0211
    0.3088    0.1113   -0.1905
    0.6277   -0.6643    0.1478
    0.4726   -0.6383    0.0133
    0.1133   -0.5416    0.0322
    0.6403    0.1669    0.4960
    0.2363    0.5293    0.5770
    0.1105    0.1680    0.5524
Note: Factor rotation helps to simplify the structure in the loadings matrix, and is discussed in Factor Rotation below.
From the estimated specific variances, you can see that the model indicates that a particular stock price varies quite a lot beyond the variation due to the common factors.
specificVar
specificVar =
    0.0991
    0.3431
    0.8097
    0.8559
    0.1429
    0.3691
    0.6928
    0.3162
    0.3311
    0.6544
A specific variance of 1 would indicate that there is no common factor component in that variable, while a specific variance of 0 would indicate that the variable is entirely determined by common factors. These data seem to fall somewhere in between.
The p value returned in the stats structure fails to reject the null hypothesis of three common factors, suggesting that this model provides a satisfactory explanation of the covariation in these data.
stats.p
ans =
    0.8144
To determine whether fewer than three factors can provide an acceptable fit, you can try a model with two common factors. The p value for this second fit is highly significant, and rejects the hypothesis of two factors, indicating that the simpler model is not sufficient to explain the pattern in these data.
[Loadings2,specificVar2,T2,stats2] = factoran(stocks,2,'rotate','none');
stats2.p
ans =
   3.5610e-06
Factor Rotation. As the results illustrate, the estimated loadings from an unrotated factor analysis fit can have a complicated structure. The goal of factor rotation is to find a parameterization in which each variable has only a small number of large loadings. That is, each variable is affected by a small number of factors, preferably only one. This can often make it easier to interpret what the factors represent.
If you think of each row of the loadings matrix as coordinates of a point in M-dimensional space, then each factor corresponds to a coordinate axis. Factor rotation is equivalent to rotating those axes and computing new loadings in the rotated coordinate system. There are various ways to do this. Some methods leave the axes orthogonal, while others are oblique methods that change the angles between them. For this example, you can rotate the estimated loadings by using the promax criterion, a common oblique method.
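Promax itself is somewhat involved, but the orthogonal varimax criterion it builds on fits in a few lines and shows the core idea: rotate the loadings so each row concentrates on one factor, while an orthogonal rotation leaves ΛΛ^T unchanged. The following NumPy sketch implements a standard varimax iteration; the small loadings matrix is invented for illustration.

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a p-by-k loadings matrix."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        # Gradient-style update of the varimax criterion via an SVD.
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - Lr @ np.diag((Lr**2).sum(axis=0)) / p))
        R = u @ vt                     # nearest orthogonal rotation
        d_new = s.sum()
        if d_new < d * (1 + tol):      # stop when the criterion stalls
            break
        d = d_new
    return L @ R, R

# Toy unrotated loadings: two groups of variables, mixed across factors.
L = np.array([[0.89, 0.24], [0.71, 0.39], [0.11, 0.90], [0.09, 0.77]])
Lrot, R = varimax(L)
```

Because R is orthogonal, the rotated loadings reproduce the same common-factor covariance contribution; only the description in terms of individual factors becomes simpler. Oblique methods such as promax relax the orthogonality of R.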
[LoadingsPM,specVarPM] = factoran(stocks,3,'rotate','promax');
LoadingsPM
LoadingsPM =
    0.9452    0.1214   -0.0617
    0.7064   -0.0178    0.2058
    0.3885   -0.0994    0.0975
    0.4162   -0.0148   -0.1298
    0.1021    0.9019    0.0768
    0.0873    0.7709   -0.0821
   -0.1616    0.5320   -0.0888
    0.2169    0.2844    0.6635
    0.0016   -0.1881    0.7849
   -0.2289    0.0636    0.6475
Promax rotation creates a simpler structure in the loadings, one in which most of the stocks have a large loading on only one factor. To see this structure more clearly, you can use the biplot function to plot each stock using its factor loadings as coordinates.
biplot(LoadingsPM,'varlabels',num2str((1:10)'));
axis square
view(155,27);
This plot shows that promax has rotated the factor loadings to a simpler structure. Each stock depends primarily on only one factor, and it is possible to describe each factor in terms of the stocks that it affects. Based on which companies are near which axes, you could reasonably conclude that the first factor axis represents the financial sector, the second retail, and the third technology. The original conjecture, that stocks vary primarily within sector, is apparently supported by the data.
Factor Scores. Sometimes, it is useful to be able to classify an observation based on its factor scores. For example, if you accepted the three-factor model and the interpretation of the rotated factors, you might want to categorize each week in terms of how favorable it was for each of the three stock sectors, based on the data from the 10 observed stocks.
Because the data in this example are the raw stock price changes, and not just their correlation matrix, you can have factoran return estimates of the value of each of the three rotated common factors for each week. You can then plot the estimated scores to see how the different stock sectors were affected during each week.
[LoadingsPM,specVarPM,TPM,stats,F] = factoran(stocks,3,'rotate','promax');
plot3(F(:,1),F(:,2),F(:,3),'b.')
line([-4 4 NaN 0 0 NaN 0 0],[0 0 NaN -4 4 NaN 0 0],[0 0 NaN 0 0 NaN -4 4],'Color','black')
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
grid on
axis square
view(-22.5, 8)
Oblique rotation often creates factors that are correlated. This plot shows some evidence of correlation between the first and third factors, and you can investigate further by computing the estimated factor correlation matrix.
inv(TPM'*TPM)
ans =
    1.0000    0.1559    0.4082
    0.1559    1.0000   -0.0559
    0.4082   -0.0559    1.0000
Visualize the Results. You can use the biplot function to help visualize both the factor loadings for each variable and the factor scores for each observation in a single plot. For example, the following command plots the results from the factor analysis on the stock data and labels each of the 10 stocks.
biplot(LoadingsPM,'scores',F,'varlabels',num2str((1:10)'))
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
axis square
view(155,27)
In this case, the factor analysis includes three factors, and so the biplot is three-dimensional. Each of the 10 stocks is represented in this plot by a vector, and the direction and length of the vector indicates how each stock depends on the underlying factors. For example, you have seen that after promax rotation, the first four stocks have positive loadings on the first factor, and unimportant loadings on the other two factors. That first factor, interpreted as a financial sector effect, is represented in this biplot as one of the horizontal axes. The dependence of those four stocks on that factor corresponds to the four vectors directed approximately along that axis. Similarly, the dependence of stocks 5, 6, and 7 primarily on the second factor, interpreted as a retail sector effect, is represented by vectors directed approximately along that axis.
Each of the 100 observations is represented in this plot by a point, and their locations indicate the score of each observation for the three factors. For example, points near the top of this plot have the highest scores for the technology sector factor. The points are scaled to fit within the unit square, so only their relative locations can be determined from the plot.
You can use the Data Cursor tool from the Tools menu in the figure window to identify the items in this plot. By clicking a stock (vector), you can read off that stock's loadings for each factor. By clicking an observation (point), you can read off that observation's scores for each factor.