| Statistics Toolbox™ | ![]() |
| On this page… |
|---|
Feature transformation is a group of methods that create new features (predictor variables). The methods are useful for dimension reduction when the transformed features have a descriptive power that is more easily ordered than the original features. In this case, less descriptive features can be dropped from consideration when building models.
Feature transformation methods are contrasted with the methods presented in Feature Selection, where dimension reduction is achieved by computing an optimal subset of predictive features measured in the original data.
The methods presented in this section share some common methodology. Their goals, however, are essentially different:
Nonnegative matrix factorization is used when model terms must represent nonnegative quantities, such as physical quantities.
Principal component analysis is used to summarize data in fewer dimensions, for example, to visualize it.
Factor analysis is used to build explanatory models of data correlations.
Nonnegative matrix factorization (NMF) is a dimension-reduction technique based on a low-rank approximation of the feature space. Besides providing a reduction in the number of features, NMF guarantees that the features are nonnegative, producing additive models that respect, for example, the nonnegativity of physical quantities.
Given a nonnegative m-by-n matrix X and a positive integer k < min(m,n), NMF finds nonnegative m-by-k and k-by-n matrices W and H, respectively, that minimize the norm of the difference X – WH. W and H are thus approximate nonnegative factors of X.
The k columns of W represent transformations of the variables in X; the k rows of H represent the coefficients of the linear combinations of the original n variables in X that produce the transformed variables in W. Since k is generally smaller than the rank of X, the product WH provides a compressed approximation of the data in X. A range of possible values for k is often suggested by the modeling context.
The Statistics Toolbox™ function nnmf carries out nonnegative matrix factorization. nnmf uses one of two iterative algorithms that begin with random initial values for W and H. Because the norm of the residual X – WH may have local minima, repeated calls to nnmf may yield different factorizations. Sometimes the algorithm converges to a solution of lower rank than k, which may indicate that the result is not optimal.
For example, consider the five predictors of biochemical oxygen demand in the data set moore.mat:
load moore X = moore(:,1:5);
The following uses nnmf to compute a rank-two approximation of X with a multiplicative update algorithm that begins from five random initial values for W and H:
opt = statset('MaxIter',10,'Display','final');
[W0,H0] = nnmf(X,2,'replicates',5,...
'options',opt,...
'algorithm','mult');
rep iteration rms resid |delta x|
1 10 358.296 0.00190554
2 10 78.3556 0.000351747
3 10 230.962 0.0172839
4 10 326.347 0.00739552
5 10 361.547 0.00705539
Final root mean square residual = 78.3556The 'mult' algorithm is sensitive to initial values, which makes it a good choice when using 'replicates' to find W and H from multiple random starting values.
Now perform the factorization using an alternating least-squares algorithm, which converges faster and more consistently. Run 100 times more iterations, beginning from the initial W0 and H0 identified above:
opt = statset('Maxiter',1000,'Display','final');
[W,H] = nnmf(X,2,'w0',W0,'h0',H0,...
'options',opt,...
'algorithm','als');
rep iteration rms resid |delta x|
1 3 77.5315 3.52673e-005
Final root mean square residual = 77.5315The two columns of W are the transformed predictors. The two rows of H give the relative contributions of each of the five predictors in X to the predictors in W:
H
H =
0.0835 0.0190 0.1782 0.0072 0.9802
0.0558 0.0250 0.9969 0.0085 0.0497The fifth predictor in X (weight 0.9802) strongly influences the first predictor in W. The third predictor in X (weight 0.9969) strongly influences the second predictor in W.
Visualize the relative contributions of the predictors in X with a biplot, showing the data and original variables in the column space of W:
biplot(H','scores',W,'varlabels',{'','','v3','','v5'});
axis([0 1.1 0 1.1])
xlabel('Column 1')
ylabel('Column 2')

One of the difficulties inherent in multivariate statistics is the problem of visualizing data that has many variables. The MATLAB® function plot displays a graph of the relationship between two variables. The plot3 and surf commands display different three-dimensional views. But when there are more than three variables, it is more difficult to visualize their relationships.
Fortunately, in data sets with many variables, groups of variables often move together. One reason for this is that more than one variable might be measuring the same driving principle governing the behavior of the system. In many systems there are only a few such driving forces. But an abundance of instrumentation enables you to measure dozens of system variables. When this happens, you can take advantage of this redundancy of information. You can simplify the problem by replacing a group of variables with a single new variable.
Principal component analysis is a quantitatively rigorous method for achieving this simplification. The method generates a new set of variables, called principal components. Each principal component is a linear combination of the original variables. All the principal components are orthogonal to each other, so there is no redundant information. The principal components as a whole form an orthogonal basis for the space of the data.
There are an infinite number of ways to construct an orthogonal basis for several columns of data. What is so special about the principal component basis?
The first principal component is a single axis in space. When you project each observation on that axis, the resulting values form a new variable. And the variance of this variable is the maximum among all possible choices of the first axis.
The second principal component is another axis in space, perpendicular to the first. Projecting the observations on this axis generates another new variable. The variance of this variable is the maximum among all possible choices of this second axis.
The full set of principal components is as large as the original set of variables. But it is commonplace for the sum of the variances of the first few principal components to exceed 80% of the total variance of the original data. By examining plots of these few new variables, researchers often develop a deeper understanding of the driving forces that generated the original data.
You can use the function princomp to find the principal components. To use princomp, you need to have the actual measured data you want to analyze. However, if you lack the actual data, but have the sample covariance or correlation matrix for the data, you can still use the function pcacov to perform a principal components analysis. See the reference page for pcacov for a description of its inputs and outputs.
Computing Components. Consider a sample application that uses nine different indices of the quality of life in 329 U.S. cities. These are climate, housing, health, crime, transportation, education, arts, recreation, and economics. For each index, higher is better. For example, a higher index for crime means a lower crime rate.
Start by loading the data in cities.mat.
load cities whos Name Size Bytes Class categories 9x14 252 char array names 329x43 28294 char array ratings 329x9 23688 double array
The whos command generates a table of information about all the variables in the workspace.
The cities data set contains three variables:
categories, a string matrix containing the names of the indices
names, a string matrix containing the 329 city names
ratings, the data matrix with 329 rows and 9 columns
The categories variable has the following values:
categories categories = climate housing health crime transportation education arts recreation economics
The first five rows of names are
first5 = names(1:5,:) first5 = Abilene, TX Akron, OH Albany, GA Albany-Troy, NY Albuquerque, NM
To get a quick impression of the ratings data, make a box plot.
boxplot(ratings,'orientation','horizontal','labels',categories)
This command generates the plot below. Note that there is substantially more variability in the ratings of the arts and housing than in the ratings of crime and climate.

Ordinarily you might also graph pairs of the original variables, but there are 36 two-variable plots. Perhaps principal components analysis can reduce the number of variables you need to consider.
Sometimes it makes sense to compute principal components for raw data. This is appropriate when all the variables are in the same units. Standardizing the data is often preferable when the variables are in different units or when the variance of the different columns is substantial (as in this case).
You can standardize the data by dividing each column by its standard deviation.
stdr = std(ratings); sr = ratings./repmat(stdr,329,1);
Now you are ready to find the principal components.
[coefs,scores,variances,t2] = princomp(sr);
The following sections explain the four outputs from princomp.
Component Coefficients. The first output of the princomp function, coefs, contains the coefficients of the linear combinations of the original variables that generate the principal components. The coefficients are also known as loadings.
The first three principal component coefficient vectors are:
c3 = coefs(:,1:3)
c3 =
0.2064 0.2178 -0.6900
0.3565 0.2506 -0.2082
0.4602 -0.2995 -0.0073
0.2813 0.3553 0.1851
0.3512 -0.1796 0.1464
0.2753 -0.4834 0.2297
0.4631 -0.1948 -0.0265
0.3279 0.3845 -0.0509
0.1354 0.4713 0.6073The largest coefficients in the first column (first principal component) are the third and seventh elements, corresponding to the variables health and arts. All the coefficients of the first principal component have the same sign, making it a weighted average of all the original variables.
The principal components are unit length and orthogonal:
I = c3'*c3
I =
1.0000 -0.0000 -0.0000
-0.0000 1.0000 -0.0000
-0.0000 -0.0000 1.0000Component Scores. The second output, scores, contains the coordinates of the original data in the new coordinate system defined by the principal components. This output is the same size as the input data matrix.
A plot of the first two columns of scores shows the ratings data projected onto the first two principal components. princomp computes the scores to have mean zero.
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component')
ylabel('2nd Principal Component')

Note the outlying points in the right half of the plot.
While it is possible to create a three-dimensional plot using three columns of scores, the examples in this section create two-dimensional plots, which are easier to describe.
The function gname is useful for graphically identifying a few points in a plot like this. You can call gname with a string matrix containing as many case labels as points in the plot. The string matrix names works for labeling points with the city names.
gname(names)
Move your cursor over the plot and click once near each point in the right half. As you click each point, it is labeled with the proper row from the names string matrix. Here is the plot after a few clicks:

When you are finished labeling points, press the Return key.
The labeled cities are some of the biggest population centers in the United States. They are definitely different from the remainder of the data, so perhaps they should be considered separately. To remove the labeled cities from the data, first identify their corresponding row numbers as follows:
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component');
ylabel('2nd Principal Component');Click near the points you labeled in the preceding figure. This labels the points by their row numbers, as shown in the following figure.

Then you can create an index variable containing the row numbers of all the metropolitan areas you choose.
metro = [43 65 179 213 234 270 314]; names(metro,:) ans = Boston, MA Chicago, IL Los Angeles, Long Beach, CA New York, NY Philadelphia, PA-NJ San Francisco, CA Washington, DC-MD-VA
To remove these rows from the ratings matrix, enter the following.
rsubset = ratings; nsubset = names; nsubset(metro,:) = []; rsubset(metro,:) = []; size(rsubset) ans = 322 9
Component Variances. The third output, variances, is a vector containing the variance explained by the corresponding principal component. Each column of scores has a sample variance equal to the corresponding element of variances.
variances
variances =
3.4083
1.2140
1.1415
0.9209
0.7533
0.6306
0.4930
0.3180
0.1204You can easily calculate the percent of the total variability explained by each principal component.
percent_explained = 100*variances/sum(variances)
percent_explained =
37.8699
13.4886
12.6831
10.2324
8.3698
7.0062
5.4783
3.5338
1.3378Use the pareto function to make a scree plot of the percent variability explained by each principal component.
pareto(percent_explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')

The preceding figure shows that the only clear break in the amount of variance accounted for by each component is between the first and second components. However, that component by itself explains less than 40% of the variance, so more components are probably needed. You can see that the first three principal components explain roughly two-thirds of the total variability in the standardized ratings, so that might be a reasonable way to reduce the dimensions in order to visualize the data.
Hotelling's T2. The last output of the princomp function, t2, is Hotelling's T2, a statistical measure of the multivariate distance of each observation from the center of the data set. This is an analytical way to find the most extreme points in the data.
[st2, index] = sort(t2,'descend'); % Sort in descending order. extreme = index(1) extreme = 213 names(extreme,:) ans = New York, NY
It is not surprising that the ratings for New York are the furthest from the average U.S. town.
Visualizing the Results. Use the biplot function to help visualize both the principal component coefficients for each variable and the principal component scores for each observation in a single plot. For example, the following command plots the results from the principal components analysis on the cities and labels each of the variables.
biplot(coefs(:,1:2), 'scores',scores(:,1:2),... 'varlabels',categories); axis([-.26 1 -.51 .51]);

Each of the nine variables is represented in this plot by a vector, and the direction and length of the vector indicates how each variable contributes to the two principal components in the plot. For example, you have seen that the first principal component, represented in this biplot by the horizontal axis, has positive coefficients for all nine variables. That corresponds to the nine vectors directed into the right half of the plot. You have also seen that the second principal component, represented by the vertical axis, has positive coefficients for the variables education, health, arts, and transportation, and negative coefficients for the remaining five variables. That corresponds to vectors directed into the top and bottom halves of the plot, respectively. This indicates that this component distinguishes between cities that have high values for the first set of variables and low for the second, and cities that have the opposite.
The variable labels in this figure are somewhat crowded. You could either leave out the VarLabels parameter when making the plot, or simply select and drag some of the labels to better positions using the Edit Plot tool from the figure window toolbar.
Each of the 329 observations is represented in this plot by a point, and their locations indicate the score of each observation for the two principal components in the plot. For example, points near the left edge of this plot have the lowest scores for the first principal component. The points are scaled to fit within the unit square, so only their relative locations may be determined from the plot.
You can use the Data Cursor, in the Tools menu in the figure window, to identify the items in this plot. By clicking on a variable (vector), you can read off that variable's coefficients for each principal component. By clicking on an observation (point), you can read off that observation's scores for each principal component.
You can also make a biplot in three dimensions. This can be useful if the first two principal coordinates do not explain enough of the variance in your data. Selecting Rotate 3D in the Tools menu enables you to rotate the figure to see it from different angles.
biplot(coefs(:,1:3), 'scores',scores(:,1:3),... 'obslabels',names); axis([-.26 1 -.51 .51 -.61 .81]); view([30 40]);

Multivariate data often includes a large number of measured variables, and sometimes those variables overlap, in the sense that groups of them might be dependent. For example, in a decathlon, each athlete competes in 10 events, but several of them can be thought of as speed events, while others can be thought of as strength events, etc. Thus, you can think of a competitor's 10 event scores as largely dependent on a smaller set of three or four types of athletic ability.
Factor analysis is a way to fit a model to multivariate data to estimate just this sort of interdependence. In a factor analysis model, the measured variables depend on a smaller number of unobserved (latent) factors. Because each factor might affect several variables in common, they are known as common factors. Each variable is assumed to be dependent on a linear combination of the common factors, and the coefficients are known as loadings. Each measured variable also includes a component due to independent random variability, known as specific variance because it is specific to one variable.
Specifically, factor analysis assumes that the covariance matrix of your data is of the form
![]()
where
is the matrix of loadings, and the elements of
the diagonal matrix
are the specific variances. The function factoran fits the Factor Analysis model
using maximum likelihood.
Factor Loadings. Over the course of 100 weeks, the percent change in stock prices for ten companies has been recorded. Of the ten companies, the first four can be classified as primarily technology, the next three as financial, and the last three as retail. It seems reasonable that the stock prices for companies that are in the same sector might vary together as economic conditions change. Factor Analysis can provide quantitative evidence that companies within each sector do experience similar week-to-week changes in stock price.
In this example, you first load the data, and then call factoran, specifying a model fit with three common factors. By default, factoran computes rotated estimates of the loadings to try and make their interpretation simpler. But in this example, you specify an unrotated solution.
load stockreturns [Loadings,specificVar,T,stats] = ... factoran(stocks,3,'rotate','none');
The first two factoran return arguments are the estimated loadings and the estimated specific variances. Each row of the loadings matrix represents one of the ten stocks, and each column corresponds to a common factor. With unrotated estimates, interpretation of the factors in this fit is difficult because most of the stocks contain fairly large coefficients for two or more factors.
Loadings
Loadings =
0.8885 0.2367 -0.2354
0.7126 0.3862 0.0034
0.3351 0.2784 -0.0211
0.3088 0.1113 -0.1905
0.6277 -0.6643 0.1478
0.4726 -0.6383 0.0133
0.1133 -0.5416 0.0322
0.6403 0.1669 0.4960
0.2363 0.5293 0.5770
0.1105 0.1680 0.5524Note Factor Rotation helps to simplify the structure in the Loadings matrix, to make it easier to assign meaningful interpretations to the factors. |
From the estimated specific variances, you can see that the model indicates that a particular stock price varies quite a lot beyond the variation due to the common factors.
specificVar
specificVar =
0.0991
0.3431
0.8097
0.8559
0.1429
0.3691
0.6928
0.3162
0.3311
0.6544A specific variance of 1 would indicate that there is no common factor component in that variable, while a specific variance of 0 would indicate that the variable is entirely determined by common factors. These data seem to fall somewhere in between.
The p-value returned in the stats structure fails to reject the null hypothesis of three common factors, suggesting that this model provides a satisfactory explanation of the covariation in these data.
stats.p
ans =
0.8144To determine whether fewer than three factors can provide an acceptable fit, you can try a model with two common factors. The p-value for this second fit is highly significant, and rejects the hypothesis of two factors, indicating that the simpler model is not sufficient to explain the pattern in these data.
[Loadings2,specificVar2,T2,stats2] = ...
factoran(stocks, 2,'rotate','none');
stats2.p
ans =
3.5610e-006Factor Rotation. As the results illustrate, the estimated loadings from an unrotated factor analysis fit can have a complicated structure. The goal of factor rotation is to find a parameterization in which each variable has only a small number of large loadings. That is, each variable is affected by a small number of factors, preferably only one. This can often make it easier to interpret what the factors represent.
If you think of each row of the loadings matrix as coordinates of a point in M-dimensional space, then each factor corresponds to a coordinate axis. Factor rotation is equivalent to rotating those axes and computing new loadings in the rotated coordinate system. There are various ways to do this. Some methods leave the axes orthogonal, while others are oblique methods that change the angles between them. For this example, you can rotate the estimated loadings by using the promax criterion, a common oblique method.
[LoadingsPM,specVarPM] = factoran(stocks,3,'rotate','promax');
LoadingsPM
LoadingsPM =
0.9452 0.1214 -0.0617
0.7064 -0.0178 0.2058
0.3885 -0.0994 0.0975
0.4162 -0.0148 -0.1298
0.1021 0.9019 0.0768
0.0873 0.7709 -0.0821
-0.1616 0.5320 -0.0888
0.2169 0.2844 0.6635
0.0016 -0.1881 0.7849
-0.2289 0.0636 0.6475Promax rotation creates a simpler structure in the loadings, one in which most of the stocks have a large loading on only one factor. To see this structure more clearly, you can use the biplot function to plot each stock using its factor loadings as coordinates.
biplot(LoadingsPM,'varlabels',num2str((1:10)')); axis square view(155,27);

This plot shows that promax has rotated the factor loadings to a simpler structure. Each stock depends primarily on only one factor, and it is possible to describe each factor in terms of the stocks that it affects. Based on which companies are near which axes, you could reasonably conclude that the first factor axis represents the financial sector, the second retail, and the third technology. The original conjecture, that stocks vary primarily within sector, is apparently supported by the data.
Factor Scores. Sometimes, it is useful to be able to classify an observation based on its factor scores. For example, if you accepted the three-factor model and the interpretation of the rotated factors, you might want to categorize each week in terms of how favorable it was for each of the three stock sectors, based on the data from the 10 observed stocks.
Because the data in this example are the raw stock price changes, and not just their correlation matrix, you can have factoran return estimates of the value of each of the three rotated common factors for each week. You can then plot the estimated scores to see how the different stock sectors were affected during each week.
[LoadingsPM,specVarPM,TPM,stats,F] = ...
factoran(stocks, 3,'rotate','promax');
plot3(F(:,1),F(:,2),F(:,3),'b.')
line([-4 4 NaN 0 0 NaN 0 0], [0 0 NaN -4 4 NaN 0 0],...
[0 0 NaN 0 0 NaN -4 4], 'Color','black')
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
grid on
axis square
view(-22.5, 8)

Oblique rotation often creates factors that are correlated. This plot shows some evidence of correlation between the first and third factors, and you can investigate further by computing the estimated factor correlation matrix.
inv(TPM'*TPM)
ans =
1.0000 0.1559 0.4082
0.1559 1.0000 -0.0559
0.4082 -0.0559 1.0000
Visualizing the Results. You can use the biplot function to help visualize both the factor loadings for each variable and the factor scores for each observation in a single plot. For example, the following command plots the results from the factor analysis on the stock data and labels each of the 10 stocks.
biplot(LoadingsPM,'scores',F,'varlabels',num2str((1:10)'))
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
axis square
view(155,27)

In this case, the factor analysis includes three factors, and so the biplot is three-dimensional. Each of the 10 stocks is represented in this plot by a vector, and the direction and length of the vector indicates how each stock depends on the underlying factors. For example, you have seen that after promax rotation, the first four stocks have positive loadings on the first factor, and unimportant loadings on the other two factors. That first factor, interpreted as a financial sector effect, is represented in this biplot as one of the horizontal axes. The dependence of those four stocks on that factor corresponds to the four vectors directed approximately along that axis. Similarly, the dependence of stocks 5, 6, and 7 primarily on the second factor, interpreted as a retail sector effect, is represented by vectors directed approximately along that axis.
Each of the 100 observations is represented in this plot by a point, and their locations indicate the score of each observation for the three factors. For example, points near the top of this plot have the highest scores for the technology sector factor. The points are scaled to fit within the unit square, so only their relative locations can be determined from the plot.
You can use the Data Cursor tool from the Tools menu in the figure window to identify the items in this plot. By clicking a stock (vector), you can read off that stock's loadings for each factor. By clicking an observation (point), you can read off that observation's scores for each factor.
![]() | Feature Selection | Cluster Analysis | ![]() |
| © 1984-2008- The MathWorks, Inc. - Site Help - Patents - Trademarks - Privacy Policy - Preventing Piracy - RSS |