image thumbnail
from New Statistics Tools in MATLAB(R) and the Statistics ToolboxTM by Dan Doherty
M-files used in the webinar held on November 18, 2004.

multivariatescript.m
%% The Data
% This data set based on one found in "Statistical case studies for
% industrial process improvement" by Czitrom and Spagon, but I generated it
% to illustrate features to be demonstrated here.  It consists of oxide
% layer thickness measured in 9 locations on each of 116 semiconductor
% wafers. We also have a record of which of 3 processing types was used for
% each wafer.
%
% The measurements were taken by position on the wafer as follows.  Note
% that the first is in the center, the next 4 halfway out, and the last 4
% on the edge.
t = linspace(0,2*pi); fill(2*sin(t),2*cos(t),[0.6 1 0.6]); axis square; axis off
x = [0 0 -1  0 1 -1.3 -1.3 1.3 1.3];
y = [0 1  0 -1 0 1.3 -1.3 -1.3 1.3];
for j=1:length(x)
    text(x(j),y(j),num2str(j),'fontsize',12,'fontweight','bold')
end
shg

%% Plot two or three variables
% How can we look at this nine-dimensional data set?  We can start by
% looking at just two of the dimensions using a scatter plot.  We'll plot
% the thickness at positions 7 and 9, which are at opposite ends of the
% wafer.  Note the correlation and the presence of outliers.  One wafer is
% thicker in both places, another relatively thin at one edge and thick at
% the opposite edge.  As we rotate we can also see the center thickness.
% We can see more structure if we rotate the plot, for instance more
% outliers come into view.  Not the two on the left and right that we saw
% before, plus two others now at the bottom.
load webinardata
plot3(X(:,7),X(:,9),X(:,1),'x'); view(2); shg

%% Parallel coordinates plot
% We can look at all coordinates at the same time by creating a parallel
% coordiantes plot.  Each wafer is drawn as a line connecting its 9
% thickness values.  One wafer appears to have a uniformly thicker coating.
% It's hard to see the other wafers.  There are ways to make things easier
% to see.
parallelcoords(X); shg

%% Grouped parallel coordinates
% We code the wafers by processing type.  Apparently group E has higher
% thickness at later coordinates (edge of the wafer).  Groups I and V
% overlap a lot, but I may be a bit higher.
parallelcoords(X,'group',WTYP); shg

%% Quantiles of parallel coordinate groups
% We show just the group medians and quartiles to reduce clutter.  It
% appears group E is generally lower in the interior and higher toward the
% edge.
h = parallelcoords(X,'group',WTYP,'quantile',0.25);
set(h(1:3:end),'linewidth',2); shg

%% Andrews plot
% The Andrews plot uses the coordinates as terms in a Fourier series, so
% each wafer becomes a curve.  We'll plot standardized data.  Some
% projections have more variability than others (.04 vs .4).  Some show
% group difference better than others (0.18).  For any angle we can compute
% the Fourier terms, which are coefficients of a linear combination of the
% original standardized variables.
andrewsplot(X,'standardize','on','group',WTYP); shg
% t = [[0.04;0.18;0.4]*2*pi];
% [[1;1;1]/sqrt(2),sin(t),cos(t),sin(2*t),cos(2*t),sin(3*t),cos(3*t),sin(4*t),cos(4*t)]

%% Glyph plot
% We can look for patterns by plotting each wafer as a star or polygon,
% with radii determined by the coordinates.  Note 101 is high all around.
% We can also see other patterns (1, 15, 44, etc).  We can use the data
% cursor to click on a vertex of a star to get the value of the variable
% plotted in that direction, or on the center to get all variable values.
glyphplot(X); shg

%% Scrolling glyph plot
% In order to see more detail we can make larger stars and scroll them.
% Note data cursors.  We could draw faces instead.
glyphplot(X,'grid',[4 5],'page','scroll'); shg

%% Glyphs with mds placement
% To help us find patterns, we can use a technique known as
% multidimensional scaling (more about that later) to position similar
% glyphs together.  We might want to change the window size and zoom to
% omit the extreme point.
oldpos = get(gcf,'Position');
close
figure('position',oldpos)
D = pdist(X);
[Y,eigs] = cmdscale(D);
glyphplot(X,'center',Y(:,1:2),'radius',1); shg

%% Hierarchical cluster analysis
% Now let's move from graphical techniques to other analysis techniques.
% First we'll perform cluster analysis, which does have a nice graphical
% representation.  We take the 116 wafers and draw a tree diagram linking
% wafers that are close to each other as determined by the 9 thickness
% measurements.  It appears we have about 4 clusters plus additional
% outliers.  Zoom in on the bottom two clusters.  By doing this
% hierarchical clustering we have assigned a cluster number to each wafer.
d = pdist(X);
z = linkage(d,'average');
dendrogram(z,0,'ori','r','col',11);
clustnum = cluster(z,'maxclust',12); shg

%% Cross tabulation
% Let's see how the clusters correspond with processing type by doing a
% cross-tabulation.  The largest cluster has a mixture of two types, the
% next largest is a different type, and so on.  The E type appears to be
% different from the other two.  Six of the clusters are just single-point
% clusters, or outliers.
clc;
commandwindow;
cluster_by_processingtype = crosstab(clustnum,WTYP)

%% MANOVA
% Another multivariate technique available in the Statistics ToolboxTM is
% multivariate analysis of variance, or MANOVA.  MANOVA is a more formal
% test for differences between groups in higher-dimensional data.  Let's
% see if the distribution of the measurements differs by processing type.
% The p-value for testing whether the group means might be at the same
% point (in a space of 0 dimension) is very small.  The p-value for testing
% whether the three group means lie along a line is moderate, about 0.1.
[d,p] = manova1(X,WTYP);
clc
commandwindow;
disp(sprintf('Dimension   p-value'))
disp(sprintf('%5d       %.4f\n',[0,p(1),1,p(2)]))

%% Principal components
% Remember we began by trying to figure out how to plot 9-dimensional data
% in just 2 or 3 dimensions.  An alternative would be to reduce the
% dimensionality from 9 dimensions to something more manageable.  Let's use
% principal components and see if the points can be represented in fewer
% dimensions.  This graph shows how much of the variance in our data can be
% explained by lower-dimensional principal components.  Not the variance
% explained by the first component is about 37, and the next is about 17.
% The remainder are lower values in the range 3-8.  This suggests that two
% components might be the right number.
[n,p] = size(X);
[coef,score,root] = princomp(X);
plot(root,'o-');
xlabel('Number of components'); ylabel('Variance explained')
shg

%% Create a biplot
% The biplot is a plot that can help us interpret principal components
% (also factor analysis).  It has a dot for each original observation and a
% ray for each original variable, and the axes represent the first two
% principal components.  The components are weighted averages of the
% original variables, with weights given by the coordinates of the ends of
% the rays.  Notice that all the rays have positive x coordinates.  This
% indicates that the first component is an average, or a measure of overall thickness.
% The rays have positive y values for coordinates 1-5 (interior of the
% wafer) and negative y values for coordinates 6-9 (perimeter of the
% wafer).  This indicates that the second component is a measure of radial
% uniformity.  We can use data cursors to determine which variables go with
% which rays, and to identify the outlier.
xlabel('Dimension'); ylabel('Latent root');
biplot(coef(:,1:2),'score',score(:,1:2));
shg

%% Multidimensonal scaling
% Multidimensional scaling also can reduce dimensionality.  It works by
% creating new data matching a given set of distances.  Let's reconstruct a
% two-dimensional data set trying to match our original distances.  These
% are the Euclidean distances between pairs of our 116 original points in
% 9-dimensional space.  The plot shows both the principal component (PC)
% representation and the multidimensional scaling (MDS) representations.
D = pdist(X);
[Y,eigs] = cmdscale(D);
[err,Z,tr] = procrustes(score,Y);
h = plot(score(:,1),score(:,2),'b.', Y(:,1),Y(:,2),'ro'); shg
legend({'PCA Solution' 'CMDS Original'}, 'Location','NorthOutside');

%% Procrustes rotation
% The PC and MDS points look very different, but it appears that if we
% rotated the MDS points 180 degrees they would be close to the PC points.
% There is a technique known as Procrustes rotation that can be used to
% align sets of points this way.  We'll try that, and see that they line up
% exactly.  This is true because the two techniques give the same results
% when we use classical MDS based on Euclidean distances.
h = plot(score(:,1),score(:,2),'b.', Y(:,1),Y(:,2),'ro');
set(h(2),'color',[.8 .8 .8]);
line(Z(:,1),Z(:,2),'Marker','o','Color','r','LineStyle','none'); shg
legend({'PCA Solution' 'CMDS Original' 'CMDS Rotated'}, 'Location','NorthOutside');

%% Reconstructing distances
% Finally let's compare the distances between the original points in
% 9-dimensional with the PC or MDS points reconstructed in two dimensions.
% They line up well.  Notice that the original distances include distance
% contributions in all 9 dimensions, so they are often larger then the
% reconstructed distances.
reconDist = pdist(Z(:,1:2));
plot(D,reconDist,'.'); shg
xlabel('Original Distances'); ylabel('Distances in Reconstruction');

corr(D',reconDist','type','spearman')

Contact us at files@mathworks.com