%% The Data
% This data set based on one found in "Statistical case studies for
% industrial process improvement" by Czitrom and Spagon, but I generated it
% to illustrate features to be demonstrated here. It consists of oxide
% layer thickness measured in 9 locations on each of 116 semiconductor
% wafers. We also have a record of which of 3 processing types was used for
% each wafer.
%
% The measurements were taken by position on the wafer as follows. Note
% that the first is in the center, the next 4 halfway out, and the last 4
% on the edge.
t = linspace(0,2*pi); fill(2*sin(t),2*cos(t),[0.6 1 0.6]); axis square; axis off
x = [0 0 -1 0 1 -1.3 -1.3 1.3 1.3];
y = [0 1 0 -1 0 1.3 -1.3 -1.3 1.3];
for j=1:length(x)
text(x(j),y(j),num2str(j),'fontsize',12,'fontweight','bold')
end
shg
%% Plot two or three variables
% How can we look at this nine-dimensional data set? We can start by
% looking at just two of the dimensions using a scatter plot. We'll plot
% the thickness at positions 7 and 9, which are at opposite ends of the
% wafer. Note the correlation and the presence of outliers. One wafer is
% thicker in both places, another relatively thin at one edge and thick at
% the opposite edge. As we rotate we can also see the center thickness.
% We can see more structure if we rotate the plot, for instance more
% outliers come into view. Not the two on the left and right that we saw
% before, plus two others now at the bottom.
load webinardata
plot3(X(:,7),X(:,9),X(:,1),'x'); view(2); shg
%% Parallel coordinates plot
% We can look at all coordinates at the same time by creating a parallel
% coordiantes plot. Each wafer is drawn as a line connecting its 9
% thickness values. One wafer appears to have a uniformly thicker coating.
% It's hard to see the other wafers. There are ways to make things easier
% to see.
parallelcoords(X); shg
%% Grouped parallel coordinates
% We code the wafers by processing type. Apparently group E has higher
% thickness at later coordinates (edge of the wafer). Groups I and V
% overlap a lot, but I may be a bit higher.
parallelcoords(X,'group',WTYP); shg
%% Quantiles of parallel coordinate groups
% We show just the group medians and quartiles to reduce clutter. It
% appears group E is generally lower in the interior and higher toward the
% edge.
h = parallelcoords(X,'group',WTYP,'quantile',0.25);
set(h(1:3:end),'linewidth',2); shg
%% Andrews plot
% The Andrews plot uses the coordinates as terms in a Fourier series, so
% each wafer becomes a curve. We'll plot standardized data. Some
% projections have more variability than others (.04 vs .4). Some show
% group difference better than others (0.18). For any angle we can compute
% the Fourier terms, which are coefficients of a linear combination of the
% original standardized variables.
andrewsplot(X,'standardize','on','group',WTYP); shg
% t = [[0.04;0.18;0.4]*2*pi];
% [[1;1;1]/sqrt(2),sin(t),cos(t),sin(2*t),cos(2*t),sin(3*t),cos(3*t),sin(4*t),cos(4*t)]
%% Glyph plot
% We can look for patterns by plotting each wafer as a star or polygon,
% with radii determined by the coordinates. Note 101 is high all around.
% We can also see other patterns (1, 15, 44, etc). We can use the data
% cursor to click on a vertex of a star to get the value of the variable
% plotted in that direction, or on the center to get all variable values.
glyphplot(X); shg
%% Scrolling glyph plot
% In order to see more detail we can make larger stars and scroll them.
% Note data cursors. We could draw faces instead.
glyphplot(X,'grid',[4 5],'page','scroll'); shg
%% Glyphs with mds placement
% To help us find patterns, we can use a technique known as
% multidimensional scaling (more about that later) to position similar
% glyphs together. We might want to change the window size and zoom to
% omit the extreme point.
oldpos = get(gcf,'Position');
close
figure('position',oldpos)
D = pdist(X);
[Y,eigs] = cmdscale(D);
glyphplot(X,'center',Y(:,1:2),'radius',1); shg
%% Hierarchical cluster analysis
% Now let's move from graphical techniques to other analysis techniques.
% First we'll perform cluster analysis, which does have a nice graphical
% representation. We take the 116 wafers and draw a tree diagram linking
% wafers that are close to each other as determined by the 9 thickness
% measurements. It appears we have about 4 clusters plus additional
% outliers. Zoom in on the bottom two clusters. By doing this
% hierarchical clustering we have assigned a cluster number to each wafer.
d = pdist(X);
z = linkage(d,'average');
dendrogram(z,0,'ori','r','col',11);
clustnum = cluster(z,'maxclust',12); shg
%% Cross tabulation
% Let's see how the clusters correspond with processing type by doing a
% cross-tabulation. The largest cluster has a mixture of two types, the
% next largest is a different type, and so on. The E type appears to be
% different from the other two. Six of the clusters are just single-point
% clusters, or outliers.
clc;
commandwindow;
cluster_by_processingtype = crosstab(clustnum,WTYP)
%% MANOVA
% Another multivariate technique available in the Statistics ToolboxTM is
% multivariate analysis of variance, or MANOVA. MANOVA is a more formal
% test for differences between groups in higher-dimensional data. Let's
% see if the distribution of the measurements differs by processing type.
% The p-value for testing whether the group means might be at the same
% point (in a space of 0 dimension) is very small. The p-value for testing
% whether the three group means lie along a line is moderate, about 0.1.
[d,p] = manova1(X,WTYP);
clc
commandwindow;
disp(sprintf('Dimension p-value'))
disp(sprintf('%5d %.4f\n',[0,p(1),1,p(2)]))
%% Principal components
% Remember we began by trying to figure out how to plot 9-dimensional data
% in just 2 or 3 dimensions. An alternative would be to reduce the
% dimensionality from 9 dimensions to something more manageable. Let's use
% principal components and see if the points can be represented in fewer
% dimensions. This graph shows how much of the variance in our data can be
% explained by lower-dimensional principal components. Not the variance
% explained by the first component is about 37, and the next is about 17.
% The remainder are lower values in the range 3-8. This suggests that two
% components might be the right number.
[n,p] = size(X);
[coef,score,root] = princomp(X);
plot(root,'o-');
xlabel('Number of components'); ylabel('Variance explained')
shg
%% Create a biplot
% The biplot is a plot that can help us interpret principal components
% (also factor analysis). It has a dot for each original observation and a
% ray for each original variable, and the axes represent the first two
% principal components. The components are weighted averages of the
% original variables, with weights given by the coordinates of the ends of
% the rays. Notice that all the rays have positive x coordinates. This
% indicates that the first component is an average, or a measure of overall thickness.
% The rays have positive y values for coordinates 1-5 (interior of the
% wafer) and negative y values for coordinates 6-9 (perimeter of the
% wafer). This indicates that the second component is a measure of radial
% uniformity. We can use data cursors to determine which variables go with
% which rays, and to identify the outlier.
xlabel('Dimension'); ylabel('Latent root');
biplot(coef(:,1:2),'score',score(:,1:2));
shg
%% Multidimensonal scaling
% Multidimensional scaling also can reduce dimensionality. It works by
% creating new data matching a given set of distances. Let's reconstruct a
% two-dimensional data set trying to match our original distances. These
% are the Euclidean distances between pairs of our 116 original points in
% 9-dimensional space. The plot shows both the principal component (PC)
% representation and the multidimensional scaling (MDS) representations.
D = pdist(X);
[Y,eigs] = cmdscale(D);
[err,Z,tr] = procrustes(score,Y);
h = plot(score(:,1),score(:,2),'b.', Y(:,1),Y(:,2),'ro'); shg
legend({'PCA Solution' 'CMDS Original'}, 'Location','NorthOutside');
%% Procrustes rotation
% The PC and MDS points look very different, but it appears that if we
% rotated the MDS points 180 degrees they would be close to the PC points.
% There is a technique known as Procrustes rotation that can be used to
% align sets of points this way. We'll try that, and see that they line up
% exactly. This is true because the two techniques give the same results
% when we use classical MDS based on Euclidean distances.
h = plot(score(:,1),score(:,2),'b.', Y(:,1),Y(:,2),'ro');
set(h(2),'color',[.8 .8 .8]);
line(Z(:,1),Z(:,2),'Marker','o','Color','r','LineStyle','none'); shg
legend({'PCA Solution' 'CMDS Original' 'CMDS Rotated'}, 'Location','NorthOutside');
%% Reconstructing distances
% Finally let's compare the distances between the original points in
% 9-dimensional with the PC or MDS points reconstructed in two dimensions.
% They line up well. Notice that the original distances include distance
% contributions in all 9 dimensions, so they are often larger then the
% reconstructed distances.
reconDist = pdist(Z(:,1:2));
plot(D,reconDist,'.'); shg
xlabel('Original Distances'); ylabel('Distances in Reconstruction');
corr(D',reconDist','type','spearman')