image thumbnail

Multivariate Data Analysis and Monitoring for the Process Industries

by

 

26 Feb 2008 (Updated )

Files from the webinar Multivariate Data Analysis and Monitoring for the Process Industries

Multivariate Statistical Process Monitoring of a Copper Production Process

Multivariate Statistical Process Monitoring of a Copper Production Process

We will be analysing data from a continuous process of electrolytic copper production at Boliden AB (Skelleftehamn, Sweden).

The aim of the analysis is firstly, to gain an understanding of the systematic variation present in the process (its range of normal behaviour), and secondly to design a method of monitoring the process to capture departures from normal behaviour.

We start by attempting to use traditional univariate methods to understand the process, and rapidly find that this is hard. We therefore turn to multivariate methods, using Principal Component Analysis.

Finally, we discuss how the PCA model that we have developed might be deployed to monitor live data. We do this in two different ways. Firstly we incorporate the PCA model plots into a GUI, and connect the plots to a live feed of the data using the OPC toolbox: creating a 'dashboard' view of the live process to enable easy online monitoring of the process behaviour. This GUI can be deployed as a standalone executable using the MATLAB Compiler. Secondly, we take the PCA plots and incorporate them into a webpage. The webpage is designed in ASP.NET, and communicates with a database to access the data: enabling remote monitoring of the system via the web or intranet.

Sam Roberts

Contents

PRODUCTS REQUIRED

Most of this demonstration requires only MATLAB and Statistics Toolbox. Accessing data via OPC requires OPC Toolbox. Although the GUI application ProcessMonitor can be created and executed with the above products, compiling to a standalone executable also requires MATLAB Compiler. The deployed web application requires Database Toolbox to access data from databases, and to compile it to a .NET assembly for web deployment requires MATLAB Builder for .NET.

SECTION I: BUILDING A MODEL OF THE PROCESS

Load in data

Two copper samples were measured each day for one year, and the levels of eight metal impurities (Ag, Ni, Pb, Bi, Sb, As, Te, Se) were recorded. The dataset contains 730 measurements (2/day for 365 days) of the 8 impurities: these are contained in the array 'metals'. There are also two arrays 'varnames' and 'obsnames' containing variable and observation names. The variable 'TAI' refers to 'Total Analysis Index', a measure devised by the dataset originators as a summary metric of sample quality. It is instructive to compare the analysis here to the TAI summary metric.

load metal
whos
[numobs, numvars] = size(metals);
  Name            Size            Bytes  Class     Attributes

  TAI           730x1              5840  double              
  metals        730x8             46720  double              
  obsname       730x1             62780  cell                
  varnames        1x8               512  cell                

Overview of the variables

Let's start off by examining the distributions of each of the variables (metal impurities).

boxplot(metals,'Labels',varnames); %#ok

xlabel('Metal');
ylabel('Level of Impurity');

Create standardised data

We can see from the boxplot that the variances of the metal impurities are not equal. We would like to be able to consider each variable on the same scale, so let's standardise them to have a mean of 0 and a variance of 1.

means = mean(metals);
sds = std(metals);
metalstd = (metals - repmat(means,numobs,1)) ./ repmat(sds,numobs,1);

boxplot(metalstd,'Labels', varnames);

xlabel('Metal');
ylabel('Level of Impurity (Standardised)');

shg; % Bring the current figure to the front

Attempt to visualise the variables

We need to understand how the variables change over time, and how they are related to each other. Let's attempt to get a picture of those things by plotting all the variables together on the same plot.

figure;
plot(metalstd);

xlim([0, numobs]);
xlabel('Observation Number');
ylabel('Level of Impurity (Standardised)');
legend(varnames, 'Location', 'NorthEastOutside');

Make a better attempt

That clearly wasn't sensible - we couldn't see anything properly, as there was just too much information. We can at least separate out the eight variables into their own plots.

figure('Units', 'Normalized', 'Position', [0.6,0.1,0.4,0.8]);
suptitle('Metal Impurities over Time');

haxis(numvars) = 0;
for i = 1:numvars
    haxis(i) = subplot(numvars, 1, i);
    plot(metalstd(:, i));
    ylabel(varnames{i});
    xlim([0, numobs]);
    if i < numvars
        set(haxis(i), 'XTickLabel', []);
    else
        xlabel('Observation Number');
    end
end

That was a little better - for each variable, it's possible to see how it varies over time, and it's maybe possible to see that, for example, Se seems to be correlated with Te. We can improve the view of each variable by placing each variable on the same scale, and adding control limits at 2 and 3 standard deviations to give us an indication of exceptional behaviour.

for ax = 1:numvars
    axes(haxis(ax));
    ylim([-5, 5]);
    upperlimits = [4, 3, -2, -3];
    lowerlimits = [3, 2, -3, -4];
    colors = {'r', 'y', 'y', 'r'};
    for i = 1:4
        patch([0, 0, numobs, numobs],...
            [upperlimits(i), lowerlimits(i), lowerlimits(i), upperlimits(i)],...
            colors{i}, 'EdgeColor', 'none', 'FaceAlpha', 0.4);
    end
end
shg;

Visualising observations

This gives us an overview of the dataset, although it's still quite difficult to get a deep understanding from this graphic of how the process variables are related to each other - we can't see the wood for the trees. We could try instead to focus just on a single timepoint, eg observation 1. Let's plot in polar coordinates for a nice visualisation. This gives us a picture of all the variables, but just at a single timepoint.

figure;

offset = 4;

[X,Y] = pol2cart(...
    (0:1/numvars:1)*2*pi,...
    [metalstd(1,:), metalstd(1,1)] + offset);
plot(X,Y);

axis off square;
hold on;

% Plot upper and lower control limits at 2 and 3 sd

for i = 1:4
    [X,Y] = pol2cart(...
        [0:1/99:1,1:-1/99:0]*2*pi,...
        [ones(1,100)* (upperlimits(i)+offset),...
        ones(1,100)* (lowerlimits(i)+offset)]);
    patch(X, Y, colors{i}, 'EdgeColor', 'none', 'FaceAlpha', 0.4);
end

% Label variables nicely

[X,Y] = pol2cart(...
    (0:1/numvars:1)*2*pi,...
    ones(1,numvars+1)*(4 + offset));
for i = 1:numvars
    h = text(X(i), Y(i), varnames{i});
    circlepos = (i-1)/numvars;
    if circlepos < 0.25
        set(h, 'VerticalAlignment', 'bottom', 'HorizontalAlignment', 'left');
    elseif circlepos == 0.25
        set(h, 'VerticalAlignment', 'bottom', 'HorizontalAlignment', 'center');
    elseif circlepos < 0.5
        set(h, 'VerticalAlignment', 'bottom', 'HorizontalAlignment', 'right');
    elseif circlepos == 0.5
        set(h, 'VerticalAlignment', 'middle', 'HorizontalAlignment', 'right');
    elseif circlepos < 0.75
        set(h, 'VerticalAlignment', 'top', 'HorizontalAlignment', 'right');
    elseif circlepos == 0.75
        set(h, 'VerticalAlignment', 'top', 'HorizontalAlignment', 'center');
    elseif circlepos < 1
        set(h, 'VerticalAlignment', 'top', 'HorizontalAlignment', 'left');
    end
end

shg;

Multivariate Methods

These are nice plots, but we can still only really see one variable or observation at a time. We want to get an overview of the process as a whole. In order to achieve this we must turn to multivariate analysis methods.

Carry out PCA

We can extract a lot more information with a principal component analysis. PCA constructs new variables (principal components, or PCs) that are weighted sums of the original variables. Unlike TAI, the weights do not reflect arbitrary ideas of how important the variables might be, but are constructed so that the new variables capture a maximal proportion of the variance in the original data. By examining a small number of principal components, you can get an understanding of the process which is nearly as good as examining all of the original variables.

[loadings, scores, variances, tscores] = princomp(metalstd);

To see how well the PCA model describes the data, we display the cumulative variances accounted for by successive PCs in a pareto plot. When interpreting this plot, we're looking for a sharp bend or 'elbow' in the curve: this would indicate that the components before the elbow represented systematic, interesting variation in the process, and the later components represented noise variation; we could remove those and concentrate on the systematic components. Unfortunately, there's no clear elbow in this plot, so we'll rely on our judgement and keep just the first two components. The 2 component model accounts for 67% of the variation in the data, therefore by examining PCs 1 and 2, we are getting a view of the process which is 67% as good as examining all the original variables. 67% is not huge, however, and one might want to come back later and revisit this decision, perhaps including the first three PCs to capture 76% of the variance.

figure;
pareto(variances);
xlabel('Principal Component')
ylabel('Variance')

Create a biplot of the PCA scores and loadings.

We can use the PCA model to get a lot more insight into the behaviour of the process, by plotting the PCA scores and loadings on the same plot. Note firstly that all variables have a high loading on PC1, indicating that PC1 reflects the average level of impurity. Note secondly that in PC2, the variables fall into two distinct clusters, implying the existence of two distinct impurity profiles: one containing high levels of Sb, As and Bi (all group 5a elements in the periodic table), and low levels of Ni, Pb, Se, Ag, and Te; the other profile having the reverse pattern. The numbers on the axes do not mean anything in absolute terms.

figure
hold on;
scatter(scores(:,1), scores(:,2), 5, 'bo');
scale = max((abs(scores(:,1:2)))) ./ max((abs(loadings(:,1:2))));
scatter(loadings(:,1)*scale(1), loadings(:,2)*scale(2), 'ro', 'filled');
for i = 1:numvars
    line([0,loadings(i,1)*scale(1)],...
        [0,loadings(i,2)*scale(2)], 'Color', 'r');
    text(loadings(i,1)*scale(1)-1, loadings(i,2)*scale(2), varnames{i});
end
title('Principal Component Biplot');
xlabel('PC 1');
ylabel('PC 2');
xlim([-17,4]);
shg;

Multivariate control charts

The PCA biplot and contribution plots have given us a lot more insight into the behaviour of the process. They have one disadvantage over the univariate methods considered earlier, however, in that they don't have control limits displayed. We can extract more information from the PCA model, however, which is suitable for displaying in a control chart.

The T2 statistic gives a measure of the distance of a sample from the process mean within the plane defined by PC1 and PC2, and the squared PCA residuals give a measure of the distance of a sample perpendicular to that plane. A high T2 statistic thus indicates that a sample is exhibiting an extreme variation, but well-accounted for by the PCA model. A high residual indicates that the sample is exhibiting a form of variation which is not well-accounted for by the PCA model.

We can plot both T2 and the PCA residual on control charts.

% Calculate T2 scores and critical 95, 99 and 99.9% limits
t2 = sum((scores(:,1:2).^2)./repmat(variances(1:2)',numobs,1),2);
tcrit = (2*(numobs^2-1)/(numobs*(numobs-2)))*...
    [finv(0.999,2,numobs-2),...
    finv(0.99,2,numobs-2),...
    finv(0.95,2,numobs-2)];

% Calculate squared residuals and critical 95, 99 and 99.9% limits
residuals = (metalstd - scores(:,1:2)*loadings(:,1:2)');
dist = sqrt(sum(residuals.^2,2)./(numvars-2));
distcrit = sqrt(sum(sum(residuals.^2))./((numobs-3)*(numvars-2)))*...
    sqrt([finv(0.999,numvars-2,((numobs-3)*(numvars-2))),...
    finv(0.99,numvars-2,((numobs-3)*(numvars-2))),...
    finv(0.95,numvars-2,((numobs-3)*(numvars-2)))]);

t2distfig = figure;
t2ax = subplot(2,1,1);
plot(t2);
xlim([0,numobs]);
xlabel('Observation Number');
ylabel('T^{2}');

upperlimits = [tcrit(1),tcrit(2)];
lowerlimits = [tcrit(2),tcrit(3)];
colors = {'r', 'y'};

for i = 1:2
    patch([0, 0, numobs, numobs],...
        [upperlimits(i), lowerlimits(i), lowerlimits(i), upperlimits(i)],...
        colors{i}, 'EdgeColor', 'none', 'FaceAlpha', 0.4);
end

distax = subplot(2,1,2);
plot(dist);
xlim([0,numobs]);
xlabel('Observation Number');
ylabel('Distance to Model');

upperlimits = [distcrit(1),distcrit(2)];
lowerlimits = [distcrit(2),distcrit(3)];
colors = {'r', 'y'};

for i = 1:2
    patch([0, 0, numobs, numobs],...
        [upperlimits(i), lowerlimits(i), lowerlimits(i), upperlimits(i)],...
        colors{i}, 'EdgeColor', 'none', 'FaceAlpha', 0.4);
end

Summary of Analysis

We have used MATLAB to process and visualise a continuous process of copper production, creating some custom displays of the data to emphasise relevant properties of the process variation. Using multivariate statistics (principal component analysis), we have built a model of the process which captures the important variation in a small number of variables, enabling monitoring of the process as a whole using a few informative plots, rather than large numbers of graphs of each individual process variable. Using this approach has enabled us to discover patterns of correlation among the variables indicative of different impurity profiles; to recognise outliers, and to classify them by assigning a reason for their abnormal status.

Saving the PCA model

We can save relevant aspects of the PCA model to a .mat file for later use, and clear up MATLAB.

loadings = loadings(:,1:2);
save pcamodel means sds loadings variances scale tcrit distcrit varnames
clear all; close all force; clc;

INTERLUDE: OPC

The OPC Toolbox

Open Process Control (OPC), also known as OLE for Process Control, is a series of specifications defined by the OPC Foundation (http://www.opcfoundation.org) for supporting open connectivity in industrial automation. OPC uses Microsoft DCOM technology to provide a communication link between OPC servers and OPC clients. OPC has been designed to provide reliable communication of information in a process plant, such as a petrochemical refinery, an automobile assembly line, or a paper mill.

Using the OPC Toolbox, you can acquire live OPC data directly into MATLAB and Simulink, and write data directly to the OPC server from MATLAB and Simulink.

For this demonstration, we need to access an OPC server. Please follow the instructions contained in SettingUpTheIconicsServer.txt.

Connecting to an OPC server

Using commands from the OPC toolbox, we can connect to an OPC server.

da = opcda('localhost', 'Iconics.SimulatorOPCDA.2');
connect(da);

disp(da);
Summary of OPC Data Access Client Object: localhost/Iconics.SimulatorOPCDA.2

   Server Parameters
      Host      : localhost
      ServerID  : Iconics.SimulatorOPCDA.2
      Status    : connected
      Timeout   : 10 seconds

   Object Parameters
      Group     : 0-by-1 dagroup object
      Event Log : 0 of 1000 events

We can create a group of variables ('items') for monitoring. Here we use some of the built in variables that come with the Iconics simulation server.

monitorgroup = addgroup(da, 'Monitor Group');
additem(monitorgroup, 'Numeric.Sine');
additem(monitorgroup, 'Logical.Random');

disp(monitorgroup);
Summary of OPC Data Access Group Object: Monitor Group

   Object Parameters
      Group Type   : private
      Item         : 2-by-1 daitem object
      Parent       : localhost/Iconics.SimulatorOPCDA.2
      Update Rate  : 0.5
      Deadband     : 0%

   Object Status
      Active       : on
      Subscription : on
      Logging      : off

   Logging Parameters
      Records      : 120
      Duration     : at least 60 seconds
      Logging to   : memory
      Status       : Waiting for START.
                     0 records available for GETDATA/PEEKDATA

We can also read those variables. The data is returned in a structure which contains the values, as well as an ID, a data quality measure, a timestamp, and an error field. We can extract the values from the structure.

data = read(monitorgroup);
disp(data(1).Value);
disp(data(2).Value);
   -0.9479

     0

All of these operations can be carried out from the command line, or using the graphical user interface which comes with the OPC toolbox.

hopc = opctool; % Demo OPC tool here.

Right-click on 'OPC Network', and select 'Add Host'. Enter the hostname as 'localhost'. If you have followed the instructions earlier in SettingUpTheIconicsServer.txt, you should now see a new host 'localhost', with an OPC server 'ICONICS.SimulatorOPCDA2'. Right-click on the server, and select 'Create Client'. In the area 'MATLAB OPC Clients', you should now see a client for this server. It is disconnected (indicated in red on the right hand panel, but clicking 'Connect' will connect you. Back on the left-hand panel, select the server and then click the button from the toolbar 'View Namespace'. This displays a tree view of the signals available on the server. Select 'Numeric'->'Sine', right-click, and Add To New Group. This will add the signal (a dummy sine wave signal that ships with the simulation server) to a signal monitoring group. Selecting the signal from within the OPC client panel will now display the current value of the signal in the data panel. This can be logged to a file, or to a MATLAB variable for further analysis. opctool can also be used to troubleshoot OPC coneections.

Close the opctool

hopc.close;
clear hopc;

In SettingUpTheIconicsServer.txt we used the OPC Configuration Tool that comes with the server to set up eight simulation signals, and eight items in the address space of the server. The items were named Metals.Ag, Metals.Ni etc, to correspond to each of the variables in the dataset. Using the OPC toolbox, we can write our data to the variables, and read it back. We'll use this ability in the next section.

disconnect(da);
clear all; close all force; clc;

SECTION II: APPLYING THE MODEL IN AN ONLINE MONITORING TOOL

A GUI view of the PCA model plots

The file ProcessMonitor.m uses the OPC toolbox, and MATLAB's ability to create graphical user interfaces, to build a 'dashboard' view of an online process. A timer object reads the current value of our eight variables from the OPC server, applies our PCA model, and plots the results into the GUI. Uncomment the following line and execute it to take a look at the program.

% edit ProcessMonitor.m

Utility function

Because we don't have a real live system, just a simulation server, we have a little utility program WriteToOPC.m, which cycles through our dataset and periodically writes the next value of our variables to the OPC server. This utility is entirely disconnected from our GUI; the GUI is reading data directly from the OPC server as if it were real live data. Uncomment the following line and execute it to take a look at the program.

% edit writeToOPC.m

See the GUI in action

Uncomment the following lines to run the GUI. Leave them commented to just show a screenshot of the GUI.

% writeToOPC

% ProcessMonitor

figure;
iptsetpref('ImshowBorder','tight');
imshow('PMscreenshot.png');

Compiling the GUI

Using the MATLAB Compiler we can produce standalone executable versions of the utility program and the Process Monitor GUI. Quit MATLAB, and run writeToOPC.exe, and then ProcessMonitor.exe.

% mcc -m writeToOPC

% mcc -m ProcessMonitor

SECTION III: DEPLOYING THE MODEL TO A WEBPAGE FOR REMOTE MONITORING

Another application of common interest to those in the process industries is to display historical data, often stored in a database, via webpage. The file getCurrentProcessStateAsImage.m is a function that connects to a MySQL database which is being continuously updated with live process data. It retrieves the latest set of data, applies our PCA model, and returns the analysis graphs as an image. We compile this function to a .NET object using MATLAB Builder for .NET, and it can then be called from an ASP.NET webpage written in C#.

% Please note that here we only demonstrate how to code the web application
% via MATLAB and C#. There is a little more work to be done in Visual
% Studio in order to set the project up as a website and get it running
% that we do not document here.

% edit getCurrentProcessStateAsImage

% mcc -F MPSMdotnetobject.prj

% edit GeneratePlot.aspx.cs

References

Conny Wikström, Christer Albano, Lennart Eriksson, Håkan Fridén, Erik Johansson, Åke Nordahl, Stefan Rännar, Maria Sandberg, Nouna Kettaneh-Wold and Svante Wold (1998). Multivariate process and quality monitoring applied to an electrolysis process: Part I. Process supervision with multivariate control charts. Chemometrics and Intelligent Laboratory Systems 42(1-2): 221-231.

Contact us