When you examine a data plot, you might find that some points appear to differ dramatically from the rest of the data. In some cases, it is reasonable to consider such points outliers, or data values that appear to be inconsistent with the rest of the data.
The following example illustrates how to remove outliers from three data sets in the 24-by-3 matrix count. In this case, an outlier is defined as a value that is more than three standard deviations away from the mean.
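The three-standard-deviation rule can be written out explicitly. In the sketch below, the notation is an assumption introduced for illustration: x_ij is the value in row i and column j of count, n is the number of rows (24 here), and mu_j and sigma_j are the column mean and standard deviation. The n-1 normalization matches the default behavior of std.

```latex
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\mu_j\right)^{2}}, \qquad
x_{ij}\ \text{is an outlier if}\ \left|x_{ij}-\mu_j\right| > 3\,\sigma_j
```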
Caution: Be cautious about changing data unless you are confident that you understand the source of the problem you want to correct. Removing an outlier has a greater effect on the standard deviation than on the mean of the data. Deleting one such point produces a smaller standard deviation, which can make some of the remaining points appear to be outliers.
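The effect described in this caution can be reproduced with a small synthetic vector. The data and the variable names x, firstPass, trimmed, and secondPass below are assumptions made up for illustration; they are not taken from count.dat.

```matlab
% Synthetic data (an assumption for illustration, not count.dat):
% 27 values near 10, one moderate value (20), and one extreme value (100)
x = [repmat([9 10 11], 1, 9), 20, 100]';

% First pass: flag values more than three standard deviations from the mean
firstPass = abs(x - mean(x)) > 3*std(x);

% Remove the flagged values and repeat the same test on what remains
trimmed = x(~firstPass);
secondPass = abs(trimmed - mean(trimmed)) > 3*std(trimmed);

% Compare the statistics before and after the removal
fprintf('mean: %.2f -> %.2f\n', mean(x), mean(trimmed));
fprintf('std:  %.2f -> %.2f\n', std(x), std(trimmed));
disp(x(firstPass)')          % value flagged on the first pass
disp(trimmed(secondPass)')   % value flagged only after the first removal
```

With these values, the first pass flags only 100. Once that point is removed, the standard deviation shrinks far more than the mean shifts, and 20 is flagged on the second pass even though it passed the first test.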
% Import the sample data
load count.dat;
% Calculate the mean and the standard deviation
% of each data column in the matrix
mu = mean(count)
sigma = std(count)
The Command Window displays
mu =
   32.0000   46.5417   65.5833
sigma =
   25.3703   41.4057   68.0281
With an outlier defined as a value more than three standard deviations away from the mean, use the following code to count the outliers in each column of the count matrix:
[n,p] = size(count);
% Create a matrix of mean values by
% replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by
% replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate
% the location of outliers
outliers = abs(count - MeanMat) > 3*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
The procedure returns the following number of outliers in each column:
nout =
     1     0     0
There is one outlier in the first data column of count and none in the other two columns.
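The same logical mask can be built without replicating mu and sigma. The following is a minimal sketch, assuming MATLAB R2016b or later, where arithmetic and relational operators expand a 1-by-3 row vector across the rows of a 24-by-3 matrix automatically. The matrix data is a hypothetical stand-in for count, and viaRepmat, viaExpansion, sameMask, and sameAsBuiltin are illustrative names. The optional last step assumes the isoutlier function (R2017a or later), whose 'mean' method uses the same three-standard-deviation rule.

```matlab
% Hypothetical 24-by-3 matrix standing in for count (not the count.dat values)
data = [repmat([9;10;11],8,1), repmat([4;5;6],8,1), repmat([1;2;3],8,1)];
data(5,1) = 100;               % plant one extreme value in column 1

mu = mean(data);
sigma = std(data);

% Reference result: explicit replication with repmat
n = size(data,1);
viaRepmat = abs(data - repmat(mu,n,1)) > 3*repmat(sigma,n,1);

% Implicit expansion: mu and sigma are expanded across the rows automatically
viaExpansion = abs(data - mu) > 3*sigma;

% Count the outliers per column and confirm that both masks agree
nout = sum(viaExpansion)
sameMask = isequal(viaRepmat, viaExpansion)

% Optional cross-check, run only if isoutlier is available
if exist('isoutlier') > 0
    sameAsBuiltin = isequal(viaExpansion, isoutlier(data,'mean'))
end
```

For this stand-in matrix, the mask flags a single value in the first column and none in the others. On releases older than R2016b, bsxfun or the repmat approach is needed instead of implicit expansion.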
To remove an entire row of data containing the outlier, type
count(any(outliers,2),:) = [];
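Here, any(outliers,2) returns a column vector with one logical value per row of the outliers matrix: 1 where that row contains at least one nonzero element, and 0 otherwise. The argument 2 specifies that any operates along the second dimension of outliers, that is, across the columns of each row. Using this vector as a row index deletes every row of count that contains an outlier.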
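The role of the dimension argument is easiest to see on a small example. The matrices flags and vals below are made up for illustration, as are the names rowHasOutlier and colHasOutlier.

```matlab
% Small logical matrix used only for illustration
flags = logical([0 0 0; 1 0 0; 0 0 0; 0 1 1]);

% Dimension 2: collapse across columns, giving one result per row (4-by-1)
rowHasOutlier = any(flags, 2)

% Dimension 1: collapse down rows, giving one result per column (1-by-3)
colHasOutlier = any(flags, 1)

% Deleting the flagged rows discards every column's value in those rows
vals = reshape(1:12, 4, 3);
vals(rowHasOutlier, :) = [];
size(vals)
```

Rows 2 and 4 contain at least one flag, so they are the rows removed, leaving a 2-by-3 matrix. Note that a single flagged value causes the whole row to be deleted, including the values in columns that had no outlier.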