When you examine a data plot, you might find that some points appear to differ dramatically from the rest of the data. In some cases, it is reasonable to consider such points outliers, or data values that appear to be inconsistent with the rest of the data.
The following example illustrates how to remove outliers from three data sets in the 24-by-3 matrix count. In this case, an outlier is defined as a value that is more than three standard deviations away from the mean.
Caution: Be cautious about changing data unless you are confident that you understand the source of the problem you want to correct. Removing an outlier has a greater effect on the standard deviation than on the mean of the data. Deleting one such point leads to a smaller new standard deviation, which might result in making some remaining points appear to be outliers!
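The effect described in this caution can be sketched on a small synthetic vector. The data and the variable names (x, xTrim, isOut1, isOut2) are hypothetical and are not part of count.dat; the three-standard-deviation rule is the one used in this example.

```matlab
% Hypothetical data: twenty zeros, one moderate value, one extreme value
x = [zeros(20,1); 10; 100];

% First pass: flag values more than three standard deviations from the mean
isOut1 = abs(x - mean(x)) > 3*std(x);    % flags only the value 100

% Remove the flagged value and repeat the same test on what remains
xTrim  = x(~isOut1);
isOut2 = abs(xTrim - mean(xTrim)) > 3*std(xTrim);   % now flags the value 10

% Compare the standard deviation before and after the removal
sigmaBefore = std(x)
sigmaAfter  = std(xTrim)
```

In this sketch the standard deviation drops from about 21.3 to about 2.2 once the value 100 is removed, so the value 10, which passed the first test, fails the second one.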
% Import the sample data
load count.dat;

% Calculate the mean and the standard deviation
% of each data column in the matrix
mu = mean(count)
sigma = std(count)
The Command Window displays
mu =
   32.0000   46.5417   65.5833

sigma =
   25.3703   41.4057   68.0281
When an outlier is considered to be more than three standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:
[n,p] = size(count);

% Create a matrix of mean values by
% replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);

% Create a matrix of standard deviation values by
% replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);

% Create a matrix of zeros and ones, where ones indicate
% the location of outliers
outliers = abs(count - MeanMat) > 3*SigmaMat;

% Calculate the number of outliers in each column
nout = sum(outliers)
The procedure returns the following number of outliers in each column:
nout =
     1     0     0
There is one outlier in the first data column of count and none in the other two columns.
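In MATLAB R2016b and later, implicit expansion lets a row vector be combined with a matrix directly, so the repmat calls can be omitted. The following is a minimal sketch of that alternative, not the method used in this example. It runs on a hypothetical 22-by-2 matrix named data rather than on count, and checks that both approaches flag the same elements.

```matlab
% Hypothetical 22-by-2 data matrix (not count.dat)
data = [[zeros(20,1); 10; 100], (1:22)'];

mu    = mean(data);     % 1-by-2 row vector of column means
sigma = std(data);      % 1-by-2 row vector of column standard deviations

% Approach used in the text: replicate mu and sigma for every row
n = size(data,1);
outliersRepmat = abs(data - repmat(mu,n,1)) > 3*repmat(sigma,n,1);

% Assumed alternative: rely on implicit expansion (R2016b or later)
outliersImplicit = abs(data - mu) > 3*sigma;

% Both logical matrices should be identical
sameResult = isequal(outliersRepmat, outliersImplicit)

% Number of outliers in each column
nout = sum(outliersImplicit)
```

For this synthetic matrix, only the value 100 in the first column exceeds the threshold, so nout is 1 for the first column and 0 for the second.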
To remove an entire row of data containing the outlier, type
count(any(outliers,2),:) = [];
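Here, any(outliers,2) returns a column vector that contains a 1 for each row in which any element of the outliers matrix is nonzero. The argument 2 specifies that any works along the second dimension of the outliers matrix (across its columns), so each row is tested as a whole. Rows of count that are flagged in this way are then deleted.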
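To see how the dimension argument of any affects row removal, consider the following small sketch. The matrices flags and A are hypothetical stand-ins for outliers and count.

```matlab
% Hypothetical 4-by-3 logical matrix marking outlier positions
flags = logical([0 0 0; 1 0 0; 0 0 0; 0 1 1]);

% Test each row: result is a 4-by-1 column vector
rowHasOutlier = any(flags,2);    % [0; 1; 0; 1]

% For contrast, test each column: result is a 1-by-3 row vector
colHasOutlier = any(flags,1);    % [1 1 1]

% Hypothetical 4-by-3 data matrix with columns 1:4, 5:8, and 9:12
A = reshape(1:12,4,3);

% Delete every row that contains at least one flagged element
A(rowHasOutlier,:) = [];         % rows 2 and 4 are removed
```

After the deletion, A is the 2-by-3 matrix [1 5 9; 3 7 11]. Using any(flags,1) as a row index instead would not select the rows that contain outliers, because it reports one result per column.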