Detect and remove outliers in data
If A is a row or column vector, rmoutliers detects outliers and removes them.
If A is a matrix, table, or timetable, rmoutliers detects outliers in each column or variable of A separately and removes the entire row.
By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median.
Create a vector containing two outliers, and remove them. The second output, TF, identifies which elements of the input vector were detected as outliers and removed.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[B,TF] = rmoutliers(A)
B = 1×13
    57 59 60 59 58 57 58 61 62 60 62 58 57
TF = 1×15 logical array
    0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
A(TF)
ans = 1×2
    100 300
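As a cross-check, the same result can be reproduced with isoutlier, which returns the logical mask without removing anything. This is a minimal sketch that assumes rmoutliers and isoutlier apply the same default detection rule; the names mask, B2, and removed are illustrative only.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[B,TF] = rmoutliers(A);

% isoutlier flags outliers without removing them
% (assumed here to use the same default rule as rmoutliers)
mask = isoutlier(A);

% Keeping only the unflagged elements should reproduce B
B2 = A(~mask);

% The removed values are the flagged elements of the original vector
removed = A(TF);
```

Working with the mask directly is useful when the outliers should be inspected or replaced rather than dropped.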
Remove outliers of a vector where an outlier is defined as a point more than three standard deviations from the mean of the data.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[B,TF] = rmoutliers(A,'mean')
B = 1×14
    57 59 60 100 59 58 57 58 61 62 60 62 58 57
TF = 1×15 logical array
    0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
A(TF)
ans = 300
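The 'mean' rule can be checked by hand, which also shows why 100 is kept here: the value 300 inflates the standard deviation so much that 100 ends up well inside three standard deviations. The sketch below assumes the sample standard deviation returned by std; the names z, TFmanual, and Bmanual are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Distance of each element from the mean, in units of standard deviations
z = abs(A - mean(A)) / std(A);

% Flag elements more than three standard deviations from the mean
TFmanual = z > 3;
Bmanual = A(~TFmanual);

% Compare with the built-in result
[B,TF] = rmoutliers(A,'mean');
```

This sensitivity to extreme values is the reason the 'mean' method is considered less robust than the median-based default.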
Create a vector of data containing a local outlier.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;
Create a time vector that corresponds to the data in A.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
Define outliers as points more than three local scaled MAD away from the local median within a sliding window. Find the locations of the outliers in A relative to the points in t with a window size of 5 hours, and remove them.
[B,TF] = rmoutliers(A,'movmedian',hours(5),'SamplePoints',t);
Plot the input data and the data with the outlier removed.
plot(t,A,'b.-',t(~TF),B,'r-')
legend('Input Data','Output Data')
Create a matrix containing two outliers, and remove the columns containing them.
A = magic(5);
A(4,4) = 500;
A(5,5) = 500;
A
A = 5×5
    17    24     1     8    15
    23     5     7    14    16
     4     6    13    20    22
    10    12    19   500     3
    11    18    25     2   500
B = rmoutliers(A,2)
B = 5×3
    17    24     1
    23     5     7
     4     6    13
    10    12    19
    11    18    25
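For comparison, the sketch below runs the same matrix through both operating dimensions. By default, outliers are detected in each column and the affected rows are dropped. With dim set to 2, outliers are detected in each row and the affected columns are dropped. The variable names are illustrative.

```matlab
A = magic(5);
A(4,4) = 500;
A(5,5) = 500;

% Default: detect outliers in each column, remove the rows that contain them
[Brows,TFrows] = rmoutliers(A);

% dim = 2: detect outliers in each row, remove the columns that contain them
[Bcols,TFcols] = rmoutliers(A,2);

size(Brows)   % rows 4 and 5 are removed
size(Bcols)   % columns 4 and 5 are removed
```

Which form is appropriate depends on whether observations are stored along rows or along columns.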
A— Input data
Input data, specified as a vector, matrix, table, or timetable.
method— Method for detecting outliers
Method for detecting outliers, specified as one of the following:
Method | Description
'median' (default) | Outliers are defined as elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
'mean' | Outliers are defined as elements more than three standard deviations from the mean. This method is faster but less robust than 'median'.
'quartiles' | Outliers are defined as elements more than 1.5 interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the data in A is not normally distributed.
'grubbs' | Outliers are detected using Grubbs's test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed.
'gesd' | Outliers are detected using the generalized extreme Studentized deviate test for outliers. This iterative method is similar to 'grubbs', but can perform better when there are multiple outliers masking each other.
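To make the median-based rule concrete, the following sketch computes the scaled MAD by hand and compares the result with rmoutliers. It assumes the scaling constant c = -1/(sqrt(2)*erfcinv(3/2)), which is approximately 1.4826; the names c, scaledMAD, and TFmanual are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Scaling constant (approximately 1.4826), assumed here
c = -1/(sqrt(2)*erfcinv(3/2));

% Scaled median absolute deviation of the data
scaledMAD = c*median(abs(A - median(A)));

% Flag elements more than three scaled MAD from the median
TFmanual = abs(A - median(A)) > 3*scaledMAD;

% Compare with the built-in median-based detection
[~,TF] = rmoutliers(A,'median');
```

For this vector the median is 59 and the unscaled MAD is 2, so the cutoff is roughly 8.9 away from the median, which is why both 100 and 300 are flagged.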
threshold— Percentile thresholds
Percentile thresholds, specified as a two-element row vector whose elements are in
the interval [0,100]. The first element indicates the lower percentile threshold and the
second element indicates the upper percentile threshold. For example, a threshold of
[10 90] defines outliers as points below the 10th percentile and
above the 90th percentile. The first element of
threshold must be
less than the second element.
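A short sketch of the percentile form, assuming the 'percentiles' method name is passed ahead of the threshold vector. The exact cutoffs depend on how the percentiles are interpolated, so no specific output is claimed beyond the removal of the largest value.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Treat values below the 10th or above the 90th percentile as outliers
[B,TF] = rmoutliers(A,'percentiles',[10 90]);

% The largest value lies above the 90th percentile, so it is removed
removed = A(TF);
```

Unlike the other methods, this rule is defined purely by rank, so it always trims the tails no matter how the data is distributed.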
movmethod— Moving method
Moving method for determining outliers, specified as one of the following:
'movmedian' | Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by window.
'movmean' | Outliers are defined as elements more than three local standard deviations from the local mean over a window length specified by window.
window— Window length
Window length, specified as a scalar or two-element vector.
When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.
When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.
When A is a timetable or 'SamplePoints' is specified as a datetime or duration vector, window must be of type duration, and the windows are computed relative to the sample points.
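The two window forms can be compared directly. In this sketch, a scalar window of 5 and the pair [2 2] describe the same neighborhood (the current element plus two elements on each side), so they are expected to flag the same points. The asymmetric pair [3 1] is included only to show the syntax.

```matlab
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;    % inject a local outlier

% Scalar window: current element plus 4 neighbors, centered
[~,TFscalar] = rmoutliers(A,'movmedian',5);

% Two-element window: 2 elements backward, 2 elements forward
[~,TFpair] = rmoutliers(A,'movmedian',[2 2]);

% Asymmetric window: 3 elements backward, 1 element forward
[Bback,TFback] = rmoutliers(A,'movmedian',[3 1]);
```

An asymmetric window can be useful when later samples should have less influence on the decision than earlier ones.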
dim— Operating dimension
Operating dimension, specified as 1 or 2. By default, rmoutliers
operates along the first dimension whose size does not equal 1.
Specify optional comma-separated pairs of Name,Value arguments. Name is
the argument name and Value is the corresponding value.
Name must appear inside quotes. You can specify several name and value
pair arguments in any order as Name1,Value1,...,NameN,ValueN.
'ThresholdFactor'— Detection threshold factor
Detection threshold factor, specified as the comma-separated pair consisting of
'ThresholdFactor' and a nonnegative scalar.
For the 'median' and 'movmedian' methods, the detection threshold factor replaces the number of scaled MAD, which is 3 by default.
For the 'mean' and 'movmean' methods, the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.
For the 'grubbs' and 'gesd' methods, the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.
For the 'quartiles' method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.
This name-value pair is not supported when the specified method is 'percentiles'.
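As an illustration of the threshold factor with the median-based method, the sketch below loosens the cutoff from the default 3 scaled MAD to 20 scaled MAD. The value 20 is arbitrary and chosen only so that 100 falls inside the cutoff while 300 stays outside it.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default factor of 3 scaled MAD flags both 100 and 300
[~,TFdefault] = rmoutliers(A,'median');

% A much larger factor (20 scaled MAD) keeps 100 and flags only 300
[Bloose,TFloose] = rmoutliers(A,'median','ThresholdFactor',20);
```

Raising the factor makes detection more conservative, and lowering it flags more points.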
'SamplePoints'— Sample points
Sample points, specified as the comma-separated pair consisting of
'SamplePoints' and a vector. The sample points represent the
location of the data in A, and must be sorted and contain unique
elements. Sample points do not need to be uniformly sampled. If A
is a timetable, then the default sample points vector is the vector of row times.
Otherwise, the default vector is [1 2 3 ...].
Moving windows are defined relative to the sample points. For example, if
t is a vector of times corresponding to the input data, then
rmoutliers(rand(1,10),'movmean',3,'SamplePoints',t) has a window
that represents the time interval between t(i)-1.5 and t(i)+1.5.
When the sample points vector has data type datetime or
duration, then the moving window length must have type duration.
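The sketch below illustrates the default sample points and a nonuniform alternative. Passing 1:numel(A) explicitly is expected to match the default behavior. The irregular vector s2 is a hypothetical example, built so that it is sorted and has unique elements.

```matlab
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;    % inject a local outlier

% Default sample points are [1 2 3 ...]
[~,TFdefault] = rmoutliers(A,'movmedian',5);

% Passing the same locations explicitly should give the same result
s = 1:numel(A);
[~,TFexplicit] = rmoutliers(A,'movmedian',5,'SamplePoints',s);

% Nonuniform (but sorted and unique) sample points are also accepted;
% the window length of 5 is then measured in the units of s2
s2 = cumsum(1 + 0.5*rand(1,numel(A)));   % hypothetical irregular spacing
[B2,TF2] = rmoutliers(A,'movmedian',5,'SamplePoints',s2);
```

With irregular spacing, the number of elements inside each window varies from point to point, so the flagged set can differ from the uniform case.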
'DataVariables'— Table variables
Table variables, specified as the comma-separated pair consisting of
'DataVariables' and a variable name, a cell array of variable
names, a numeric vector, a logical vector, or a function handle. The
'DataVariables' value indicates which columns of the input table
to detect outliers in, and can be one of the following:
A character vector specifying a single table variable name
A cell array of character vectors where each element is a table variable name
A vector of table variable indices
A logical vector whose elements each correspond to a table variable, where true includes the corresponding variable and false excludes it
A function handle that takes the table as input and returns a logical scalar
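A minimal sketch of restricting detection to one table variable. The table T and its variables Temp and Count are hypothetical, built so that each variable contains exactly one outlier under the default median rule.

```matlab
% Hypothetical table with one outlier in each variable
Temp  = [20; 21; 19; 22; 90; 20; 21];
Count = [5; 6; 500; 5; 6; 5; 6];
T = table(Temp,Count);

% Detect outliers in both variables: rows 3 and 5 are removed
Ball = rmoutliers(T);

% Detect outliers in Temp only: row 5 is removed, the 500 in Count stays
Btemp = rmoutliers(T,'DataVariables','Temp');

% Equivalent selections by index and by logical vector
Bidx  = rmoutliers(T,'DataVariables',1);
Bmask = rmoutliers(T,'DataVariables',[true false]);
```

Variables that are not selected are left out of the detection step, but the entire row is still removed when a selected variable contains an outlier.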
'MinNumOutliers'— Minimum outlier count
Minimum outlier count, specified as the comma-separated pair consisting of
'MinNumOutliers' and a positive scalar. The
'MinNumOutliers' value specifies the minimum number of outliers
required to remove a row or column. For example,
rmoutliers(A,'MinNumOutliers',3) removes a row of a matrix
A when there are 3 or more outliers detected in that row.
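The effect of the minimum count can be seen on a small matrix. In this sketch, the matrix is constructed so that row 4 contains two outliers and row 5 contains one; the values are illustrative.

```matlab
A = magic(5);
A(4,4) = 500;   % row 4 holds two outliers
A(4,5) = 500;
A(5,1) = 500;   % row 5 holds one outlier

% Default: any row with at least one outlier is removed
Bdefault = rmoutliers(A);

% Require at least two outliers before a row is removed
[Bmin2,TFmin2] = rmoutliers(A,'MinNumOutliers',2);
```

With the minimum set to 2, row 5 survives even though it still contains the value 500.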
'MaxNumOutliers'— Maximum outlier count
Maximum outlier count, for the 'gesd' method only, specified as
the comma-separated pair consisting of 'MaxNumOutliers' and a
positive scalar. The 'MaxNumOutliers' value specifies the maximum
number of outliers returned by the 'gesd' method. For example,
rmoutliers(A,'gesd','MaxNumOutliers',5) returns no more than five outliers.
The default value for 'MaxNumOutliers' is the integer nearest
to 10 percent of the number of elements in A. Setting a larger
value for the maximum number of outliers can ensure that all outliers are detected,
but at the cost of reduced computational efficiency.
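A brief sketch of capping the outlier count with the 'gesd' method. No particular detection result is claimed for the default call; the only property relied on is that the capped call returns at most one outlier.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Generalized ESD test with the default cap on the number of outliers
[~,TFgesd] = rmoutliers(A,'gesd');

% Cap the search at a single outlier
[Bone,TFone] = rmoutliers(A,'gesd','MaxNumOutliers',1);

nnz(TFgesd)   % number of outliers found with the default cap
nnz(TFone)    % at most 1
```

A cap that is too low can leave genuine outliers in the data, so it is worth comparing the counts before settling on a value.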
B— Data with outliers removed
Data with outliers removed, returned as a vector, matrix, table, or timetable. The
size of B depends on the number of removed rows or columns.
TF— Removed data indicator
Removed data indicator, returned as a logical vector. The value 1
(true) corresponds to rows or columns in A that
were removed. The value 0 (false) corresponds to unchanged rows or
columns. The orientation and size of TF depend on
A and the dimension of operation.
Usage notes and limitations:
The 'grubbs' and 'gesd' methods are not supported.
The 'movmedian' and 'movmean' methods do not support tall timetables.
The 'SamplePoints' and 'MaxNumOutliers' name-value pairs are not supported.
The value of 'DataVariables' cannot be a function handle.
rmoutliers(A,'quartiles',...) along the first dimension is only supported for tall column vectors.
rmoutliers(A,2) is not supported for tall tables.
For more information, see Tall Arrays.
Usage notes and limitations:
The 'movmean' and 'movmedian' methods do not support the
'SamplePoints' name-value pair argument.
Tables are not supported.