Remove outliers

version 1.1.0.0 (10.6 KB) by M Sohrabinia

Replaces outliers in a vector or matrix with NaN, based on the modified Thompson Tau method

Updated 10 Sep 2014

This function accepts a vector or matrix and detects outliers using the modified Thompson Tau method, which is based on the absolute deviation of each record from the mean of the entire vector/matrix. Detected outliers are replaced with NaN in the returned output.
A record is treated as an outlier when its absolute deviation from the mean exceeds tau*std, where tau is Thompson's Tau value for the number of records in the input vector (m) or matrix (m*n) and std is the standard deviation of the input vector/matrix. The mean, std and threshold (tau*std) are recalculated after each outlier is removed. A matrix input is converted to a vector before the outliers are detected, but the output is a matrix with the same m*n dimensions as the input. The indexes of the outliers are also returned: for a vector input they form a vector, and for a matrix input they form a two-column matrix of i,j indexes (see examples below).
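The iteration described above can be sketched roughly as follows. This is a minimal illustration, not the submitted code: the function name tau_outliers_sketch and the significance level alpha = 0.05 are assumptions, tau is computed with the usual modified Thompson Tau formula, and tinv requires the Statistics and Machine Learning Toolbox. The sketch returns linear indexes only.

```matlab
function [X, outliers_idx] = tau_outliers_sketch(X0, num_outliers)
% Hypothetical sketch of iterative modified Thompson Tau outlier removal.
    alpha = 0.05;                     % assumed significance level
    X = X0(:);                        % treat vector or matrix as one column
    outliers_idx = [];                % linear indexes, most extreme first
    for k = 1:num_outliers
        valid = ~isnan(X);            % ignore NaNs already in the data
        n = nnz(valid);
        if n < 3, break; end          % tau needs n-2 >= 1 degrees of freedom
        mu = mean(X(valid));
        sd = std(X(valid));           % sample standard deviation
        t = tinv(1 - alpha/2, n - 2); % Student t critical value
        tau = t*(n - 1) / (sqrt(n)*sqrt(n - 2 + t^2));
        [dmax, idx] = max(abs(X - mu));   % max skips NaN entries
        if dmax > tau*sd
            X(idx) = NaN;             % fill the outlier, keep the size
            outliers_idx(end+1) = idx;
        else
            break;                    % no remaining record exceeds tau*std
        end
    end
    X = reshape(X, size(X0));         % restore the original dimensions
end
```

For X0 = [2.0, 3.0, -50.5, 4.0, 109.0, 6.0] and num_outliers = 2, this sketch flags elements 5 and 3, in that order.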
--Inputs:
X0: input vector or matrix which contains outliers
num_outliers: maximum number of outliers to be removed from the input vector/matrix

--Outputs:
X: output vector/matrix with outliers (if any detected) turned to NaN
outliers_idx: the index(es) of any detected outliers. The more extreme outliers are detected first, so the first index refers to the most extreme outlier, and so forth.

--Theory of Thompson Tau method:
http://www.mne.psu.edu/me345/Lectures/Outliers.pdf
http://www.jstor.org/stable/2345543 (Thompson, 1985)

--Note: this function is an improvement based on Vince Petaccio, 2009: http://www.mathworks.com/matlabcentral/fileexchange/24885-remove-outliers

--Improvements:
1. Handling NaNs in inputs
2. Number of outliers to be removed is restricted to a user defined maximum to avoid uncontrolled shrinking of input dataset
3. Filling outliers by NaNs to preserve original dimensions of the input vector/matrix; this is crucial when the input variable is supposed to be used with another variable with the same size (e.g., for plotting, regression calculations, etc.)
4. Indexes of the outliers that have been detected and removed are returned so that the user knows which records have been removed. Since the indexes are ordered from the most extreme (negative or positive) to the least extreme outlier, the user will also know which point was the farthest outlier.
5. The syntax and algorithm have been significantly improved, including the logic for detecting outliers at the upper and lower limits. Detection is based solely on the absolute distance of each record from the central point. In Vince Petaccio, 2009, outliers were instead detected and removed sequentially, one from the upper extreme and the next from the lower extreme. This code first sorts the values by absolute distance from the center, so that the extreme values (upper or lower) end up on one side of the sorted vector (while preserving the original arrangement in the input vector), then removes the last element if it meets the outlier condition. This process continues until num_outliers is reached.
6. This function is enhanced to handle both vectors and matrices.
7. Valuable feedback from the user community (especially from user John D'Errico) helped to detect and fix some issues in the algorithm, which were related to exceptions in detecting special types of outliers (please refer to the comments section). These issues are now fixed. However, this code cannot find outliers in curvilinear fitted data (which was one of the issues raised), because the underlying logic of the (modified) Thompson's Tau method is deviation from the mean. Check the references given above or a good statistical reference if you are not familiar with the concept of outlier removal. One thing you should know is that no point is an outlier in absolute terms; it is always relative to the rest of the data.

% --Examples:
% -Example 1. Vector input:
X0=[2.0, 3.0, -50.5, 4.0, 109.0, 6.0]
[X, outliers_idx] = outliers(X0, 2) %call function with vector input

% X =
% 2, 3, NaN, 4, NaN, 6
%
% outliers_idx =
% 5, 3
%
% -Example 2. Matrix input:
X0= [2.0, 3.0, -50.5, 4.0, 109.0, 6.0;
5.3, 7.0, 80.0, 2.0, NaN, 1.0;
5.1, 2.7, 3.8, 2.0, 3.5, 21.0]
[X, outliers_idx] = outliers(X0, 4) %call function with matrix input

% X =
% 2, 3, NaN, 4, NaN, 6;
% 5.3, 7, NaN, 2, NaN, 1;
% 5.1, 2.7, 3.8, 2, 3.5, NaN
%
% outliers_idx =
% (i) (j) % annotated
% 1, 5;
% 2, 3;
% 1, 3;
% 3, 6;
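
The two-column i,j output of Example 2 can be related to linear indexes of the flattened matrix with ind2sub. The snippet below is only an illustration under that assumption; how the function builds the index matrix internally is not shown here, and the variable names are made up.

```matlab
% Linear (column-major) indexes of the four outliers in the 3-by-6 matrix
% of Example 2, listed in the order in which they are reported there.
lin_idx = [13; 8; 7; 18];

% Convert linear indexes to row/column subscripts.
[row, col] = ind2sub([3, 6], lin_idx);
outliers_idx = [row, col];   % rows: [1 5], [2 3], [1 3], [3 6]
```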

Rob Campbell

Yes, I confirm it finds the zero in John's example.

M Sohrabinia

@Rob: did you try the updated code?

Rob Campbell

In John's first example, there is no "nicely fitted curve" yet the outlier is not spotted.

M Sohrabinia

@John D'Errico: Thanks for your valuable feedback. It was very detailed and helpful, like the comments I sometimes get from good reviewers. However, I do not agree with your point about Jered's comment (what you interpret from that comment is not what I read from it, unless there is a magic meaning there that I don't get). Anyway, I have put more notes in the code which clarify that this code won't be able to find outliers on a nicely fitted curve. I have another code submitted here named 'regoutleirs' which might be useful for finding outliers in bivariate fitted data. It uses the residuals vector. Thanks again for your useful feedback.

John D'Errico

What Jered is trying to point out is indeed a serious flaw in this code.

Consider these two examples. In the first case, I'll create an outlier that has large absolute value.

x = rand(10,1) + 10;
x(3) = 20;
y = outliers(x)
y =
10.439
10.382
NaN
10.795
10.187
10.49
10.446
10.646
10.709
10.755

outliers finds the element that is inconsistent with the data, and replaces it with nan. No problem there.

However, suppose we create a point that is just as much of an outlier? (In fact, it is more significantly an outlier by some measures, if we compare it to the mean of the remainder of the data.)

x(3) = 0;
y = outliers(x)
y =
10.439
10.382
0
10.795
10.187
10.49
10.446
10.646
10.709
10.755

See that outliers fails to identify that point.

The problem is that this code does indeed look at the simple magnitude of the point. Points with large magnitude are flagged as an outlier.
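
The difference between ranking points by raw magnitude and ranking them by deviation from the mean can be seen on the values printed above. This is only a sketch of the two criteria, not the submission's code, and the variable names are made up.

```matlab
% Values from the second example above, with x(3) set to 0.
x = [10.439; 10.382; 0; 10.795; 10.187; 10.49; 10.446; 10.646; 10.709; 10.755];

% Ranking by raw magnitude picks the largest value, element 4 (10.795).
[~, by_magnitude] = max(abs(x));

% Ranking by absolute deviation from the mean picks element 3 (the zero).
[~, by_deviation] = max(abs(x - mean(x)));
```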

Another example might be appropriate. I'll create a nice smooth curve here.

t = linspace(0,2*pi,21)';
x = sin(t);
x(11)
ans =
1.2246e-16

So the 11th point is zero. (essentially.) I'll change it to 1, a value that is clearly not on the nice smooth curve.

x(11) = 1;
y = outliers(x)
y =
0
0.30902
0.58779
0.80902
0.95106
1
0.95106
0.80902
0.58779
0.30902
1
-0.30902
-0.58779
-0.80902
-0.95106
-1
-0.95106
-0.80902
-0.58779
-0.30902
-2.4493e-16

Oops, outliers fails to find a point that is obviously not on the curve. I'll make the outlier more obvious this next time. Clearly x here will be always positive, but I'll make one element fairly clearly an outlier, thus completely inconsistent with the remainder of the curve.

x = sin(linspace(0,2*pi,10)')+2;
x(5) = -2
x =
2
2.6428
2.9848
2.866
-2
1.658
1.134
1.0152
1.3572
2

y = outliers(x)
y =
2
2.6428
2.9848
2.866
-2
1.658
1.134
1.0152
1.3572
2

Again, outliers fails to find that point.

Jered has pointed out that this tool assumes the data is centered around zero, then it looks for points that have large absolute magnitude, and discards those points.

So while I like SOME ASPECTS about the documentation in this tool, and I like the way the code was written, the tool simply has a major flaw. I've not looked at the references, so I do not know if the basic underlying algorithm is just poor, or if the implementation is poor. This code fails to find outliers in your data, UNLESS those outliers are of a very specific class, thus large in magnitude compared to the rest of the data. The code fails to use any information about whether the curve is smooth or noisy, it merely looks for points that are large in absolute value compared to the rest.

There are other problems with this tool. The author has put a blank line into the code after the first line. So despite the voluminous set of help written, when you use help, you get essentially no help at all.

>> help outliers
outliers function: remove outliers based on Thompson Tau:

>>

Help looks for a CONTIGUOUS block of comments, and dumps that block out to the command line. A blank line in the middle cuts help off at the knees.
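
A minimal sketch of a help block that stays contiguous is shown below. The function name outliers_help_demo and its placeholder body are hypothetical; only the comment layout matters.

```matlab
function [X, outliers_idx] = outliers_help_demo(X0, num_outliers)
% OUTLIERS_HELP_DEMO  Placeholder showing a contiguous help block.
%   [X, outliers_idx] = outliers_help_demo(X0, num_outliers)
%
%   Lines that should look empty still start with "%", so help prints
%   the whole block up to the first truly blank line.

% This comment follows a blank line, so help does not display it.
X = X0;               % placeholder body, returns the input unchanged
outliers_idx = [];
end
```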

I wanted to give this code 3 stars because the author made some effort despite the MAJOR flaw with the algorithm, but the lack of usable help reduces my rating to 2 stars.

To the author, don't tell me I don't understand the code.

To any potential users, I would suggest you do as Jered said - be very careful about using this code. Code that finds only a very restricted class of outlier is not good at all, unless the only outliers you will ever see are those it can find.

M Sohrabinia

@Jered: I wonder why you are commenting and rating without understanding the code properly. Just take a look at the examples given above: do they have zero mean? What errors did you get using this code? You need to be specific if you really want to provide useful feedback.

Jered Wells

There are several errors in the code which assume the data have zero mean (an assumption which cannot be consistently held through algorithm operation). Could be greatly improved. Until then, users should be very critical of algorithm outputs.

M Sohrabinia

@Ahmad: At this stage, my function can only handle a column/row vector or a matrix. However, you can call this function three times, each time passing one matrix of your 3D matrix; it will remove outliers from that matrix according to the Thompson Tau rule. If you need to remove the outliers from the 3D matrix considering all values in one go, you would need to concatenate the 3D matrix into a single matrix, call this function, and then reconstruct the 3D matrix in the same order in which you concatenated it.
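
Both options can be sketched as follows. The data are made up, and remove_fn is a hypothetical stand-in (it returns its input unchanged) so that the snippet runs on its own; replacing it with @(M) outliers(M, 2) would call the submitted function instead.

```matlab
% Hypothetical stand-in for the outlier function; swap in
% @(M) outliers(M, 2) to use the submitted code.
remove_fn = @(M) M;

X3 = reshape(1:24, 2, 4, 3);           % example 2-by-4-by-3 array

% Option 1: one call per 2D page of the 3D array.
Y3_pages = nan(size(X3));
for k = 1:size(X3, 3)
    Y3_pages(:, :, k) = remove_fn(X3(:, :, k));
end

% Option 2: concatenate the pages side by side, call once,
% then restore the original 3D shape in the same order.
flat = reshape(X3, size(X3, 1), []);   % 2-by-12 matrix
Y3_all = reshape(remove_fn(flat), size(X3));
```

The two options are not equivalent: the first tests each page against its own mean and std, while the second tests every value against the statistics of the whole array.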

Is there a way to remove outliers from matrices in 3D space?
I.e., if we have a 1000*3 matrix in which each row represents a point in 3D space, and we want to remove the points that are outliers?

Tobin

Seems to work fine, easy to use, practical implementation. Good job! Thanks!
