Code covered by the BSD License  

Highlights from
Outlier removal v2

3.0

3.0 | 1 rating Rate this file 17 Downloads (last 30 days) File Size: 1.94 KB File ID: #37344

Outlier removal v2

by

 

29 Jun 2012 (Updated )

Rosner's many outlier test, vectorized.

| Watch this File

File Information
Description

Updated Jaco de Groot's outlier test (here: http://www.mathworks.com/matlabcentral/fileexchange/11106-outlier-removal). The code is now entirely vectorized and will probably run faster for large data sets.

There was also an issue in the earlier version where the pcrit and lambda were calculated incorrectly. This has been fixed. Here is his original description:

Grubb's outlier test can be used to remove one outlier (see deleteoutliers.m by Brett Schoelson). If you decide to remove this outlier, you might be tempted to run Grubbs' test again to see if there is a second outlier in your data. However, if you do this, you cannot use the same rejection criteria. Rosner has extended Grubb's method to detect several outliers in one dataset((Rosner, B., 1983. Percentage points for a generalized ESD
many-Outlier Procedure. Technometrics 25, 165?172.))

Rosner's many outliers test is implemented in this Matlab file. For good results, the dataset should be normally distributed after removal of the outliers (this can be tested for by "Pearson Chi Square Hypothesis Test" written by G. Levin). If the dataset is not normally distributed, usually the logarithm of the data will be.

I did not use the original reference by Rosner, but I used a paper that is available for download as PDF ( "Quality control of semi-continuous mobility size-fractionated
particle number concentration data", Atmospheric Environment 38 (2004) 3341?3348, Rong Chun Yu,*, Hee Wen Teh, Peter A. Jaques, Constantinos Sioutas,
John R. Froines)

Acknowledgements

Outlier Removal inspired this file.

Required Products Statistics Toolbox
MATLAB release MATLAB 7.13 (R2011b)
Tags for This File   Please login to tag files.
Please login to add a comment or rating.
Comments and Ratings (5)
20 Jul 2013 Romesh

The vectorizing ideas are nice, but I think there are two problems with the method

1. When calculating R the actual data should be used (i.e. mean and std of the data) but after you do

[ ys, is ] = sort( abs( y - ybar ));

you construct yy from ys and calculate R using nanmean and nanstd from yy, which is being applied to the magnitude of the difference between the original data and the mean rather than the data itself. You probably need something 'ys = y(is)' before constructing yy

2. GESD rejects at outliers all of the points up to the last R_i > lambda_i so if you have Ri > lambda_i being [1 1 0 1 0 0] you need to reject 4 points, not 3

Lastly, the messing around with -Inf is really ugly. You should use tril to get an array corresponding to the lower entries which you can then insert the NaNs from. For example,

yy = rot90(triu(repmat(ys, 1, k), k-n), 2)' + tril(nan(size(yy)),-1);

04 Jul 2012 Paul

Pasco:

Thanks for the advice! Somehow when I tested it with a bunch of 0s and 1s the -NaN thing seemed to work, dunno why. I've changed it at least temporarily to be -Inf. I'll look into modifying how the mean/std is done at some point, since this is clearly a placeholder.

04 Jul 2012 Pasco Alquim

Paul,

I didn't study your program so that I can propose alternatives but my comment goes to the fact that

1- There is no such a thing as a "-NaN". NaNs are not positive or negative. They are just ... NaN

2- yy(yy == -NaN) = 0; will never work because by definition all numbers are different from a NaN, including the NaN itself (just try (NaN == NaN)). But you can get what you want with

yy(isnan(yy)) = 0;

As regarding the usage of NaN instead of zeros, well perhaps you can

1- count them and remove them temporarily.
2- compute the means and stand deviation taking into account the true number of non-zero values (that's for sure what nanmean is doing when it ignores the NaNs)

03 Jul 2012 Paul

Pasco: Do you have an alternative way to do that? I'd be very interested to know, because I couldn't find anything good that still vectorizes the code.

The key is that I make an upper-triangular matrix there, and I need the mean and standard deviation of the upper triangular half. However, if I leave zeros in the bottom half, they are included in the mean. So the only way I saw to do it is to use nanmean and nanstd and set those zeros to NaN instead of 0.

The problem, of course, is that on the rare occasions where you have zeros in your original matrix, those will end up being set to NaN during the triangular matrix step.

Using logical indexing on the yy matrix doesn't preserve the matrix shape, and thus cannot be used. This seems like the fastest way, though I will admit that if you have -NaN in your data for some reason, it will cause problems (but honestly, wouldn't it cause problems anyway?)

03 Jul 2012 Pasco Alquim

A bit difficult to trust in a file that does things like:

ys(ys == 0) = -NaN;

or

yy(yy == -NaN) = 0;

Updates
02 Jul 2012

Fixed a bug where k = 1.

05 Jul 2012

Changed -NaN code to -Inf. Will eventually work out a more reasonable solution.

10 Apr 2013

Didn't put the original Jacco script in the acknowledgement box originally.

Contact us