version 1.0 (4.74 KB) by
M Sohrabinia

Removes outliers from X and Y variables based on regression residuals

This function accepts two vectors of variables on which a bivariate linear regression is to be performed, and removes outliers from both. Because outliers are detected from the regression residuals, only the records that lie farthest from the fitted regression line are detected and removed. If more than one outlier is requested, the residuals are recalculated before each further removal to avoid swamping and masking effects; the point that is then farthest from the line is removed, and so on. This approach distinguishes points that may be outliers in a single variable (X or Y) yet fit the regression line well from points that lie within the acceptable range of each individual variable (X, Y) yet appear as outliers once the two variables are fitted together. Outliers are detected in the residual vector by a subfunction, which is an enhancement of work by Vince Petaccio (2009) and is also available as a stand-alone function, "outliers", from the MATLAB File Exchange.
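The remove-then-refit loop described above can be sketched as follows. This is only an illustration, not the code of regoutliers: it assumes a plain least-squares line from polyfit and flags the point with the largest absolute residual, whereas the actual function delegates detection to its outliers subfunction. The data and the names nRemove and removedIdx are hypothetical.

```matlab
% Hypothetical sketch of iterative residual-based removal (not the author's code).
x = (1:20)';                 % made-up predictor values
y = 2*x + 1;                 % made-up response on an exact line
y(5)  = 60;                  % planted outlier
y(14) = -10;                 % planted outlier

nRemove    = 2;              % assumed number of points to remove
removedIdx = zeros(nRemove, 1);

for k = 1:nRemove
    valid = ~isnan(x) & ~isnan(y);         % skip points already removed
    p     = polyfit(x(valid), y(valid), 1); % refit the line on the remaining data
    res   = y - polyval(p, x);             % residuals (NaN where already removed)
    [~, worst] = max(abs(res));            % max ignores NaN entries
    removedIdx(k) = worst;
    x(worst) = NaN;                        % mark the record in both variables
    y(worst) = NaN;
end

disp(removedIdx')            % indexes flagged in this sketch
```

Refitting inside the loop is what guards against masking: a second outlier that was hidden by the first gets a fresh residual once the first has been taken out.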

--Inputs:

X0: vector of dependent variable in bivariate linear regression

Y0: vector of independent variable in bivariate linear regression

noutliers: number of outliers to remove (defaults to 1 if not provided)

plotOp: plotting option (0: don't plot, 1: plot). If 1, a scatterplot of the two input variables is drawn before and after each iteration of outlier removal (up to noutliers), with the plots arranged as subplots; if 0, only the calculations are performed.

--Outputs:

X: vector of dependent variable after removal of the outliers

Y: vector of independent variable after removal of the outliers

rSquares: a vector of r-square values calculated from the original inputs and after removal of each outlier

outliers_idx: indexes of the outliers. Note that the records at these indexes are set to NaN in the X and Y outputs.
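Because the removed records are returned as NaN rather than deleted, later steps usually need to drop them first. A minimal sketch, using made-up output vectors in place of a real call to regoutliers:

```matlab
% Hypothetical outputs, as if the record at index 3 had been flagged.
X = [1 2 NaN 4 5];
Y = [2.1 3.9 NaN 8.2 9.8];
outliers_idx = 3;

keep = ~isnan(X) & ~isnan(Y);   % logical mask of the surviving records
Xc = X(keep);                   % cleaned vectors, NaN entries dropped
Yc = Y(keep);

p = polyfit(Xc, Yc, 1);         % refit on the cleaned data
fprintf('slope = %.3f, intercept = %.3f\n', p(1), p(2));

% In this made-up case the NaN positions match the reported indexes.
assert(isequal(find(~keep), outliers_idx));
```

Keeping NaN placeholders means X and Y stay the same length as X0 and Y0, so outliers_idx can still be used against the original inputs.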

--Dependency:

the outliers subfunction, which is included in the code after the main function

--Example:

X0=10.2:0.2:30; %first vector

Y0=0.1:0.1:10; %second vector

idx=randi(length(Y0),4,1); %pick 4 random positions for the noise

Y0(idx)=randn(4,1)*10; %replace them with 4 random noise values

noutliers=3; %number of outliers to remove

plotOp=1; %0: don't plot, 1: plot

[X,Y,rSquares,outliers_idx]=regoutliers(X0,Y0,noutliers,plotOp);

rSquares %print r-square values for the original data and after each

%outlier removal; the values should increase progressively,

%otherwise the number of outliers to remove should be decreased

%or, in some cases, increased.

outliers_idx %print indexes of outliers in both input vectors
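The check on rSquares mentioned in the example comments can also be done programmatically. A minimal sketch, where the rSquares values are made up for illustration (four entries, matching the original data plus noutliers=3 removals) rather than taken from an actual run:

```matlab
% Hypothetical r-square values: original data, then after each of 3 removals.
rSquares = [0.62 0.81 0.93 0.97];

gain = diff(rSquares);          % change in r-square caused by each removal

if all(gain > 0)
    disp('r-square increased after every removal');
else
    firstBad = find(gain <= 0, 1);   % first removal that did not help
    fprintf('removal %d did not improve r-square; consider changing noutliers\n', firstBad);
end
```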