Exclude data from fit
outliers = excludedata(xdata,ydata,MethodName,MethodValue)
outliers = excludedata(xdata,ydata,MethodName,MethodValue) identifies data to be excluded from a fit using the specified MethodName and MethodValue. outliers is a logical vector, with 1 marking predictors (xdata) to exclude and 0 marking predictors to include. Supported MethodName and MethodValue pairs are given in the table below.
You can use the output outliers as an input to the fit function in the Exclude name-value pair argument. You can alternatively use the Exclude argument to specify excluded data as:
An expression describing a logical vector, e.g., x > 10.
A vector of integers indexing the points you want to exclude, e.g., [1 10 25].
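As a minimal sketch of these alternatives, the following passes each form to the Exclude argument of fit. The data and the variable names (x, y, fitFromExcludedata, and so on) are made up for illustration, and the 'poly1' library model is used only to keep the example short.

```matlab
% Synthetic data (illustrative only): an exact line with two corrupted points
x = (1:30)';
y = 2*x + 1;
y([24 25]) = 500;

% Form 1: logical vector returned by excludedata
outliers = excludedata(x,y,'indices',[24 25]);
fitFromExcludedata = fit(x,y,'poly1','Exclude',outliers);

% Form 2: expression that evaluates to a logical vector
fitFromExpression = fit(x,y,'poly1','Exclude',x > 20);

% Form 3: vector of integer indices of the points to drop
fitFromIndices = fit(x,y,'poly1','Exclude',[24 25]);
```

In all three calls the corrupted points are left out, so each fit should recover a slope close to 2 and an intercept close to 1.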
MethodName | MethodValue |
---|---|
'box' | A four-element vector specifying the edges of a closed box in the xy-plane, outside of which data is to be excluded from a fit. The vector has the form [xmin xmax ymin ymax]. |
'domain' | A two-element vector specifying the endpoints of a closed interval on the x-axis, outside of which data is to be excluded from a fit. The vector has the form [xmin xmax]. |
'indices' | A vector of indices specifying the data points to be excluded. |
'range' | A two-element vector specifying the endpoints of a closed interval on the y-axis, outside of which data is to be excluded from a fit. The vector has the form [ymin ymax]. |
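The four methods can be compared on a small data set. This is a sketch on synthetic data (xdata, ydata, and the out* names are assumptions for illustration, not part of the original example); the comments state which points each call is expected to flag, based on the descriptions in the table.

```matlab
% Synthetic data (illustrative only): ten points with one large y value
xdata = (1:10)';
ydata = [2 4 6 8 100 12 14 16 18 20]';

% 'domain': exclude points whose x lies outside [3 8]  -> points 1, 2, 9, 10
outDomain = excludedata(xdata,ydata,'domain',[3 8]);

% 'range': exclude points whose y lies outside [0 50]  -> point 5
outRange = excludedata(xdata,ydata,'range',[0 50]);

% 'box': exclude points outside [xmin xmax ymin ymax]  -> points 1, 2, 5, 9, 10
outBox = excludedata(xdata,ydata,'box',[3 8 0 50]);

% 'indices': exclude the listed points directly        -> points 1, 5
outIndices = excludedata(xdata,ydata,'indices',[1 5]);
```

Because the interval and the box are closed, points that lie exactly on an endpoint or edge (here x = 3 and x = 8) are kept.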
Load the vote counts and county names for the state of Florida from the 2000 U.S. presidential election:
load flvote2k
Use the vote counts for the two major party candidates, Bush and Gore, as predictors for the vote counts for third-party candidate Buchanan, and create a scatter plot of each:
plot(bush,buchanan,'rs')
hold on
plot(gore,buchanan,'bo')
legend('Bush data','Gore data')
Assume a model where a fixed proportion of Bush or Gore voters choose to vote for Buchanan:
f = fittype({'x'})
f =
     Linear model:
     f(a,x) = a*x
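The cell-array form of fittype defines a model that is linear in its coefficients, with one coefficient per term. As a brief aside, the sketch below contrasts the proportional model used here with a line that also has an intercept; fWithIntercept is a hypothetical name and is not used elsewhere in this example.

```matlab
f = fittype({'x'});                   % f(a,x) = a*x, the model used in this example
fWithIntercept = fittype({'x','1'});  % a*x + b, shown only for comparison

coeffnames(f)                % lists the single coefficient of the proportional model
coeffnames(fWithIntercept)   % lists one coefficient per term
```

Omitting the constant term forces the fitted line through the origin, which matches the assumption that Buchanan's count is a fixed proportion of the predictor.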
Exclude the data from absentee voters, who did not use the controversial "butterfly" ballot:
absentee = find(strcmp(counties,'Absentee Ballots'));
nobutterfly = excludedata(bush,buchanan,...
    'indices',absentee);
Perform a bisquare weights robust fit of the model to the two data sets, excluding absentee voters:
bushfit = fit(bush,buchanan,f,...
    'Exclude',nobutterfly,'Robust','on');
gorefit = fit(gore,buchanan,f,...
    'Exclude',nobutterfly,'Robust','on');
Robust fits give outliers a low weight, so large residuals from a robust fit can be used to identify the outliers:
figure
plot(bushfit,bush,buchanan,'rs','residuals')
hold on
plot(gorefit,gore,buchanan,'bo','residuals')
The residuals in the plot above can be computed as follows:
bushres = buchanan - feval(bushfit,bush);
goreres = buchanan - feval(gorefit,gore);
Large residuals can be identified as those outside the range [-500 500]:
bushoutliers = excludedata(bush,bushres,...
    'range',[-500 500]);
goreoutliers = excludedata(gore,goreres,...
    'range',[-500 500]);
The outliers for the two data sets correspond to the following counties:
counties(bushoutliers)
ans =
    'Miami-Dade'
    'Palm Beach'
counties(goreoutliers)
ans =
    'Broward'
    'Miami-Dade'
    'Palm Beach'
Miami-Dade and Broward counties correspond to the largest predictor values. Palm Beach county, the only county in the state to use the "butterfly" ballot, corresponds to the largest residual values.
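The same residual-based workflow can be tried without the flvote2k data. The following is a sketch on synthetic data, not part of the original example: the data, the injected outlier at point 7, and the names robustfit, flagged, and refit are all assumptions made for illustration. It also goes one step further than the example above by refitting with the flagged point excluded.

```matlab
% Synthetic data (illustrative only): y is roughly 3*x, with one gross outlier
x = (1:20)';
y = 3*x + 5*sin(x);        % deterministic wiggle standing in for noise
y(7) = y(7) + 1000;        % injected outlier

f = fittype({'x'});                     % proportional model f(a,x) = a*x
robustfit = fit(x,y,f,'Robust','on');   % robust fit, as in the example above

% Residuals from the robust fit; the outlier keeps a large residual
res = y - feval(robustfit,x);

% Flag residuals outside [-500 500]
flagged = excludedata(x,res,'range',[-500 500]);
find(flagged)

% Ordinary least-squares refit with the flagged point excluded
refit = fit(x,y,f,'Exclude',flagged);
```

The threshold of 500 is copied from the example above and only makes sense relative to the scale of the residuals; for other data it would need to be chosen by inspecting the residual plot.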