Clamp (winsorize) extremal values
This functionality does not run in MATLAB.
stats::winsorize([x_{1}, x_{2}, …], α)
stats::winsorize([[x_{11}, x_{12}, …], [x_{21}, x_{22}, …], …], α, i)
stats::winsorize(s, α, i)
stats::winsorize([x_{1}, x_{2}, …], α) returns a copy of [x_{1}, x_{2}, …] in which all entries smaller than the α quantile are replaced by that quantile and, likewise, all entries larger than the 1 − α quantile are replaced by the 1 − α quantile.
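As a minimal sketch of the list form, the call below clamps a small list. The data and the identifier x are hypothetical and chosen only for illustration. No output is shown, because the exact clamping values depend on how the quantiles are computed.

```mupad
// Hypothetical toy data: nine ordinary values and one extreme value.
x := [3, 1, 4, 1, 5, 9, 2, 6, 5, 300]:

// Clamp entries below the 1/10 quantile and above the 9/10 quantile.
stats::winsorize(x, 1/10);

// The call returns a copy, so x itself should be unchanged.
x
```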
stats::winsorize([[x_{11}, x_{12}, …], [x_{21}, x_{22}, …], …], α, i) and stats::winsorize(stats::sample([[x_{11}, x_{12}, …], [x_{21}, x_{22}, …], …]), α, i) perform the operations described above on the i-th entries of the input rows.
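The two row-based call forms might be used as sketched below. The rows are hypothetical (an index in the first column, a measurement in the second), and the column index 2 selects the measurement column for winsorizing.

```mupad
// Hypothetical rows: [index, measurement]; the fourth measurement is an outlier.
rows := [[1, 10.2], [2, 9.8], [3, 10.1], [4, 250.0], [5, 9.9]]:

// Nested-list form: winsorize on the second entry of each row.
stats::winsorize(rows, 1/5, 2);

// Sample form: the same data wrapped in a stats::sample object.
stats::winsorize(stats::sample(rows), 1/5, 2)
```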
Measurement data often contains "outliers," sample points that lie far outside the range containing the majority of the points. While expected both from theory and experience, these outliers tend to distort statistics such as the mean value, especially for small or medium-sized samples.
One of the standard methods for dealing with this problem on (real) continuous scales is clamping the outliers. stats::winsorize sets all data points below the α quantile to the α quantile and all data points above the 1 − α quantile to the 1 − α quantile. (This operation is named after its inventor, Charles P. Winsor.)
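The clamping step can be sketched by hand. This is an illustration only: it assumes the quantiles are those of the empirical distribution, as given by stats::empiricalQuantile, which may not match the quantile definition that stats::winsorize uses internally. The identifiers x, alpha, q, qLow, qHigh, and clamped are hypothetical.

```mupad
// Hypothetical toy data and cutoff parameter.
x := [3, 1, 4, 1, 5, 9, 2, 6, 5, 300]:
alpha := 1/10:

// Quantile function of the empirical distribution of x (assumed definition).
q := stats::empiricalQuantile(x):
qLow := q(alpha):
qHigh := q(1 - alpha):

// Clamp every entry into the interval [qLow, qHigh].
clamped := map(x, v -> max(qLow, min(qHigh, v)));

// Built-in function, for comparison with the hand-written version.
stats::winsorize(x, alpha)
```

Comparing the two results shows whether the assumed quantile definition agrees with the built-in function for this data.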
We create a normally distributed sample, slightly contaminated:
r := stats::normalRandom(0, 1, Seed=2):
data := [r() $ i = 1..300, 100*r() $ i = 1..2]:
The two extra points distort the data significantly:
plot(plot::Histogram2d(data, Cells=20))
Using either stats::winsorize or stats::cutoff removes this noise, and the histogram shows more detail:
plot(plot::Scene2d(plot::Histogram2d(stats::winsorize(data, 1/100), Cells=20)),
     plot::Scene2d(plot::Histogram2d(stats::cutoff(data, 1/100), Cells=20)))
With larger values of α, the difference between the two is easier to see:
plot(plot::Scene2d(plot::Histogram2d(stats::winsorize(data, 1/20), Cells=20)),
     plot::Scene2d(plot::Histogram2d(stats::cutoff(data, 1/20), Cells=20)))
Both stats::winsorize and stats::cutoff reduce the standard deviation of the sample. This effect is considerably stronger for stats::cutoff, though. Keeping in mind that the standard deviation of our random number generator is 1, we compute that of the data in its various forms:
stats::stdev(data), stats::stdev(stats::winsorize(data, 1/20)), stats::stdev(stats::cutoff(data, 1/20))
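The same comparison can be sketched for the mean, the statistic named earlier as being distorted by outliers. The block repeats the sample definition from above so that it runs on its own. No output is shown here, since the values depend on the generated random sample.

```mupad
// Contaminated sample, defined as in the example above.
r := stats::normalRandom(0, 1, Seed=2):
data := [r() $ i = 1..300, 100*r() $ i = 1..2]:

// Mean of the raw, winsorized, and cut-off data, in that order.
stats::mean(data),
stats::mean(stats::winsorize(data, 1/20)),
stats::mean(stats::cutoff(data, 1/20))
```

Since the generator has mean 0, values closer to 0 suggest that the influence of the two contaminating points has been reduced.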

x_{1}, x_{2}, …, x_{11}, x_{12}, …
The statistical data: arithmetical expressions. The data to filter on must be real-valued.

s
A sample, as created with stats::sample.

α
Cutoff parameter: a real-valued expression.

i
Column index: a positive integer. The nested list or the sample is winsorized on its i-th column.

stats::winsorize returns the input data with outliers replaced by the corresponding quantile values.