Divide the real line into equiprobable intervals
This functionality does not run in MATLAB.
stats::equiprobableCells is a utility function
for the classical chi-square test implemented by
stats::csGOFT. The call
stats::equiprobableCells(k, q) creates a list of k intervals ("cells") that
are equiprobable with respect to the statistical distribution corresponding
to the quantile function q.
The chi-square goodness-of-fit
test needs a cell partitioning of the real line to compare
the empirical frequencies of data falling into the cells with the
expected frequencies corresponding to a hypothesized statistical distribution.
It is recommended to use equiprobable cells in this test.
stats::equiprobableCells is a utility function to compute such a partitioning.
The cell boundaries b[i] of the returned cell partitioning [[b[0], b[1]], …, [b[k-1], b[k]]] are computed via b[i] = q(i/k). Mathematically, each cell [b[i-1], b[i]] corresponds to a semi-open interval (b[i-1], b[i]].
If q is the quantile function of a continuous statistical
distribution, all cells have the same cell probability 1/k.
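The boundary rule can be checked by hand. The following is a minimal sketch, assuming that the boundaries are the quantile values q(i/k) for i = 0, …, k; the name byHand is introduced here for illustration only:

```mupad
// Illustrative sketch: build the cells directly from the quantile function
// and compare them with the list returned by stats::equiprobableCells.
k := 4:
q := stats::normalQuantile(0, 1):
// Assumption: boundary i is the quantile value q(i/k).
byHand := [[q(float((i - 1)/k)), q(float(i/k))] $ i = 1..k];
cells := stats::equiprobableCells(k, q);
delete k, q, byHand, cells:
```

Both lists should describe the same partitioning, with the outermost boundaries given by the quantile values at 0 and 1.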
q can be a quantile procedure
provided by the MuPAD® stats library.
Quantile functions not provided by the library
can be implemented easily by the user. A user-defined quantile procedure q can
correspond to any statistical distribution. Quantile functions must
accept one numerical floating-point parameter x satisfying 0.0
≤ x ≤ 1.0. The call
q(x) must produce a real value. In addition,
the return values
-infinity and infinity are allowed.
Quantile functions must be monotonically increasing.
stats::equiprobableCells issues warnings if the computed quantile values
are neither real nor ±infinity,
or if these values do not increase monotonically.
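A user-defined quantile procedure that meets these requirements might look as follows. This is only a sketch: the name expQuantile is hypothetical, and the procedure implements the quantile function q(x) = -ln(1 - x) of the exponential distribution with rate 1, returning infinity at x = 1:

```mupad
// Hypothetical user-defined quantile procedure (not part of the stats library).
// Exponential distribution with rate 1: q(x) = -ln(1 - x) for 0 <= x < 1.
expQuantile := proc(x)
begin
  if x >= 1 then
    return(infinity)            // upper end of the support
  else
    return(-ln(1 - float(x)))   // real and monotonically increasing in x
  end_if
end_proc:
cells := stats::equiprobableCells(5, expQuantile);
delete expQuantile, cells:
```

The explicit branch for x = 1 avoids evaluating the logarithm at zero.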
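stats::equiprobableCells also accepts quantile
functions of discrete distributions (the examples below use
stats::empiricalQuantile and stats::binomialQuantile).
Note, however, that in general, there are no equiprobable cell
partitionings for discrete distributions. Consequently, equiprobability
of the cells returned by
stats::equiprobableCells can only be achieved approximately.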
In particular, it may happen for large k that b[i-1] coincides with b[i], i.e., the corresponding cell is empty. This will always happen when k exceeds the number of possible discrete values the random variable can attain.
In such a case, a warning is issued. Passing such a cell partitioning
to stats::csGOFT should be avoided, because an empty cell has the expected frequency zero.
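The situation can be provoked deliberately. In the following sketch, the parameters are chosen for illustration only: a binomial variable with trial parameter 2 attains just the three values 0, 1, 2, so a request for 5 cells necessarily produces coinciding boundaries:

```mupad
// Illustrative sketch: more cells than attainable discrete values.
// A binomial variable with n = 2 takes only the values 0, 1, 2.
quantile := stats::binomialQuantile(2, 1/2):
cells := stats::equiprobableCells(5, quantile);
delete quantile, cells:
```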
Further to the examples on this help page, see also the examples
on the help page of stats::csGOFT.
The function is sensitive to the environment variable
DIGITS, which determines the numerical working precision.
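A minimal sketch of this dependence, assuming that a larger value of DIGITS simply yields cell boundaries with more significant digits:

```mupad
// Illustrative sketch: the same partitioning at two working precisions.
q := stats::normalQuantile(0, 1):
DIGITS := 20:
stats::equiprobableCells(4, q);
delete DIGITS:   // restores the default working precision
stats::equiprobableCells(4, q);
delete q:
```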
We divide the real line into 4 intervals that are equiprobable with respect to the standard normal distribution:
k:= 4: q := stats::normalQuantile(0, 1): cells := stats::equiprobableCells(k, q)
We check equiprobability by applying the cumulative distribution function
stats::normalCDF(0, 1) to the cell boundaries:
cdf := stats::normalCDF(0, 1): p := map(cells, map, cdf)
The cell probabilities are given by the differences of the CDF values at the cell boundaries:
(p[i][2] - p[i][1]) $ i = 1..k
We use these cells for a chi-square test for normality of some random data:
r := stats::normalRandom(0, 1, Seed = 0): data := [r() $ i = 1..1000]: stats::csGOFT(data, cells, CDF = cdf)
Judging by the observed significance level, the data pass this test well. We experiment with other equiprobable cell partitionings:
for k in [20, 30, 40, 50] do cells := stats::equiprobableCells(k, q); print(stats::csGOFT(data, cells, CDF = cdf)); end_for:
delete k, cells, p, cdf, r, data:
We create a sample of 1000 random integers between 0 and 100:
SEED := 10^2: r := random(0 .. 100): data := [r() $ i = 1..1000]:
We construct an `equiprobable' cell partitioning of 10 cells using the (discrete) empirical distribution of the data. I.e., each of the following cells should contain approximately the same number of data from the random sample:
k := 10: quantile := stats::empiricalQuantile(data): cells := stats::equiprobableCells(k, quantile)
For discrete distributions, `equiprobability' can only be achieved approximately. We compute the cell probabilities with respect to the empirical cumulative distribution function (CDF), by subtracting the CDF value of the left boundary from the CDF value of the right boundary:
cdf := stats::empiricalCDF(data): map(cells, cell -> cdf(cell[2]) - cdf(cell[1]))
The actual empirical frequency of the data in each cell is the cell probability times the sample size (1000):
map(cells, cell -> 1000*(cdf(cell[2]) - cdf(cell[1])))
When computing the probability of the cell
[b[i-1], b[i]] via cdf(b[i])
- cdf(b[i-1]), the cell is regarded as the semi-open
interval (b[i-1], b[i]]. For this reason, the data points
with value 0 contained in
the sample are not counted, and the cell frequencies do not quite
add up to the sample size.
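The shortfall can be made visible by summing the cell frequencies. The following sketch repeats the setup of this example so that it can run on its own; the name freqs is introduced here for illustration only:

```mupad
// Illustrative sketch: sum of the cell frequencies for the empirical sample.
SEED := 10^2:
r := random(0 .. 100):
data := [r() $ i = 1..1000]:
cells := stats::equiprobableCells(10, stats::empiricalQuantile(data)):
cdf := stats::empiricalCDF(data):
// Frequency of each cell, regarded as the semi-open interval (left, right].
freqs := map(cells, cell -> 1000*(cdf(cell[2]) - cdf(cell[1]))):
_plus(op(freqs))
```

The difference between this sum and the sample size is the number of data points lying on the left boundary of the first cell.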
For the chi-square test,
this does not matter because it replaces the left boundary of the
first cell by
-infinity anyway. The data pass the test for a uniform distribution
at any significance level below the observed significance level reported by the following call:
stats::csGOFT(data, cells, CDF = stats::uniformCDF(0, 100))
[m, v] := [stats::mean(data), stats::variance(data)]; stats::csGOFT(data, cells, CDF = stats::normalCDF(m, v))
Judging by the observed significance level, the hypothesis of a normal distribution clearly has to be rejected.
delete r, data, k, quantile, cells, cdf, m, v:
We consider a binomial distribution with `trial parameter' n = 100 and `probability parameter' p = 1/2. It is the distribution of the number of successes in n = 100 independent Bernoulli experiments, each with success probability p. This random variable can attain the discrete values 0, 1, …, 100. We create a cell partitioning of 4 cells:
n := 100: p := 1/2: quantile := stats::binomialQuantile(n, p): cells := stats::equiprobableCells(4, quantile)
Because of discreteness, an exact equiprobable cell partitioning does not exist. We compute the expected cell frequencies in the same way as in the previous example:
cdf := stats::binomialCDF(n, p): map(cells, cell -> n*(cdf(cell[2]) - cdf(cell[1])))
We create a random sample and apply the chi-square test:
r := stats::binomialRandom(n, p, Seed = 123): data := [r() $ i = 1..100]: stats::csGOFT(data, cells, CDF = cdf)
The observed significance level is not small, i.e., the data pass the test well.
The `trial parameter' n = 100 is large enough for the binomial distribution to be approximated by a normal distribution with mean n p and variance n p (1 - p). The data pass the test for a normal distribution, too:
cdf := stats::normalCDF(n*p, n*p*(1 - p)): stats::csGOFT(data, cells, CDF = cdf)
We repeat the test with another cell partitioning:
quantile := stats::normalQuantile(n*p, n*p*(1 - p)): cells := stats::equiprobableCells(4, quantile)
stats::csGOFT(data, cells, CDF = cdf)
delete n, p, quantile, cells, cdf, r, data:
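We demonstrate user-defined quantile functions. We consider the following distribution of a random variable X supported on the interval [0, 1]: Pr(X ≤ x) = 0 for x < 0, Pr(X ≤ x) = x^2 for 0 ≤ x ≤ 1, and Pr(X ≤ x) = 1 for x > 1.
The quantile function q is given by q(x) = sqrt(x) for 0 ≤ x ≤ 1: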
quantile := x -> sqrt(x):
We test the hypothesis that the following data are distributed as defined above.
cells := stats::equiprobableCells(6, quantile)
data := [sqrt(frandom()) $ i = 1..10^3]:
cdf := proc(x)
begin
  if x <= 0 then return(0)
  elif x <= 1 then return(x^2)
  else return(1)
  end_if
end_proc:
stats::csGOFT(data, cells, CDF = cdf)
The data pass the test well. In fact, for a uniform deviate Y on
the interval [0, 1] (as produced
by frandom()), the cumulative distribution function of
X = sqrt(Y) is
indeed given by cdf.
delete quantile, cells, data, cdf:
k: The number of cells: a positive integer
q: A procedure representing a quantile
function of a statistical distribution. Typically, a quantile procedure of the stats library such as stats::normalQuantile(0, 1).
Return value: List of k "cells" [[b[0], b[1]], …, [b[k-1], b[k]]]
with floating-point values b[0], …, b[k].
This `cell partitioning' is suitable as input parameter for stats::csGOFT.