Divide the real line into equiprobable intervals
This functionality does not run in MATLAB.
stats::equiprobableCells(k, q, <NoWarning>)
stats::equiprobableCells is a utility function for the classical chi-square test implemented by stats::csGOFT. The call stats::equiprobableCells(k, q) creates a list of intervals ("cells") that are equiprobable with respect to the statistical distribution corresponding to the quantile function q.
The chi-square goodness-of-fit test needs a cell partitioning of the real line to compare the empirical frequencies of data falling into the cells with the expected frequencies corresponding to a hypothesized statistical distribution. It is recommended to use equiprobable cells in this test. stats::equiprobableCells is a utility function to compute such a partitioning.
The cell boundaries $b_i$ of the returned cell partitioning $[[b_0, b_1], \dots, [b_{k-1}, b_k]]$ are computed via $b_i = q(i/k)$ for $i = 0, \dots, k$. Mathematically, each cell $[b_{i-1}, b_i]$ corresponds to the semi-open interval $(b_{i-1}, b_i]$.
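The boundary rule can be reproduced by hand. The following is a minimal sketch, assuming the boundaries are the quantiles b_i = q(i/k) for i = 0, …, k. It is not the library's implementation and omits the checks and warnings that stats::equiprobableCells performs. The names b and myCells are illustrative only.

```mupad
k := 4:
q := stats::normalQuantile(0, 1):
// boundaries b_0, ..., b_k; MuPAD lists are indexed from 1, so b[i + 1] holds b_i
b := [q(float(i/k)) $ i = 0..k]:
// pair consecutive boundaries into the k cells [b_(i-1), b_i]
myCells := [[b[i], b[i + 1]] $ i = 1..k];
delete k, q, b, myCells:
```

Under this assumption, the first and last boundary are q(0.0) and q(1.0), which for the normal distribution are -infinity and infinity.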
If q is the quantile function of a continuous statistical distribution, all cells have the same cell probability $1/k$.
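To see why, write F for the cumulative distribution function of the distribution (a symbol introduced here for illustration only) and assume the boundaries are b_i = q(i/k). For a continuous distribution, F(q(x)) = x for 0 < x < 1, with the conventions F(q(0)) = 0 and F(q(1)) = 1, so the probability of a semi-open cell is:

```latex
\Pr(b_{i-1} < X \le b_i) = F(b_i) - F(b_{i-1})
  = F\bigl(q(\tfrac{i}{k})\bigr) - F\bigl(q(\tfrac{i-1}{k})\bigr)
  = \frac{i}{k} - \frac{i-1}{k} = \frac{1}{k}, \qquad i = 1, \dots, k.
```

The identity F(q(x)) = x is the step that can fail for discrete distributions.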
The function q can be a quantile procedure provided by the MuPAD® stats-library.
Quantile functions not provided by the stats-package can be implemented easily by the user. A user-defined quantile procedure q can correspond to any statistical distribution. Quantile functions must accept one numerical floating-point parameter x satisfying 0.0 ≤ x ≤ 1.0. The call q(x) must produce a real value. In particular, the return values q(0.0) = -infinity and q(1.0) = infinity are allowed.
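As an illustration of these requirements, the following sketch defines a quantile procedure for the exponential distribution with rate 1, whose quantile function is q(x) = -ln(1 - x). The name expQuantile is hypothetical and the choice of distribution is an assumption made for this example. The endpoint x = 1.0 is handled explicitly so that the procedure returns infinity instead of evaluating ln(0.0).

```mupad
// hypothetical user-defined quantile function: exponential distribution, rate 1
expQuantile := proc(x)
begin
  // right endpoint: the quantile is infinite
  if x >= 1.0 then
    return(infinity)
  end_if;
  // interior points and x = 0.0: inverse of the CDF 1 - exp(-t)
  return(-ln(1.0 - x))
end_proc:
cells := stats::equiprobableCells(5, expQuantile);
delete expQuantile, cells:
```

The procedure is monotonically increasing on [0, 1] and returns a real value or infinity for every admissible input, which is what the description above asks for.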
Quantile functions must be monotonically increasing. stats::equiprobableCells issues warnings if the computed quantile values are not real or ±infinity, or if these values do not increase monotonically.
stats::equiprobableCells also accepts quantile functions of discrete distributions such as stats::empiricalQuantile(data) or stats::binomialQuantile(n, p).
Note: In general, however, there are no equiprobable cell partitionings for discrete distributions. Consequently, equiprobability of the cells returned by stats::equiprobableCells is not guaranteed if q is not a continuous function.
In particular, for large k it may happen that $b_{i-1}$ coincides with $b_i$, i.e., the corresponding cell is empty. This always happens when k exceeds the number of possible discrete values the random variable can attain.
In such a case, a warning is issued. Passing such a cell partitioning to stats::csGOFT raises an error.
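A small sketch of this situation, using a binomial distribution with only three attainable values and asking for ten cells. The parameter values and the name emptyCells are chosen for illustration. The sketch assumes that an empty cell shows up as a pair with two identical boundaries.

```mupad
// binomial distribution with n = 2: only the values 0, 1, 2 are possible
q := stats::binomialQuantile(2, 1/2):
// k = 10 exceeds the number of attainable values, so a warning is expected
cells := stats::equiprobableCells(10, q):
// collect the empty cells, i.e. those whose two boundaries coincide
emptyCells := select(cells, c -> bool(c[1] = c[2]));
delete q, cells, emptyCells:
```

A non-empty result for emptyCells indicates that the partitioning should not be passed on to stats::csGOFT.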
Further to the examples on this help page, see also the examples on the help page of stats::csGOFT.
The function is sensitive to the environment variable DIGITS which determines the numerical working precision.
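A brief sketch of the effect of DIGITS, assuming the session starts with the default precision. The value 30 is an arbitrary choice for this example.

```mupad
q := stats::normalQuantile(0, 1):
// working precision as currently set (10 digits by default)
stats::equiprobableCells(3, q);
// the same partitioning computed with 30 significant digits
DIGITS := 30:
stats::equiprobableCells(3, q);
// restore the default precision
delete DIGITS, q:
```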
We divide the real line into 4 intervals that are equiprobable with respect to the standard normal distribution:
k:= 4: q := stats::normalQuantile(0, 1): cells := stats::equiprobableCells(k, q)
We check equiprobability by applying the function stats::normalCDF(0, 1) to the cell boundaries:
cdf := stats::normalCDF(0, 1): p := map(cells, map, cdf)
The cell probabilities are given by the differences of the CDF function applied to the cell boundaries:
(p[i][2] - p[i][1]) $ i = 1..k
We use these cells for a chi-square test for normality of some random data:
r := stats::normalRandom(0, 1, Seed = 0): data := [r() $ i = 1..1000]: stats::csGOFT(data, cells, CDF = cdf)
The observed significance level is not small, so the data pass this test well. We experiment with other equiprobable cell partitionings:
for k in [20, 30, 40, 50] do
  cells := stats::equiprobableCells(k, q);
  print(stats::csGOFT(data, cells, CDF = cdf));
end_for:
delete k, cells, p, cdf, r, data:
We create a sample of 1000 random integers between 0 and 100:
SEED := 10^2: r := random(0 .. 100): data := [r() $ i = 1..1000]:
We construct an `equiprobable' cell partitioning of 10 cells using the (discrete) empirical distribution of the data. I.e., each of the following cells should contain approximately the same number of data from the random sample:
k := 10: quantile := stats::empiricalQuantile(data): cells := stats::equiprobableCells(k, quantile)
For discrete distributions, `equiprobability' can only be achieved approximately. We compute the cell probabilities with respect to the empirical cumulative distribution function (CDF), by subtracting the CDF value of the left boundary from the CDF value of the right boundary:
cdf := stats::empiricalCDF(data): map(cells, cell -> cdf(cell[2]) - cdf(cell[1]))
The actual empirical frequency of the data in each cell is the cell probability times the sample size (1000):
map(cells, cell -> 1000*(cdf(cell[2]) - cdf(cell[1])))
When computing the probability of the cell [b[i-1], b[i]] via cdf(b[i]) - cdf(b[i-1]), the cell is regarded mathematically as the semi-open interval (b[i-1], b[i]]. For this reason, the data points equal to 0 contained in the sample are not counted, and the cell frequencies do not quite add up to the sample size.
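A short sketch of this check, assuming cells, cdf and data from the preceding steps are still defined and that the two boundaries of a cell are accessed as cell[1] and cell[2]. The names freqs, total and zeros are illustrative.

```mupad
// cell frequencies: sample size times cell probability
freqs := map(cells, cell -> 1000*(cdf(cell[2]) - cdf(cell[1]))):
// total over all cells; the sum telescopes to 1000*(cdf(b_k) - cdf(b_0))
total := _plus(op(freqs));
// number of data points equal to 0, which the semi-open first cell excludes
zeros := nops(select(data, x -> bool(x = 0)));
delete freqs, total, zeros:
```

If the explanation above applies, total and zeros together should account for the full sample size of 1000.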
For the chi-square test, this does not matter because it replaces the left boundary of the first cell by -infinity anyway. The observed significance level is large, so the data pass the test for a uniform distribution:
stats::csGOFT(data, cells, CDF = stats::uniformCDF(0, 100))
We also test the data against a normal distribution with the empirical mean and variance of the sample:
[m, v] := [stats::mean(data), stats::variance(data)]; stats::csGOFT(data, cells, CDF = stats::normalCDF(m, v))
The observed significance level is very small, so the hypothesis of a normal distribution clearly has to be rejected.
delete r, data, k, quantile, cells, cdf, m, v:
We consider a binomial distribution with `trial parameter' n = 100 and `probability parameter' p = 1/2. It is the distribution of the number of successes in n = 100 independent Bernoulli experiments, each with success probability p = 1/2. This random variable can attain the discrete values 0, 1, …, 100. We create a cell partitioning of 4 cells:
n := 100: p := 1/2: quantile := stats::binomialQuantile(n, p): cells := stats::equiprobableCells(4, quantile)
Because of discreteness, an exact equiprobable cell partitioning does not exist. We compute the expected cell frequencies in the same way as in the previous example:
cdf := stats::binomialCDF(n, p): map(cells, cell -> n*(cdf(cell[2]) - cdf(cell[1])))
We create a random sample and apply the chi-square test:
r := stats::binomialRandom(n, p, Seed = 123): data := [r() $ i = 1..100]: stats::csGOFT(data, cells, CDF = cdf)
The observed significance level is not small, i.e., the data pass the test well.
The `trial parameter' n = 100 is large enough for the binomial distribution to be approximated by a normal distribution with mean n p and variance n p (1 - p). The data pass the test for a normal distribution, too:
cdf := stats::normalCDF(n*p, n*p*(1 - p)): stats::csGOFT(data, cells, CDF = cdf)
We repeat the test with another cell partitioning:
quantile := stats::normalQuantile(n*p, n*p*(1 - p)): cells := stats::equiprobableCells(4, quantile)
stats::csGOFT(data, cells, CDF = cdf)
delete n, p, quantile, cells, cdf, r, data:
We demonstrate user-defined quantile functions. We consider the following distribution of a random variable X supported on the interval [0, 1]:
$\Pr(X \le x) = x^2$ for 0 ≤ x ≤ 1.
The quantile function q is given by $q(x) = \sqrt{x}$ for 0 ≤ x ≤ 1:
quantile := x -> sqrt(x):
We test the hypothesis that the following data are distributed as defined above.
cells := stats::equiprobableCells(6, quantile)
data := [sqrt(frandom()) $ i = 1..10^3]:
cdf := proc(x)
begin
  if x <= 0 then
    return(0)
  elif x <= 1 then
    return(x^2)
  else
    return(1)
  end_if
end_proc:
stats::csGOFT(data, cells, CDF = cdf)
The data pass the test well. In fact, for a uniform deviate Y on the interval [0, 1] (as produced by frandom), the cumulative distribution function of $X = \sqrt{Y}$ is indeed given by cdf.
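The claim follows in one line, assuming X = sqrt(Y) with Y uniform on [0, 1], so that Pr(Y ≤ y) = y for 0 ≤ y ≤ 1:

```latex
\Pr(X \le x) = \Pr\bigl(\sqrt{Y} \le x\bigr) = \Pr\bigl(Y \le x^2\bigr) = x^2,
\qquad 0 \le x \le 1.
```

Inverting this relation on [0, 1] gives the quantile function sqrt(x) used for the cell partitioning in this example.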
delete quantile, cells, data, cdf:
k: The number of cells: a positive integer
q: A procedure representing a quantile function of a statistical distribution. Typically, q is one of the quantile functions of the stats-package such as stats::normalQuantile(m, v), stats::empiricalQuantile(data) etc. Alternatively, user-defined procedures may be passed if the stats-package does not provide a suitable quantile function.
NoWarning: stats::equiprobableCells issues warnings if the computed cell partitioning is not suitable for stats::csGOFT. These warnings may be switched off with this option.
List of k "cells" $[[b_0, b_1], \dots, [b_{k-1}, b_k]]$ with floating-point values $b_0 \le b_1 \le \dots \le b_k$. This `cell partitioning' is suitable as input parameter for stats::csGOFT.
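For the NoWarning option, a minimal sketch of a call in which warnings would otherwise be expected. The binomial parameters and the number of cells are chosen for illustration only.

```mupad
// discrete distribution with only three attainable values (0, 1, 2)
q := stats::binomialQuantile(2, 1/2):
// ten cells cannot all be non-empty here; the option suppresses the warnings
cells := stats::equiprobableCells(10, q, NoWarning);
delete q, cells:
```

The option only silences the messages. A partitioning with empty cells is still not accepted by stats::csGOFT.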