Classical chi-square goodness-of-fit test
This functionality does not run in MATLAB.
stats::csGOFT(x_{1}, x_{2}, …, [[a_{1}, b_{1}], [a_{2}, b_{2}], …], CDF = f | PDF = f | PF = f)
stats::csGOFT([x_{1}, x_{2}, …], [[a_{1}, b_{1}], [a_{2}, b_{2}], …], CDF = f | PDF = f | PF = f)
stats::csGOFT(s, <c>, [[a_{1}, b_{1}], [a_{2}, b_{2}], …], CDF = f | PDF = f | PF = f)
stats::csGOFT(data, cells, CDF = f) applies the classical chi-square goodness-of-fit test for the null hypothesis: "the data are f-distributed".
The chi-square goodness-of-fit test divides the real line into k intervals ('the cells'). It counts the data x_{j} falling into each cell c_{i} and compares these 'empirical cell frequencies' with the 'expected cell frequencies' n p_{i}, where n is the sample size and p_{i} = Pr(a_{i} < x ≤ b_{i}) are the 'cell probabilities' of a random variable with the hypothesized distribution specified by X = f.
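The counting step can be sketched in a few lines. The following is an illustrative Python sketch (not MuPAD, and not the implementation used by stats::csGOFT); the names cell_counts and expected_frequencies and the sample values are hypothetical. It assumes semi-open cells (a, b], matching the definition of the cell probabilities above.

```python
import math

def cell_counts(data, cells):
    """Count how many values fall into each semi-open cell (a, b]."""
    counts = [0] * len(cells)
    for x in data:
        for i, (a, b) in enumerate(cells):
            if a < x <= b:          # left boundary excluded, right included
                counts[i] += 1
                break
    return counts

def expected_frequencies(n, probs):
    """Expected cell frequencies n * p_i for sample size n."""
    return [n * p for p in probs]

# Hypothetical sample and two cells (-inf, 0.5] and (0.5, inf)
data = [0.1, 0.5, 0.5, 0.9, 1.0]
cells = [(-math.inf, 0.5), (0.5, math.inf)]
observed = cell_counts(data, cells)
expected = expected_frequencies(len(data), [0.5, 0.5])
```

Note that the value 0.5 is counted in the first cell, because the right boundary belongs to the cell and the left boundary does not.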
All data x_{1}, x_{2}, etc. must be convertible to real floating-point numbers. The data do not have to be sorted on input: stats::csGOFT automatically converts the data to floats and sorts them internally.
External statistical data stored in an ASCII file can be imported into a MuPAD session via import::readdata. In particular, see Example 1 of the corresponding help page.
Finite cell boundaries a_{i}, b_{i} must be convertible to real floating-point numbers satisfying a_{1} < b_{1} ≤ a_{2} < b_{2} ≤ a_{3} < …. They define the semi-open intervals c_{i} = (a_{i}, b_{i}].
When the hypothesized distribution f is specified as a cumulative distribution function (CDF = f), the left boundary of the first cell and the right boundary of the last cell are ignored. They are replaced by −∞ and ∞, respectively, i.e., the cell partitioning (−∞, b_{1}], (a_{2}, b_{2}], …, (a_{k−1}, b_{k−1}], (a_{k}, ∞) is used internally.
The cells must be disjoint. Their union must cover the support of the distribution, i.e., the 'cell probabilities' p_{i} = Pr(a_{i} < x ≤ b_{i}) must add up to 1 for a random variable x with the hypothesized distribution given by f. For continuous distributions, adjacent cells with b_{1} = a_{2}, b_{2} = a_{3}, … are appropriate.
You may use a_{1} = −∞ and b_{k} = ∞ for distributions supported on the entire real line.
Note: The cells must be chosen such that no cell probability p_{i} vanishes! 
See the 'Background' section of this help page for recommendations on the cell partitioning. In particular, the use of equiprobable cells (with constant p_{i}) is recommended. For convenience, a utility function stats::equiprobableCells is provided to generate such cells. See Example 1, Example 3, and Example 4.
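The idea behind equiprobable cells can be illustrated with a quantile function: the inner cell boundaries are the quantiles at 1/k, 2/k, …, (k−1)/k. The following Python sketch is an assumption about how such cells can be built, not the code of stats::equiprobableCells; the helper name equiprobable_cells is hypothetical. Note that Python's NormalDist takes the standard deviation, whereas the MuPAD calls in this page take the variance.

```python
import math
from statistics import NormalDist

def equiprobable_cells(k, quantile):
    """Return k cells (a_i, b_i], each with probability 1/k under `quantile`."""
    inner = [quantile(i / k) for i in range(1, k)]   # quantiles at 1/k, ..., (k-1)/k
    bounds = [-math.inf] + inner + [math.inf]
    return [(bounds[i], bounds[i + 1]) for i in range(k)]

# Normal distribution with mean 15 and variance 2 (sigma = sqrt(2))
dist = NormalDist(mu=15.0, sigma=math.sqrt(2.0))
cells = equiprobable_cells(4, dist.inv_cdf)

# Check: every cell has probability 1/4 under the same distribution
probs = [dist.cdf(b) - dist.cdf(a) for a, b in cells]
```

With an even number of cells, the middle boundary is the median of the distribution (15 in this sketch).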
The distribution the data are tested for is specified by the equation X = f, where X is one of the flags CDF, PDF or PF.
For efficiency, it is recommended to specify a cumulative distribution function (CDF = f).
The function f can be a procedure provided by the MuPAD stats library. Specifications such as CDF = stats::normalCDF(m, v) or CDF = stats::poissonCDF(m) with suitable numerical values of m, v are possible and recommended.
Distributions that are not provided by the stats package can be implemented easily by the user. A user-defined procedure f can implement any distribution function. In the CDF case, stats::csGOFT calls f with the boundary values a_{i}, b_{i} of the cells to compute the cell probabilities via p_{i} = f(b_{i}) − f(a_{i}) (automatically setting f(a_{1}) = 0 and f(b_{k}) = 1). The function f must return a numerical real value between 0 and 1. See Example 5 and Example 6.
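The computation of the cell probabilities from a CDF, including the replacement of the outermost boundaries, can be sketched as follows. This is an illustrative Python sketch under the rules stated above, not the library's code; cell_probabilities and the example CDF are hypothetical names.

```python
def cell_probabilities(cells, cdf):
    """p_i = cdf(b_i) - cdf(a_i), with cdf(a_1) taken as 0 and cdf(b_k) as 1."""
    k = len(cells)
    probs = []
    for i, (a, b) in enumerate(cells):
        lower = 0.0 if i == 0 else cdf(a)        # leftmost boundary acts as -infinity
        upper = 1.0 if i == k - 1 else cdf(b)    # rightmost boundary acts as +infinity
        probs.append(upper - lower)
    return probs

def cdf(x):
    """Hypothetical user-defined CDF of a variable supported on [0, 1]."""
    if x <= 0:
        return 0.0
    if x <= 1:
        return x ** 2
    return 1.0

cells = [((i - 1) / 4, i / 4) for i in range(1, 5)]   # equidistant cells on [0, 1]
probs = cell_probabilities(cells, cdf)
```

Because the outer boundaries are never passed to the CDF, the probabilities always add up to 1, even if the first and last cells given by the user do not reach the ends of the support.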
Alternatively, the function f can be specified by a univariate arithmetical expression g(x) depending on a symbolic variable x. It is interpreted as the function x → g(x). Cf. Example 6.
See the 'Background' section of this help page for further information on the specification of the distribution via CDF = f, PDF = f or PF = f.
The call stats::csGOFT(data, cells, X = f) returns the list [PValue = p, StatValue = s, MinimalExpectedCellFrequency = m]:
s is the observed value of the chi-square statistic s = Σ_{i=1}^{k} (y_{i} − n p_{i})^{2} / (n p_{i}), where n is the sample size, k is the number of cells, y_{i} is the observed cell frequency of the data (i.e., y_{i} is the number of data x_{j} falling into the cell c_{i}), and p_{i} is the cell probability corresponding to the hypothesized distribution f.
p is the observed significance level of the chi-square statistic with k − 1 degrees of freedom, i.e., p = 1 − stats::chisquareCDF(k − 1)(s).
m is the minimum of the expected cell frequencies n p_{i}. This information is provided by the test to make sure that the boundary conditions for a "reasonable" cell partitioning are met (see the "Background" section of this help page).
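To make the three returned values concrete, here is a hedged Python sketch (not MuPAD and not the implementation of stats::csGOFT) that computes them from observed cell frequencies and cell probabilities. The function names and the sample numbers are hypothetical. The tail probability of the chi-square distribution is evaluated with the standard recurrence for the regularized upper incomplete gamma function, which is exact for integer degrees of freedom.

```python
import math

def chi2_sf(s, df):
    """Pr(S > s) for a chi-square variable S with integer df >= 1."""
    t = s / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)               # Q(1, t) = exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))    # Q(1/2, t) = erfc(sqrt(t))
    while a < df / 2.0:
        # Q(a + 1, t) = Q(a, t) + t^a * exp(-t) / Gamma(a + 1)
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_square_gof(observed, probs):
    """Return (p-value, statistic, minimal expected cell frequency)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = sum((y - e) ** 2 / e for y, e in zip(observed, expected))
    return chi2_sf(stat, len(observed) - 1), stat, min(expected)

# Hypothetical cell frequencies for 5 equiprobable cells, n = 100
p_value, stat, min_expected = chi_square_gof([18, 22, 20, 25, 15], [0.2] * 5)
```

In this made-up sample the statistic is 58/20 = 2.9 with 4 degrees of freedom, which gives a p-value well above the usual significance levels.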
The most relevant information returned by stats::csGOFT is the observed significance level PValue = p. It has to be interpreted in the following way: Under the null hypothesis, the chi-square statistic S is approximately chi-square distributed with k − 1 degrees of freedom (for large samples). Under the null hypothesis, the probability p = Pr(S > s) should not be small, where s is the value of the statistic attained by the sample.
Specifically, p = Pr(S > s) ≥ α should hold for a given significance level 0 < α < 1. If this condition is violated, the hypothesis may be rejected at level α.
Thus, if the PValue (observed significance level) p = Pr(S > s) satisfies p < α, the sample leading to the observed value s of the statistic S represents an unlikely event, and the null hypothesis may be rejected at level α.
On the other hand, values of p close to 1 should raise suspicion about the randomness of the data: they indicate a fit that is too good.
The function is sensitive to the environment variable DIGITS which determines the numerical working precision.
We consider random data that should be normally distributed with mean 15 and variance 2:
f := stats::normalRandom(15, 2, Seed = 0):
data := [f() $ i = 1..1000]:
According to the recommendations in the 'Background' section of this help page, the number of cells should be approximately 2 n^{2/5}, where n = 1000 is the sample size.
We wish to use 32 cells that are equiprobable with respect to the hypothesized normal distribution. We estimate the mean m and the variance v of the data:
[m, v] := [stats::mean(data), stats::variance(data, Sample)]
The utility function stats::equiprobableCells is used to compute an equiprobable cell partitioning via the quantile function of the normal distribution with the empirical parameters:
cells := stats::equiprobableCells(32, stats::normalQuantile(m, v)):
stats::csGOFT(data, cells, CDF = stats::normalCDF(m, v))
The observed significance level attained by the sample is not small. Hence, one should not reject the hypothesis that the sample is normally distributed with the estimated mean m and variance v.
In the following, we contaminate the sample by appending some uniformly distributed numbers. A new equiprobable cell partitioning appropriate for the new data is computed:
r := stats::uniformRandom(10, 20, Seed = 0):
data := append(data, r() $ 40):
[m, v] := [stats::mean(data), stats::variance(data, Sample)]:
k := round(2*nops(data)^(2/5)):
cells := stats::equiprobableCells(k, stats::normalQuantile(m, v)):
stats::csGOFT(data, cells, CDF = stats::normalCDF(m, v))
The contaminated data may be rejected as a normally distributed sample at levels as small as the PValue returned by this call.
delete f, data, m, v, k, cells, r:
We create a sample of random data that should be binomially distributed with trial parameter 70 and probability parameter 1/2:
r := stats::binomialRandom(70, 1/2, Seed = 123):
data := [r() $ k = 1..1000]:
With the expectation value of 35 and the standard deviation of sqrt(70 · 1/2 · 1/2) ≈ 4.18 of this distribution, we expect most of the data to have values between 30 and 40. Thus, a cell partitioning consisting of 12 cells corresponding to the intervals [0, 30], [30, 31], …, [39, 40], [40, 70] should be appropriate. Note that all cells are interpreted as the intervals (a_{i}, b_{i}], i.e., the left boundary is not included in the interval. Strictly speaking, the value 0 is not covered by these cells. However, with a CDF specification, stats::csGOFT ignores the leftmost boundary and replaces it by −∞. Thus, the union of the cells does cover all integers 0, …, 70 that can be attained by the hypothesized binomial distribution with 'trial parameter' 70:
cells := [[0, 30], [i, i + 1] $ i = 30..39, [40, 70]]
We apply the χ^{2} test with various specifications of the binomial distribution. They all produce the same result. However, the first call using a CDF specification is the most efficient (fastest) call:
stats::csGOFT(data, cells, CDF = stats::binomialCDF(70, 1/2));
stats::csGOFT(data, cells, PF = stats::binomialPF(70, 1/2));
f := binomial(70, x)*(1/2)^x*(1/2)^(70 - x):
stats::csGOFT(data, cells, PF = f)
The observed significance level indicates that the data pass the test well.
Next, we contaminate the sample by appending the value 35 forty times:
data := data . [35 $ 40]:
stats::csGOFT(data, cells, CDF = stats::binomialCDF(70, 1/2));
Now, the data may be rejected as a binomial sample with the specified parameters at levels as small as the PValue returned by this call.
delete r, data, cells, f:
We test data that purport to be a sample of beta distributed numbers with scale parameters 3 and 2. Since beta deviates attain values between 0 and 1, we choose an equidistant cell partitioning of the interval [0, 1] consisting of 10 cells. Various equivalent calls to stats::csGOFT are demonstrated:
r := stats::betaRandom(3, 2, Seed = 1):
data := [r() $ i = 1..100]:
cells := [[(i - 1)/10, i/10] $ i = 1..10]:
stats::csGOFT(data, cells, CDF = stats::betaCDF(3, 2));
stats::csGOFT(data, cells, CDF = (x -> stats::betaCDF(3, 2)(x)))
Alternatively, the beta distribution may be passed by a PDF specification. This, however, is less efficient than the CDF specification used before:
stats::csGOFT(data, cells, PDF = stats::betaPDF(3, 2));
stats::csGOFT(data, cells, PDF = (x -> stats::betaPDF(3, 2)(x)));
The observed significance level is not small. Hence, this test does not indicate that the data should be rejected as a beta distributed sample with the specified parameters. Note, however, that the minimal expected cell frequency given by the third element of the returned list is rather small. This indicates that the cell partitioning is not well chosen. We investigate the expected cell frequencies by computing n p_{i} = n (f(b_{i}) − f(a_{i})), where f is the cumulative distribution function of the beta distribution and n is the sample size:
f := stats::betaCDF(3, 2):
map(cells, cell -> 100*(f(cell[2]) - f(cell[1])))
These values show that the first two or three cells should be joined to a single cell. We modify the cell partitioning by joining the first three and the last two cells:
cells := [[0, 3/10], [(i - 1)/10, i/10] $ i = 4..8, [8/10, 1]]
For this cell partitioning, the expected frequencies in a random sample of size 100 are sufficiently large for all cells:
map(cells, cell -> 100*(f(cell[2]) - f(cell[1])))
We apply another χ^{2} test with this improved partitioning:
stats::csGOFT(data, cells, CDF = f)
Again, the observed significance level is not small: the test does not give any hint that the data are not beta distributed with the specified parameters.
Now, we test whether the data can be regarded as being normally distributed. First, we estimate the parameters (mean and variance) required for the normal distribution:
[m, v] := [stats::mean(data), stats::variance(data, Sample)]
The cell partitioning used before was a partitioning of the interval [0, 1], because beta deviates attain values in this interval. Now we construct a partitioning of 7 equiprobable cells using the quantile function of the normal distribution:
k := 7:
cells := stats::equiprobableCells(7, stats::normalQuantile(m, v))
Indeed, these cells are equiprobable:
f := stats::normalCDF(m, v):
map(cells, cell -> f(cell[2]) - f(cell[1]))
We test for normality with the estimated mean and variance:
stats::csGOFT(data, cells, CDF = f)
With the observed significance level returned by this call, the data should not be rejected as a normally distributed sample.
We note that the nonparametric Shapiro-Wilk test implemented in stats::swGOFT does detect non-normality of the sample:
stats::swGOFT(data)
With the observed significance level returned by stats::swGOFT, normality can be rejected at correspondingly low levels.
delete r, data, cells, f, m, v, k, boundaries:
We demonstrate the use of samples of type stats::sample. We create a sample consisting of one string column and two non-string columns:
s := stats::sample(
  [["1996", 1242, 156], ["1997", 1353, 162], ["1998", 1142, 168],
   ["1999", 1201, 182], ["2001", 1201, 190], ["2001", 1201, 190],
   ["2001", 1201, 205], ["2001", 1201, 210], ["2001", 1201, 220],
   ["2001", 1201, 213], ["2001", 1201, 236], ["2001", 1201, 260],
   ["2001", 1201, 198], ["2001", 1201, 236], ["2001", 1201, 245],
   ["2001", 1201, 188], ["2001", 1201, 177], ["2001", 1201, 233],
   ["2001", 1201, 270]])
"1996"  1242  156
"1997"  1353  162
"1998"  1142  168
"1999"  1201  182
"2001"  1201  190
"2001"  1201  190
"2001"  1201  205
"2001"  1201  210
"2001"  1201  220
"2001"  1201  213
"2001"  1201  236
"2001"  1201  260
"2001"  1201  198
"2001"  1201  236
"2001"  1201  245
"2001"  1201  188
"2001"  1201  177
"2001"  1201  233
"2001"  1201  270
We consider the data in the third column. The mean and the variance of these data are computed:
[m, v] := float([stats::mean(s, 3), stats::variance(s, 3, Sample)])
We check whether the data of the third column are normally distributed with the empirical mean and variance computed above. We compute an appropriate cell partitioning in the same way as explained in Example 1:
samplesize := s::dom::size(s):
k := round(2*samplesize^(2/5)):
cells := stats::equiprobableCells(k, stats::normalQuantile(m, v)):
stats::csGOFT(s, 3, cells, CDF = stats::normalCDF(m, v))
Thus, the data pass the test.
delete s, m, v, samplesize, k, cells:
We demonstrate how user-defined distribution functions can be used. A die is rolled 60 times. The following frequencies of the scores 1, 2, …, 6 are observed:
score     | 1 |  2 | 3 |  4 | 5 | 6
----------+---+----+---+----+---+---
frequency | 7 | 16 | 8 | 17 | 3 | 9
We test the null hypothesis that the die is fair. Under this hypothesis, the variable X given by the score of a single roll attains the values 1 through 6 with constant probability 1/6. Presently, the stats package does not provide a discrete uniform distribution, so we implement a corresponding cumulative discrete distribution function f:
f := proc(x)
begin
  if x < 0 then 0
  elif x <= 6 then trunc(x)/6
  else 1
  end_if;
end_proc:
We create the data representing the 60 rolls:
data := [ 1 $ 7, 2 $ 16, 3 $ 8, 4 $ 17, 5 $ 3, 6 $ 9]:
We choose a collection of cells, each of which contains exactly one of the integers 1, …, 6:
cells := [[i - 1/2, i + 1/2] $ i = 1..6]
stats::csGOFT(data, cells, CDF = f)
At a significance level as small as the PValue returned by this call, the null hypothesis 'the die is fair' should be rejected.
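The value of the statistic in this example can be checked by hand. The following Python sketch (an independent illustration, not MuPAD output) mirrors the step CDF and the cells used above; the name die_cdf is hypothetical. Each cell has probability 1/6, so every expected cell frequency is 60/6 = 10.

```python
import math

def die_cdf(x):
    """Cumulative distribution function of a fair die, mirroring the procedure f above."""
    if x < 0:
        return 0.0
    if x <= 6:
        return math.trunc(x) / 6
    return 1.0

observed = [7, 16, 8, 17, 3, 9]                     # frequencies of the scores 1..6
n = sum(observed)                                   # 60 rolls
cells = [(i - 0.5, i + 0.5) for i in range(1, 7)]   # one integer per cell

# Cell probabilities from the CDF evaluated at the cell boundaries
probs = [die_cdf(b) - die_cdf(a) for a, b in cells]

# Chi-square statistic: sum of (y_i - n p_i)^2 / (n p_i)
statistic = sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))
```

The squared deviations from 10 are 9, 36, 4, 49, 49 and 1, so the statistic equals 148/10 = 14.8, to be compared with a chi-square distribution with 5 degrees of freedom.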
delete f, data, cells:
We give a further demonstration of user-defined distribution functions. The following procedure represents the cumulative distribution function of a variable X supported on the interval [0, 1]. It will be called with values from the cell boundaries and must return numerical values between 0 and 1:
f := proc(x)
begin
  if x <= 0 then return(0)
  elif x <= 1 then return(x^2)
  else return(1)
  end_if
end_proc:
We test the hypothesis that the following data are f-distributed. The cells form an equidistant partitioning of the interval [0, 1]:
data := [sqrt(frandom()) $ i = 1..10^3]:
k := 10:
cells := [[(i - 1)/k, i/k] $ i = 1..k]:
stats::csGOFT(data, cells, CDF = f)
The test does not disqualify the sample as being f-distributed. Indeed, for a uniform deviate Y on the interval [0, 1] (as produced by frandom), the cumulative distribution function of X = sqrt(Y) is given by f, since Pr(sqrt(Y) ≤ x) = Pr(Y ≤ x^2) = x^2 for 0 ≤ x ≤ 1.
We note that the previous function yields the correct CDF values for all real arguments. The chosen cell partitioning indicates that only values from the interval [0, 1] are considered. Since stats::csGOFT just evaluates the CDF on the cell boundaries to compute the cell probability of the cell (a, b] by f(b) − f(a), it suffices to restrict f to the interval [0, 1]. Hence, for the chosen cells, the symbolic expression f = x^2 can also be used to specify the distribution:
stats::csGOFT(data, cells, CDF = x^2)
delete f, data, k, cells:

x_{1}, x_{2}, …: The statistical data: real numerical values

s: A sample of domain type stats::sample

c: An integer representing a column index of the sample s

a_{1}, b_{1}, a_{2}, b_{2}, …: Cell boundaries: real numbers satisfying a_{1} < b_{1} ≤ a_{2} < b_{2} ≤ a_{3} < …. Also −∞ and ∞ are admitted as cell boundaries. At least 3 cells have to be specified.

f: A procedure representing the hypothesized distribution: either a cumulative distribution function (CDF), a probability density function (PDF), or a discrete probability function (PF)

CDF, PDF, PF: This determines how the procedure f is interpreted
a list of three equations [PValue = p, StatValue = s, MinimalExpectedCellFrequency = m] with floating-point values p, s, m. See the "Details" section for the interpretation of these values.
In R.B. D'Agostino and M.A. Stephens, "Goodness-of-Fit Techniques", Marcel Dekker, 1986, pp. 70-71, one finds the following recommendations for choosing the cell partitioning:
The number of cells used should be approximately 2 n^{2/5}, where n is the sample size.
The cells should have equal probabilities p_{i} under the hypothesized distribution.
With equiprobable cells, the average of the expected cell frequencies n p_{i} should be at least 1 when testing at the significance level α = 0.05. For α = 0.01, the average expected cell frequency should be at least 2. When cells are not approximately equiprobable, the average expected cell frequency for the significance levels above should be doubled. For example, the average expected cell frequency at the significance level α = 0.01 should be at least 4.
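The recommendations above can be turned into a small check. The following Python sketch is only an illustration of these rules of thumb; the function names are hypothetical, and only the two significance levels mentioned in the text are covered.

```python
def recommended_cell_count(n):
    """Approximate number of cells, 2 * n^(2/5), rounded to an integer."""
    return round(2 * n ** (2 / 5))

def minimal_average_expected_frequency(alpha, equiprobable=True):
    """Recommended lower bound for the average of the expected cell frequencies."""
    base = {0.05: 1, 0.01: 2}[alpha]     # only the levels named in the text
    return base if equiprobable else 2 * base

def partitioning_is_reasonable(n, k, alpha, equiprobable=True):
    """The cell probabilities sum to 1, so the average of n * p_i is n / k."""
    return n / k >= minimal_average_expected_frequency(alpha, equiprobable)

k = recommended_cell_count(1000)          # sample size of Example 1
ok = partitioning_is_reasonable(1000, k, 0.05)
```

For n = 1000 this gives 32 cells, which matches the number of cells used in Example 1.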
The distribution function f passed to stats::csGOFT via X = f is only used to compute the cell probabilities p_{i} = Pr(a_{i} < x ≤ b_{i}) of the cells c_{i} = (a_{i}, b_{i}].
A cumulative distribution function f specified by CDF = f is used to compute the cell probabilities via p_{i} = f(b_{i}) − f(a_{i}).
A probability density function f specified via PDF = f is used to compute the cell probabilities via numerical integration: p_{i} = ∫_{a_{i}}^{b_{i}} f(x) dx. This is rather expensive!
A discrete probability function specified via PF = f is used to compute the cell probabilities via the summation p_{i} = Σ f(x), taken over the integers x with a_{i} < x ≤ b_{i}.
Note:
Thus, with the specification 
With the specification PF = f, the value −∞ is not admitted for the left boundary a_{1} of the first cell c_{1} = (a_{1}, b_{1}].
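The summation used with a PF specification can be sketched as follows, assuming (as in Example 2) a distribution supported on the integers. This Python sketch is an illustration, not the library's code; binomial_pf and pf_cell_probability are hypothetical names. It also shows why a finite left boundary is needed: the sum has to start at a definite integer.

```python
import math

def binomial_pf(n, p):
    """Probability function of the binomial distribution with parameters n and p."""
    def pf(x):
        return math.comb(n, x) * p ** x * (1 - p) ** (n - x)
    return pf

def pf_cell_probability(a, b, pf):
    """Sum pf(x) over the integers x with a < x <= b (a and b must be finite)."""
    first = math.floor(a) + 1     # smallest integer strictly greater than a
    last = math.floor(b)          # largest integer not exceeding b
    return sum(pf(x) for x in range(first, last + 1))

pf = binomial_pf(70, 0.5)
# The 12 cells of Example 2: [0, 30], [30, 31], ..., [39, 40], [40, 70]
cells = [(0, 30)] + [(i, i + 1) for i in range(30, 40)] + [(40, 70)]
probs = [pf_cell_probability(a, b, pf) for a, b in cells]
```

With these cells the value 0 is not covered by the summation, so the probabilities add up to 1 − 2^(−70) rather than exactly 1; the difference is far below floating-point resolution.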