Classical chi-square goodness-of-fit test
This functionality does not run in MATLAB.
stats::csGOFT(x_{1}, x_{2}, …, [[a_{1}, b_{1}], [a_{2}, b_{2}], …], CDF = f | PDF = f | PF = f)
stats::csGOFT([x_{1}, x_{2}, …], [[a_{1}, b_{1}], [a_{2}, b_{2}], …], CDF = f | PDF = f | PF = f)
stats::csGOFT(s, <c>, [[a_{1}, b_{1}], [a_{2}, b_{2}], …], CDF = f | PDF = f | PF = f)
stats::csGOFT(data, cells, CDF = f) applies the classical chi-square goodness-of-fit test for the null hypothesis: "the data are f-distributed".
The chi-square goodness-of-fit test divides the real line into k intervals ('the cells'). It computes the number of data x_{j} falling into the cells c_{i} and compares these 'empirical cell frequencies' with the 'expected cell frequencies' n p_{i}, where n is the sample size and p_{i} = Pr(a_{i} < x ≤ b_{i}) are the 'cell probabilities' of a random variable with the hypothesized distribution specified by X = f.
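The counting step described above can be sketched outside MuPAD in a few lines of Python. This is only an illustration under stated assumptions: the helper names `cell_frequencies` and `expected_frequencies` and the sample numbers are hypothetical, and the cells are taken as semi-open intervals (a, b].

```python
import math

def cell_frequencies(data, cells):
    """Count how many data points fall into each semi-open cell (a, b]."""
    counts = [0] * len(cells)
    for x in data:
        for i, (a, b) in enumerate(cells):
            if a < x <= b:
                counts[i] += 1
                break  # cells are disjoint, so at most one cell matches
    return counts

def expected_frequencies(n, probs):
    """Expected cell frequencies n * p_i for sample size n."""
    return [n * p for p in probs]

# Hypothetical data and cells, chosen only for illustration.
data = [0.2, 1.0, 1.5, 2.0, 2.7, 3.0]
cells = [(-math.inf, 1.0), (1.0, 2.0), (2.0, math.inf)]

observed = cell_frequencies(data, cells)
expected = expected_frequencies(len(data), [0.25, 0.5, 0.25])
```

Note that the value 1.0 is counted in the first cell and 2.0 in the second, because the right boundary belongs to a cell and the left boundary does not.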
All data x_{1}, x_{2} etc. must be convertible to real floating-point numbers. The data do not have to be sorted on input: stats::csGOFT automatically converts the data to floats and sorts them internally.
External statistical data stored in an ASCII file can be imported into a MuPAD^{®} session via import::readdata. In particular, see Example 1 of the corresponding help page.
Finite cell boundaries a_{i}, b_{i} must be convertible to real floating-point numbers satisfying a_{1} < b_{1} ≤ a_{2} < b_{2} ≤ a_{3} < …. They define the semi-open intervals c_{i} = (a_{i}, b_{i}] = {x; a_{i} < x ≤ b_{i}}.
When the hypothesized distribution f is specified as a cumulative distribution function (CDF = f), the left boundary of the first cell and the right boundary of the last cell are ignored. They are replaced by -∞ and ∞, respectively, i.e., the cell partitioning
(-∞, b_{1}], (a_{2}, b_{2}], …, (a_{k - 1}, b_{k - 1}], (a_{k}, ∞)
is used internally.
The cells must be disjoint. Their union must cover the support area of the distribution, i.e., the 'cell probabilities' p_{i} = Pr(a_{i} < x ≤ b_{i}) must add up to 1 for a random variable x with the hypothesized distribution given by f. For continuous distributions, adjacent cells with b_{1} = a_{2}, b_{2} = a_{3}, … are appropriate.
You may use a_{1} = - ∞ and b_{k} = ∞ for distributions supported on the entire real line.
See the `Background' section of this help page for recommendations on the cell partitioning. In particular, the use of equiprobable cells (with constant p_{i}) is recommended. For convenience, a utility function stats::equiprobableCells is provided to generate such cells. See Example 1, Example 3, and Example 4.
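The idea behind equiprobable cells can be sketched in Python: the inner boundaries are the quantiles at 1/k, 2/k, …, (k - 1)/k. This is a sketch, not the implementation of stats::equiprobableCells. The function name `equiprobable_cells` is hypothetical, the outer boundaries are assumed to be -∞ and ∞, and Python's `NormalDist` takes a standard deviation where stats::normalQuantile takes a variance.

```python
import math
from statistics import NormalDist

def equiprobable_cells(k, quantile):
    """Return k cells [a, b], each with probability 1/k under the
    distribution whose quantile function is `quantile`."""
    inner = [quantile(i / k) for i in range(1, k)]
    bounds = [-math.inf] + inner + [math.inf]
    return [[bounds[i], bounds[i + 1]] for i in range(k)]

# Normal distribution with mean 15 and variance 2 (sigma = sqrt(2)).
dist = NormalDist(mu=15, sigma=math.sqrt(2))
cells = equiprobable_cells(4, dist.inv_cdf)

def cdf(x):
    """CDF that handles the infinite outer boundaries explicitly."""
    if x == -math.inf:
        return 0.0
    if x == math.inf:
        return 1.0
    return dist.cdf(x)

probs = [cdf(b) - cdf(a) for a, b in cells]
```

Each entry of `probs` equals 1/4 up to rounding, and the boundary between the second and third cell is the median 15.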
The distribution the data are tested for is specified by the equation X = f, where X is one of the flags CDF, PDF or PF.
For efficiency, it is recommended to specify a cumulative distribution function (CDF = f).
The function f can be a procedure provided by the MuPAD stats library. Specifications such as CDF = stats::normalCDF(m, v) or CDF = stats::poissonCDF(m) with suitable numerical values of m, v are possible and recommended.
Distributions that are not provided by the stats-package can be implemented easily by the user. A user defined procedure f can implement any distribution function. In the CDF case, stats::csGOFT calls f with the boundary values a_{i}, b_{i} of the cells to compute the cell probabilities via p_{i} = f(b_{i}) - f(a_{i}) (automatically setting f(a_{1}) = 0 and f(b_{k}) = 1).
The function f must return a numerical real value between 0 and 1. See Example 5 and Example 6.
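How the cell probabilities follow from a user-defined CDF can be sketched in Python. The sketch mirrors the rule stated above (f(a_{1}) is taken as 0 and f(b_{k}) as 1); the helper name `cell_probabilities_from_cdf` and the example cells are assumptions made for illustration.

```python
def cell_probabilities_from_cdf(cells, cdf):
    """p_i = cdf(b_i) - cdf(a_i), with cdf(a_1) := 0 and cdf(b_k) := 1."""
    k = len(cells)
    probs = []
    for i, (a, b) in enumerate(cells):
        lower = 0.0 if i == 0 else cdf(a)      # leftmost boundary ignored
        upper = 1.0 if i == k - 1 else cdf(b)  # rightmost boundary ignored
        probs.append(upper - lower)
    return probs

def cdf(x):
    """CDF of a variable supported on [0, 1] with F(x) = x^2 there."""
    if x <= 0:
        return 0.0
    if x <= 1:
        return x * x
    return 1.0

cells = [[0.0, 0.5], [0.5, 0.75], [0.75, 1.0]]
probs = cell_probabilities_from_cdf(cells, cdf)
```

Because the outer boundaries are replaced, the probabilities always add up to 1, even if the cells given by the user do not cover the whole support.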
Alternatively, the function f can be specified by a univariate arithmetical expression g(x) depending on a symbolic variable x. It is interpreted as the function f: x -> g(x). Cf. Example 6.
See the `Background' section of this help page for further information on the specification of the distribution via CDF = f, PDF = f or PF = f.
The call stats::csGOFT(data, cells, X = f) returns the list [PValue = p, StatValue = s, MinimalExpectedCellFrequency = m]:
s is the observed value of the chi-square statistic
s = Σ_{i = 1}^{k} (y_{i} - n p_{i})^{2} / (n p_{i}),
where n is the sample size, k is the number of cells, y_{i} is the observed cell frequency of the data (i.e., y_{i} is the number of data x_{j} falling into the cell c_{i}), and p_{i} is the cell probability corresponding to the hypothesized distribution f.
p is the observed significance level of the chi-square statistic with k - 1 degrees of freedom, i.e., p = 1 - stats::chisquareCDF(k - 1)(s)
m is the minimum of the expected cell frequencies n p_{i}. This information is provided by the test to make sure that the boundary conditions for a "reasonable" cell partitioning are met (see the "Background" section of this help page).
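The three returned values can be reproduced with a short Python sketch. It is not the MuPAD implementation: the names `chi2_sf` and `chi_square_gof` are hypothetical, the observed frequencies are made up, and the upper tail of the chi-square distribution is evaluated with the standard closed forms for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper tail Pr(S > x) of the chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(chi/sqrt 2) + sqrt(2/pi) exp(-x/2) * (chi + chi^3/3 + ...)
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    tail = math.sqrt(2 / math.pi) * math.exp(-x / 2) * total
    return math.erfc(chi / math.sqrt(2)) + tail

def chi_square_gof(observed, probs):
    """Return (p-value, statistic, minimal expected cell frequency)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = sum((y - e) ** 2 / e for y, e in zip(observed, expected))
    return chi2_sf(stat, len(observed) - 1), stat, min(expected)

# Hypothetical observed cell frequencies for 4 equiprobable cells.
p_value, stat, min_expected = chi_square_gof([18, 22, 20, 20], [0.25] * 4)
```

Here the statistic is (4 + 4 + 0 + 0)/20 = 0.4 with 3 degrees of freedom, which gives a large p-value.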
The most relevant information returned by stats::csGOFT is the observed significance level PValue = p. It has to be interpreted in the following way: Under the null hypothesis, the chi-square statistic
S = Σ_{i = 1}^{k} (y_{i} - n p_{i})^{2} / (n p_{i})
is approximately chi-square distributed with k - 1 degrees of freedom (for large samples):
Pr(S ≤ s) ≈ stats::chisquareCDF(k - 1)(s).
Under the null hypothesis, the probability p = Pr(S > s) should not be small, where s is the value of the statistic attained by the sample.
Specifically, p = Pr(S > s) ≥ α should hold for a given significance level 0 < α < 1. If this condition is violated, the hypothesis may be rejected at level α.
Thus, if the PValue (observed significance level) p = Pr(S > s) satisfies p < α, the sample leading to the observed value s of the statistic S represents an unlikely event, and the null hypothesis may be rejected at level α.
On the other hand, values of p close to 1 should raise suspicion about the randomness of the data: they indicate a fit that is too good.
The function is sensitive to the environment variable DIGITS which determines the numerical working precision.
We consider random data that should be normally distributed with mean 15 and variance 2:
f := stats::normalRandom(15, 2, Seed = 0): data := [f() $ i = 1..1000]:
According to the recommendations in the `Background' section of this help page, the number of cells should be approximately 2 n^{2/5} ≈ 31.7, where n = 1000 is the sample size.
We wish to use 32 cells that are equiprobable with respect to the hypothesized normal distribution. We estimate the mean m and the variance v of the data:
[m, v] := [stats::mean(data), stats::variance(data, Sample)]
The utility function stats::equiprobableCells is used to compute an equiprobable cell partitioning via the quantile function of the normal distribution with the empirical parameters:
cells := stats::equiprobableCells(32, stats::normalQuantile(m, v)): stats::csGOFT(data, cells, CDF = stats::normalCDF(m, v))
The observed significance level attained by the sample is not small. Hence, one should not reject the hypothesis that the sample is normally distributed with the estimated mean m and variance v.
In the following, we contaminate the sample by appending some uniformly distributed numbers. A new equiprobable cell partitioning appropriate for the new data is computed:
r := stats::uniformRandom(10, 20, Seed = 0): data := append(data, r() $ 40): [m, v] := [stats::mean(data), stats::variance(data, Sample)]: k := round(2*nops(data)^(2/5)): cells := stats::equiprobableCells(k, stats::normalQuantile(m, v)): stats::csGOFT(data, cells, CDF = stats::normalCDF(m, v))
The contaminated data may be rejected as a normally distributed sample at levels as small as the PValue returned by this call.
delete f, data, m, v, k, cells, r:
We create a sample of random data that should be binomially distributed with trial parameter 70 and probability parameter 1/2:
r := stats::binomialRandom(70, 1/2, Seed = 123): data := [r() $ k = 1..1000]:
With the expectation value of 35 and the standard deviation of sqrt(70/4) ≈ 4.18 of this distribution, we expect most of the data to have values between 30 and 40. Thus, a cell partitioning consisting of 12 cells corresponding to the intervals
(0, 30], (30, 31], (31, 32], …, (39, 40], (40, 70]
should be appropriate. Note that all cells are interpreted as the intervals (a_{i}, b_{i}], i.e., the left boundary is not included in the interval. Strictly speaking, the value 0 is not covered by these cells. However, with a CDF specification, stats::csGOFT ignores the leftmost boundary and replaces it by -∞. Thus, the union of the cells does cover all integers 0, …, 70 that can be attained by the hypothesized binomial distribution with `trial parameter' 70:
cells := [[0, 30], [i, i + 1] $ i = 30..39, [40, 70]]
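The coverage remark can be checked with exact rational arithmetic in Python. This is a sketch with hypothetical helper names (`binomial_pf`, `cell_probability_pf`). It sums the binomial probabilities over the integers in each cell (a, b], taken literally, and shows that the only missing mass is Pr(X = 0) = 2^(-70).

```python
from fractions import Fraction
from math import comb, floor

def binomial_pf(n, p):
    """Probability function of the binomial distribution."""
    return lambda x: comb(n, x) * p**x * (1 - p) ** (n - x)

def cell_probability_pf(a, b, pf):
    """Sum pf over all integers x with a < x <= b (finite a, b only)."""
    return sum(pf(x) for x in range(floor(a) + 1, floor(b) + 1))

# The 12 cells (0, 30], (30, 31], ..., (39, 40], (40, 70].
cells = [[0, 30]] + [[i, i + 1] for i in range(30, 40)] + [[40, 70]]

pf = binomial_pf(70, Fraction(1, 2))
probs = [cell_probability_pf(a, b, pf) for a, b in cells]

# Probability mass not covered when the leftmost boundary 0 is excluded.
missing = 1 - sum(probs)
```

The missing mass is far below floating-point resolution relative to 1, so the leftmost boundary has no visible effect on the test result in this example.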
We apply the χ^{2} test with various specifications of the binomial distribution. They all produce the same result. However, the first call using a CDF specification is the most efficient (fastest) call:
stats::csGOFT(data, cells, CDF = stats::binomialCDF(70, 1/2));
stats::csGOFT(data, cells, PF = stats::binomialPF(70, 1/2));
f := binomial(70, x)*(1/2)^x*(1/2)^(70 - x): stats::csGOFT(data, cells, PF = f)
The observed significance level indicates that the data pass the test well.
Next, we dope the sample by appending the value 35 forty times:
data := data . [35 $ 40]: stats::csGOFT(data, cells, CDF = stats::binomialCDF(70, 1/2));
Now, the data may be rejected as a binomial sample with the specified parameters at levels as small as the PValue returned by this call.
delete r, data, cells, f:
We test data that purport to be a sample of beta distributed numbers with shape parameters 3 and 2. Since beta deviates attain values between 0 and 1, we choose an equidistant cell partitioning of the interval [0, 1] consisting of 10 cells. Various equivalent calls to stats::csGOFT are demonstrated:
r := stats::betaRandom(3, 2, Seed = 1): data := [r() $ i = 1..100]: cells := [[(i - 1)/10, i/10] $ i = 1..10]: stats::csGOFT(data, cells, CDF = stats::betaCDF(3, 2)); stats::csGOFT(data, cells, CDF = (x -> stats::betaCDF(3, 2)(x)))
Alternatively, the beta distribution may be passed by a PDF specification. This, however, is less efficient than the CDF specification used before:
stats::csGOFT(data, cells, PDF = stats::betaPDF(3, 2)); stats::csGOFT(data, cells, PDF = (x -> stats::betaPDF(3, 2)(x)));
The observed significance level is not small. Hence, this test does not indicate that the data should be rejected as a beta distributed sample with the specified parameters. Note, however, that the minimal expected cell frequency given by the third element of the returned list is rather small. This indicates that the cell partitioning is not well chosen. We investigate the expected cell frequencies by computing n p_{i} = n (f(b_{i}) - f(a_{i})), where f is the cumulative distribution function of the beta distribution and n is the sample size:
f:= stats::betaCDF(3, 2): map(cells, cell -> 100*(f(cell[2]) - f(cell[1])))
These values show that the first two or three cells should be joined to a single cell. We modify the cell partitioning by joining the first three and the last two cells:
cells := [[0, 3/10], [(i - 1)/10, i/10] $ i = 4..8, [8/10, 1]]
For this cell partitioning, the expected frequencies in a random sample of size 100 are sufficiently large for all cells:
map(cells, cell -> 100*(f(cell[2]) - f(cell[1])))
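The expected frequencies for both partitionings can be recomputed independently in Python. The sketch assumes the closed form F(x) = 4x^3 - 3x^4 for the CDF of the beta distribution with parameters 3 and 2 (density 12x^2(1 - x) on [0, 1]); the helper names are hypothetical.

```python
def beta32_cdf(x):
    """CDF of the beta distribution with parameters 3 and 2 on [0, 1]."""
    return 4 * x**3 - 3 * x**4

def expected_frequencies(cells, cdf, n):
    """Expected cell frequencies n * (F(b) - F(a))."""
    return [n * (cdf(b) - cdf(a)) for a, b in cells]

n = 100

# 10 equidistant cells on [0, 1].
equidistant = [[(i - 1) / 10, i / 10] for i in range(1, 11)]

# First three and last two cells joined, giving 7 cells.
joined = [[0, 0.3]] + [[(i - 1) / 10, i / 10] for i in range(4, 9)] + [[0.8, 1]]

freq_equidistant = expected_frequencies(equidistant, beta32_cdf, n)
freq_joined = expected_frequencies(joined, beta32_cdf, n)
```

Under these assumptions, the smallest expected frequency rises from 0.37 (first equidistant cell) to 8.37 (the joined cell (0, 3/10]).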
We apply another χ^{2} test with this improved partitioning:
stats::csGOFT(data, cells, CDF = f)
Again, the observed significance level is not small: the test does not give any hint that the data are not beta distributed with the specified parameters.
Now, we test whether the data can be regarded as being normally distributed. First, we estimate the parameters (mean and variance) required for the normal distribution:
[m, v] := [stats::mean(data), stats::variance(data, Sample)]
The cell partitioning used before was a partitioning of the interval [0, 1], because beta deviates attain values in this interval. Now we construct a partitioning of 7 equiprobable cells using the quantile function of the normal distribution:
k := 7: cells := stats::equiprobableCells(7, stats::normalQuantile(m, v))
Indeed, these cells are equiprobable:
f:= stats::normalCDF(m, v): map(cells, cell -> f(cell[2]) - f(cell[1]))
We test for normality with the estimated mean and variance:
stats::csGOFT(data, cells, CDF = f)
The observed significance level is not small, so the data should not be rejected as a normally distributed sample. We note that the nonparametric Shapiro-Wilk test implemented in stats::swGOFT does detect nonnormality of the sample:
stats::swGOFT(data)
The observed significance level returned by stats::swGOFT is small, so normality can be rejected at correspondingly low levels.
delete r, data, cells, f, m, v, k:
We demonstrate the use of samples of type stats::sample. We create a sample consisting of one string column and two non-string columns:
s := stats::sample( [["1996", 1242, 156], ["1997", 1353, 162], ["1998", 1142, 168], ["1999", 1201, 182], ["2001", 1201, 190], ["2001", 1201, 190], ["2001", 1201, 205], ["2001", 1201, 210], ["2001", 1201, 220], ["2001", 1201, 213], ["2001", 1201, 236], ["2001", 1201, 260], ["2001", 1201, 198], ["2001", 1201, 236], ["2001", 1201, 245], ["2001", 1201, 188], ["2001", 1201, 177], ["2001", 1201, 233], ["2001", 1201, 270]])
"1996"  1242  156
"1997"  1353  162
"1998"  1142  168
"1999"  1201  182
"2001"  1201  190
"2001"  1201  190
"2001"  1201  205
"2001"  1201  210
"2001"  1201  220
"2001"  1201  213
"2001"  1201  236
"2001"  1201  260
"2001"  1201  198
"2001"  1201  236
"2001"  1201  245
"2001"  1201  188
"2001"  1201  177
"2001"  1201  233
"2001"  1201  270
We consider the data in the third column. The mean and the variance of these data are computed:
[m, v] := float([stats::mean(s, 3), stats::variance(s, 3, Sample)])
We check whether the data of the third column are normally distributed with the empirical mean and variance computed above. We compute an appropriate cell partitioning in the same way as explained in Example 1:
samplesize := s::dom::size(s): k := round(2*samplesize^(2/5)): cells := stats::equiprobableCells(k, stats::normalQuantile(m, v)): stats::csGOFT(s, 3, cells, CDF = stats::normalCDF(m, v))
Thus, the data pass the test.
delete s, m, v, samplesize, k, cells:
We demonstrate how user-defined distribution functions can be used. A die is rolled 60 times. The following frequencies of the scores 1, 2, …, 6 are observed:
score     | 1 |  2 | 3 |  4 | 5 | 6
----------+---+----+---+----+---+---
frequency | 7 | 16 | 8 | 17 | 3 | 9
We test the null hypothesis that the die is fair. Under this hypothesis, the variable X given by the score of a single roll attains the values 1 through 6 with constant probability 1/6. Presently, the stats-package does not provide a discrete uniform distribution, so we implement a corresponding cumulative discrete distribution function f:
f := proc(x) begin if x < 0 then 0 elif x <= 6 then trunc(x)/6 else 1 end_if; end_proc:
We create the data representing the 60 rolls:
data := [ 1 $ 7, 2 $ 16, 3 $ 8, 4 $ 17, 5 $ 3, 6 $ 9]:
We choose a collection of cells, each of which contains exactly one of the integers 1, …, 6:
cells := [[i - 1/2, i + 1/2] $ i = 1..6]
stats::csGOFT(data, cells, CDF = f)
At a significance level as small as about 0.011, the null hypothesis `the die is fair' should be rejected.
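The result for the die can be recomputed by hand or with a few lines of Python. This sketch does not call MuPAD. It uses the observed frequencies from the table above, the expected frequency 60/6 = 10 per cell, and the standard closed form of the chi-square upper tail for 5 degrees of freedom.

```python
import math

# Observed frequencies of the scores 1, ..., 6 in 60 rolls.
observed = [7, 16, 8, 17, 3, 9]
n = sum(observed)
expected = n / 6  # fair die: each score has probability 1/6

# Chi-square statistic: sum of (y_i - n p_i)^2 / (n p_i).
stat = sum((y - expected) ** 2 / expected for y in observed)

# Upper tail Pr(S > stat) for k - 1 = 5 degrees of freedom:
# erfc(chi / sqrt 2) + sqrt(2 / pi) * exp(-stat / 2) * (chi + chi^3 / 3)
chi = math.sqrt(stat)
p_value = (math.erfc(chi / math.sqrt(2))
           + math.sqrt(2 / math.pi) * math.exp(-stat / 2) * (chi + chi**3 / 3))

print(stat, p_value)  # statistic 14.8, p-value roughly 0.011
```

The statistic is (9 + 36 + 4 + 49 + 49 + 1)/10 = 14.8, which lies just below the 1% critical value for 5 degrees of freedom.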
delete f, data, cells:
We give a further demonstration of user-defined distribution functions. The following procedure represents the cumulative distribution function of a variable X supported on the interval [0, 1]. It will be called with values from the cell boundaries and must return numerical values between 0 and 1:
f := proc(x) begin if x <= 0 then return(0) elif x <= 1 then return(x^2) else return(1) end_if end_proc:
We test the hypothesis that the following data are f-distributed. The cells form an equidistant partitioning of the interval [0, 1]:
data := [sqrt(frandom()) $ i = 1..10^3]: k := 10: cells := [[(i - 1)/k, i/k] $ i = 1..k]: stats::csGOFT(data, cells, CDF = f)
The test does not disqualify the sample as being f-distributed. Indeed, for a uniform deviate Y on the interval [0, 1] (as produced by frandom), the cumulative distribution function of X = sqrt(Y) is given by f.
We note that the previous function yields the correct CDF values for all real arguments. The chosen cell partitioning indicates that only values from the interval [0, 1] are considered. Since stats::csGOFT just evaluates the CDF on the cell boundaries to compute the cell probability of the cell (a, b] by f(b) - f(a), it suffices to restrict f to the interval [0, 1]. Hence, for the chosen cells, the symbolic expression f = x^2 can also be used to specify the distribution:
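The claim about the distribution of the square root of a uniform deviate follows from a one-line computation, written out here for 0 ≤ x ≤ 1 with Y uniform on [0, 1]:

```latex
\Pr\bigl(\sqrt{Y} \le x\bigr) = \Pr\bigl(Y \le x^{2}\bigr) = x^{2},
\qquad 0 \le x \le 1 .
```

For x < 0 the probability is 0 and for x > 1 it is 1, which matches the three branches of the procedure f.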
stats::csGOFT(data, cells, CDF = x^2)
delete f, data, k, cells:
x_{1}, x_{2}, …
The statistical data: real numerical values
s
A sample of domain type stats::sample
c
An integer representing a column index of the sample s. This column provides the data x_{1}, x_{2} etc. There is no need to specify a column c if the sample has only one column.
a_{1}, b_{1}, a_{2}, b_{2}, …
Cell boundaries: real numbers satisfying a_{1} < b_{1} ≤ a_{2} < b_{2} ≤ a_{3} < …. Also ±∞ is admitted as a cell boundary. At least 3 cells have to be specified.
f
A procedure representing the hypothesized distribution: either a cumulative distribution function (CDF = f), a probability density function (PDF = f), or a (discrete) probability function (PF = f). Typically, f is one of the distribution functions of the stats package such as stats::normalCDF(m, v) etc. Instead of a procedure, an arithmetical expression in some indeterminate x may also be specified; it is interpreted as a function of x.
a list of three equations
[PValue = p, StatValue = s, MinimalExpectedCellFrequency = m]
with floating-point values p, s, m. See the "Details" section below for the interpretation of these values.
In R.B. D'Agostino and M.A. Stephens, "Goodness-Of-Fit Techniques", Marcel Dekker, 1986, pp. 70-71, one finds the following recommendations for choosing the cell partitioning:
The number of cells used should be approximately 2 n^{2/5}, where n is the sample size.
The cells should have equal probabilities p_{i} under the hypothesized distribution.
With equiprobable cells, the average of the expected cell frequencies n p_{i} should be at least 1 when testing at the significance level α = 0.05. For α = 0.01, the average expected cell frequency should be at least 2. When cells are not approximately equiprobable, the average expected cell frequency for the significance levels above should be doubled. For example, the average expected cell frequency at the significance level α = 0.01 should be at least 4.
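These recommendations can be turned into a small checker. The Python sketch below is only an illustration with hypothetical function names. It relies on the fact that the average expected cell frequency is always n/k, because the cell probabilities add up to 1, and it covers only the two significance levels mentioned above.

```python
def recommended_cell_count(n):
    """Number of cells suggested by the rule 2 * n^(2/5), rounded."""
    return round(2 * n ** (2 / 5))

def average_frequency_ok(n, k, alpha, equiprobable=True):
    """Check the average expected cell frequency n/k against the
    recommended minimum for alpha = 0.05 or alpha = 0.01."""
    thresholds = {0.05: 1.0, 0.01: 2.0}
    minimum = thresholds[alpha] * (1 if equiprobable else 2)
    return n / k >= minimum

k = recommended_cell_count(1000)          # 2 * 1000^(2/5) is about 31.7
ok_large = average_frequency_ok(1000, k, 0.05)
ok_small = average_frequency_ok(19, 6, 0.01, equiprobable=False)
```

For n = 1000 the rule gives 32 cells with an average expected frequency of about 31, far above the minimum. For 19 data in 6 cells that are not equiprobable, the average of about 3.2 falls short of the doubled minimum of 4 at α = 0.01.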
The distribution function f passed to stats::csGOFT via X = f is only used to compute the cell probabilities p_{i} = Pr(a_{i} < x ≤ b_{i}) of the cells c_{i} = (a_{i}, b_{i}].
A cumulative distribution function f specified by CDF = f is used to compute the cell probabilities via p_{i} = f(b_{i}) - f(a_{i}).
A probability density function f specified via PDF = f is used to compute the cell probabilities via numerical integration: p_{i} = ∫_{a_{i}}^{b_{i}} f(x) dx. This is rather expensive!
A discrete probability function specified via PF = f is used to compute the cell probabilities via the summation p_{i} = Σ f(x), where x runs over all integers with a_{i} < x ≤ b_{i}.
Note: Thus, with the specification PF = f, the distribution is implicitly supposed to be supported on the integers in the cells (a_{i}, b_{i}]. Do not use PF = f if the discrete probability function is not supported on the integers! Use CDF = f with an appropriate (discrete) cumulative distribution function instead!
With the specification PF = f, the value -∞ is not admitted for the left boundary a_{1} of the first cell c_{1} = (a_{1}, b_{1}].
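The three ways of obtaining a cell probability can be compared in a Python sketch. The function names are hypothetical, and the composite Simpson rule merely stands in for whatever numerical integrator MuPAD uses internally. The beta example assumes the density 12x^2(1 - x) with CDF 4x^3 - 3x^4.

```python
import math

def prob_from_cdf(a, b, cdf):
    """Cell probability from a CDF: F(b) - F(a)."""
    return cdf(b) - cdf(a)

def prob_from_pdf(a, b, pdf, steps=100):
    """Cell probability from a density via the composite Simpson rule
    (steps must be even)."""
    h = (b - a) / steps
    total = pdf(a) + pdf(b)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * pdf(a + j * h)
    return total * h / 3

def prob_from_pf(a, b, pf):
    """Cell probability from a discrete probability function:
    sum over all integers x with a < x <= b (finite a, b only)."""
    return sum(pf(x) for x in range(math.floor(a) + 1, math.floor(b) + 1))

def beta_pdf(x):
    return 12 * x**2 * (1 - x)

def beta_cdf(x):
    return 4 * x**3 - 3 * x**4

def die_pf(x):
    """Probability function of a fair die."""
    return 1 / 6 if 1 <= x <= 6 else 0.0

p_cdf = prob_from_cdf(0.3, 0.4, beta_cdf)   # one CDF difference
p_pdf = prob_from_pdf(0.3, 0.4, beta_pdf)   # 101 density evaluations
p_pf = prob_from_pf(0.5, 2.5, die_pf)       # scores 1 and 2
```

The CDF route needs two function evaluations per cell, while the PDF route needs many, which illustrates why the CDF specification is the cheaper choice.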