The Kolmogorov-Smirnov goodness-of-fit test
This functionality does not run in MATLAB.
stats::ksGOFT(x_{1}, x_{2}, …, CDF = f)
stats::ksGOFT([x_{1}, x_{2}, …], CDF = f)
stats::ksGOFT(s, <c>, CDF = f)
stats::ksGOFT([x_{1}, x_{2}, …], CDF = f) applies the Kolmogorov-Smirnov goodness-of-fit test for the null hypothesis: "x_{1}, x_{2}, … is an f-distributed sample".
External statistical data stored in an ASCII file can be imported into a MuPAD® session via import::readdata. In particular, see Example 1 of the corresponding help page.
An error is raised if any of the data cannot be converted to a real floating-point number.
Let y_{1}, …, y_{n} be the input data x_{1}, …, x_{n} arranged in ascending order. stats::ksGOFT returns the list [PValue1 = p1, StatValue1 = K1, PValue2 = p2, StatValue2 = K2] containing the following information:
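K1 is the Kolmogorov-Smirnov statistic $K1 = \sqrt{n}\,\max_{1 \le j \le n}\left(\frac{j}{n} - f(y_{j})\right)$.
p1 is the observed significance level Pr(K > K1) of the statistic K1.
K2 is the Kolmogorov-Smirnov statistic $K2 = \sqrt{n}\,\max_{1 \le j \le n}\left(f(y_{j}) - \frac{j - 1}{n}\right)$.
p2 is the observed significance level Pr(K > K2) of the statistic K2.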
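The two statistics can be illustrated outside MuPAD. The following Python sketch is not the stats::ksGOFT implementation; it assumes that K1 and K2 are the one-sided statistics K+ and K- in the convention of Knuth (see the reference at the end of this page), and the names ks_statistics and uniform_cdf are hypothetical helpers introduced only for this illustration.

```python
import math


def ks_statistics(data, cdf):
    """Return (K1, K2), assumed here to be Knuth's one-sided statistics K+ and K-."""
    y = sorted(float(x) for x in data)  # y_1 <= ... <= y_n
    n = len(y)
    if n == 0:
        raise ValueError("need at least one data point")
    root_n = math.sqrt(n)
    # enumerate() counts from 0, so j + 1 is the 1-based rank of the value v.
    k1 = root_n * max((j + 1) / n - cdf(v) for j, v in enumerate(y))
    k2 = root_n * max(cdf(v) - j / n for j, v in enumerate(y))
    return k1, k2


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (illustration only)."""
    return min(1.0, max(0.0, x))


k1, k2 = ks_statistics([0.25, 0.75], uniform_cdf)
print(k1, k2)
```

Both statistics are nonnegative by construction: the term for j = n in K1 is 1 - f(y_{n}), and the term for j = 1 in K2 is f(y_{1}).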
For the Kolmogorov-Smirnov statistic K corresponding to K1 or K2, respectively, the observed significance levels p1, p2 are computed by an asymptotic approximation of the exact probabilities Pr(K > K1) and Pr(K > K2). The approximation holds for large n, so the observed significance levels returned by stats::ksGOFT approximate the exact probabilities for large n.
Roughly speaking, for n = 10, the 3 leading digits of p1, p2 correspond to the exact probabilities. For n = 100, the 4 leading digits of p1, p2 correspond to the exact probabilities. For n = 1000, the 6 leading digits of p1, p2 correspond to the exact probabilities.
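To get a feeling for how an asymptotic approximation compares with the exact probability, the following Python sketch evaluates both for a one-sided statistic. It is an assumption, not a statement about MuPAD internals, that the relevant formulas are the ones given by Knuth for K+ and K-: the exact finite sum for Pr(K ≤ t/√n) and the approximation Pr(K > s) ≈ exp(-2 s^2) (1 - 2 s/(3 √n)). The function names are hypothetical.

```python
import math
from fractions import Fraction


def ks_cdf_exact(s, n):
    """Pr(K <= s) for a one-sided statistic, via Knuth's finite sum.

    With t = s*sqrt(n) and 0 <= t <= n:
    Pr = t / n^n * sum_{k=0}^{floor(t)} C(n,k) (k-t)^k (t+n-k)^(n-k-1).
    Rational arithmetic avoids cancellation in the alternating sum.
    """
    t = Fraction(s) * Fraction(math.sqrt(n))
    if t <= 0:
        return 0.0
    if t >= n:
        return 1.0
    total = Fraction(0)
    for k in range(math.floor(t) + 1):
        total += math.comb(n, k) * (k - t) ** k * (t + n - k) ** (n - k - 1)
    return float(t / Fraction(n) ** n * total)


def ks_sf_exact(s, n):
    """Exact upper tail Pr(K > s)."""
    return 1.0 - ks_cdf_exact(s, n)


def ks_sf_asymptotic(s, n):
    """Asymptotic upper tail Pr(K > s), assumed form for large n and s >= 0."""
    return math.exp(-2.0 * s * s) * (1.0 - 2.0 * s / (3.0 * math.sqrt(n)))


for n in (10, 30, 100):
    print(n, ks_sf_exact(1.0, n), ks_sf_asymptotic(1.0, n))
```

The exact sum is evaluated with fractions.Fraction, which keeps the sketch simple but makes it practical only for moderate n.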
The observed significance level PValue1 = p1 returned by stats::ksGOFT has to be interpreted in the following way:
Under the null hypothesis, the probability p1 = Pr(K > K1) should not be small. Specifically, p1 = Pr(K > K1) ≥ α should hold for a given significance level α. If this condition is violated, the hypothesis may be rejected at level α.
Thus, if the observed significance level p1 = Pr(K > K1) satisfies p1 < α, the sample leading to the value K1 of the statistic K represents an unlikely event and the null hypothesis may be rejected at level α.
The corresponding interpretation holds for PValue2 = p2: if p2 = Pr(K > K2) satisfies p2 < α, the null hypothesis may be rejected at level α.
Note that both observed significance levels p1, p2 must be sufficiently large to make the data pass the test. The null hypothesis may be rejected at level α if either of the two values is smaller than α.
If p1 and p2 are both close to 1, this should raise suspicion about the randomness of the data: they indicate a fit that is too good.
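The rejection rule described above can be written down in a few lines. This Python sketch only restates that rule; the function name ks_reject is hypothetical, and the value 0.4 used for PValue1 is an invented placeholder, while 0.000065 is the rounded PValue2 quoted in Example 1.

```python
def ks_reject(p1, p2, alpha):
    """Return True if the null hypothesis may be rejected at level alpha.

    Both observed significance levels must be at least alpha for the
    data to pass, so a single small value is enough to reject.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return min(p1, p2) < alpha


# PValue1 = 0.4 is a placeholder; PValue2 = 0.000065 is taken from Example 1.
print(ks_reject(0.4, 0.000065, alpha=0.05))
```

The sketch does not flag the opposite case of a suspiciously good fit (both values close to 1), because the text gives no numerical threshold for it.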
Distributions that are not provided by the stats package can be implemented easily by the user. A user-defined procedure f can implement any cumulative distribution function; stats::ksGOFT calls f(x) with real floating-point arguments from the data sample. The function f must return a numerical real value between 0 and 1. Cf. Example 3.
The function is sensitive to the environment variable DIGITS, which determines the numerical working precision.
Example 1
We create a sample of 1000 normally distributed random numbers:
r := stats::normalRandom(0, 1, Seed = 123): data := [r() $ i = 1 .. 1000]:
We test whether these data are indeed normally distributed with mean 0 and variance 1. We pass the corresponding cumulative distribution function stats::normalCDF(0, 1) to stats::ksGOFT:
stats::ksGOFT(data, CDF = stats::normalCDF(0, 1))
The result shows that the data can be accepted as a sample of normally distributed numbers: both observed significance levels PValue1 and PValue2 are not small.
Next, we inject some further data into the sample:
data := data . [frandom() $ i = 1..100]: stats::ksGOFT(data, CDF = stats::normalCDF(0, 1))
Now, the data should not be accepted as a sample of normal deviates with mean 0 and variance 1, because the second observed significance level PValue2 = 0.000065... is very small.
delete r, data:
Example 2
We create a sample consisting of one string column and two non-string columns:
s := stats::sample( [["1996", 1242, PI - 1/2], ["1997", 1353, PI + 0.3], ["1998", 1142, PI + 0.5], ["1999", 1201, PI - 1], ["2001", 1201, PI]])
"1996"  1242  PI - 1/2
"1997"  1353  PI + 0.3
"1998"  1142  PI + 0.5
"1999"  1201  PI - 1
"2001"  1201  PI
We consider the data in the third column. The mean and the variance of these data are computed:
[m, v] := [stats::mean(s, 3), stats::variance(s, 3)]
We check whether the data of the 3rd column are normally distributed with the mean and variance computed above:
stats::ksGOFT(s, 3, CDF = stats::normalCDF(m, v))
Both observed significance levels PValue1 and PValue2 returned by the test are not small. There is no reason to reject the null hypothesis that the data are normally distributed.
delete s, m, v:
Example 3
We demonstrate how user-defined distribution functions can be used. The following function represents the cumulative distribution function Pr(X ≤ x) = x^{2} of a variable X supported on the interval [0, 1]. It will be called with floating-point arguments x and must return numerical values between 0 and 1:
f := proc(x) begin if x <= 0 then return(0) elif x < 1 then return(x^2) else return(1) end_if end_proc:
We test the hypothesis that the following data are f-distributed:
data := [sqrt(frandom()) $ k = 1..10^2]: stats::ksGOFT(data, CDF = f)
At a given significance level of 0.1, say, the hypothesis should not be rejected: both observed significance levels p1 and p2 exceed 0.1.
delete f, data:
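For readers without a MuPAD session, the experiment with a user-defined distribution function can be mimicked in Python. This is a rough sketch under stated assumptions, not a port of stats::ksGOFT: it assumes the one-sided statistics and the asymptotic tail formula from Knuth, the name ks_goft is hypothetical, the seed 123 is arbitrary, and the numbers it prints will differ from any MuPAD output.

```python
import math
import random


def f(x):
    """CDF Pr(X <= x) = x^2 on [0, 1], mirroring the MuPAD procedure f."""
    if x <= 0:
        return 0.0
    if x < 1:
        return x * x
    return 1.0


def ks_goft(data, cdf):
    """Return statistics and approximate significance levels (assumed formulas)."""
    y = sorted(float(x) for x in data)
    n = len(y)
    root_n = math.sqrt(n)
    k1 = root_n * max((j + 1) / n - cdf(v) for j, v in enumerate(y))
    k2 = root_n * max(cdf(v) - j / n for j, v in enumerate(y))

    def p_value(s):
        # Assumed asymptotic tail probability Pr(K > s) for large n.
        return math.exp(-2.0 * s * s) * (1.0 - 2.0 * s / (3.0 * root_n))

    return {"PValue1": p_value(k1), "StatValue1": k1,
            "PValue2": p_value(k2), "StatValue2": k2}


rng = random.Random(123)  # arbitrary seed, chosen only for reproducibility
data = [math.sqrt(rng.random()) for _ in range(100)]  # sqrt(U) has CDF x^2
result = ks_goft(data, f)
print(result)

# A sample that is clearly not f-distributed: an evenly spaced grid on (0, 1).
grid = [(j - 0.5) / 1000 for j in range(1, 1001)]
bad = ks_goft(grid, f)
print(bad)
```

The second call illustrates the rejection case: the evenly spaced grid follows the uniform distribution rather than x^{2}, so one of the two significance levels becomes extremely small.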

x_{1}, x_{2}, …: The statistical data: real numerical values

f: A procedure representing a cumulative distribution function. Typically, one of the distribution functions of the stats package

s: A sample of domain type stats::sample

c: An integer representing a column index of the sample s
List with four equations [PValue1 = p1, StatValue1 = K1, PValue2 = p2, StatValue2 = K2], with floating-point values p1, K1, p2, K2. See the "Details" section for the interpretation of these values.
D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, p. 48. Addison-Wesley (1998).