Documentation Center
The Kolmogorov-Smirnov goodness-of-fit test
This functionality does not run in MATLAB.
stats::ksGOFT(x_{1}, x_{2}, …, CDF = f)
stats::ksGOFT([x_{1}, x_{2}, …], CDF = f)
stats::ksGOFT(s, <c>, CDF = f)
stats::ksGOFT([x_{1}, x_{2}, …], CDF = f) applies the Kolmogorov-Smirnov goodness-of-fit test for the null hypothesis: "x_{1}, x_{2}, … is an f-distributed sample".
External statistical data stored in an ASCII file can be imported into a MuPAD^{®} session via import::readdata. In particular, see Example 1 of the corresponding help page.
An error is raised if any of the data cannot be converted to a real floating-point number.
Let y_{1}, …, y_{n} be the input data x_{1}, …, x_{n} arranged in ascending order. stats::ksGOFT returns the list

[PValue1 = p1, StatValue1 = K1, PValue2 = p2, StatValue2 = K2]

containing the following information:
K1 is the Kolmogorov-Smirnov statistic K1 = sqrt(n) · max(i/n − f(y_{i}), i = 1 … n).
p1 is the observed significance level of the statistic K1.
K2 is the Kolmogorov-Smirnov statistic K2 = sqrt(n) · max(f(y_{i}) − (i − 1)/n, i = 1 … n).
p2 is the observed significance level of the statistic K2.
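As a sketch outside MuPAD, the two statistics can be computed directly from the sorted data. The following Python helper is illustrative code, not part of the stats-package; it assumes the standard one-sided definitions K1 = sqrt(n) · max(i/n − f(y_{i})) and K2 = sqrt(n) · max(f(y_{i}) − (i − 1)/n):

```python
import math

def ks_statistics(data, cdf):
    """One-sided Kolmogorov-Smirnov statistics of a sample against a CDF.

    data: the sample x_1, ..., x_n (real numbers, in any order)
    cdf:  the hypothesized cumulative distribution function f
    Returns (K1, K2) with
      K1 = sqrt(n) * max(i/n - f(y_i)),      i = 1..n,
      K2 = sqrt(n) * max(f(y_i) - (i-1)/n),  i = 1..n,
    where y_1 <= ... <= y_n is the sorted sample.
    """
    y = sorted(data)
    n = len(y)
    k1 = math.sqrt(n) * max(i / n - cdf(v) for i, v in enumerate(y, start=1))
    k2 = math.sqrt(n) * max(cdf(v) - (i - 1) / n for i, v in enumerate(y, start=1))
    return k1, k2
```

For the mid-point grid y_{i} = (i − 1/2)/n tested against the uniform CDF f(x) = x, both statistics evaluate to 0.5/sqrt(n).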
For the Kolmogorov-Smirnov statistic K corresponding to K1 or K2, respectively, the observed significance levels p1, p2 are computed by an asymptotic approximation of the exact probability

Pr(K > K1) and Pr(K > K2), respectively.

For large n, these probabilities are approximated by

Pr(K > t) ≈ e^{−2 t^{2}} (1 − 2 t/(3 sqrt(n))), with t = K1 or t = K2.
Thus, the observed significance levels returned by stats::ksGOFT approximate the exact probabilities for large n. Roughly speaking, for n = 10 the 3 leading digits of p1, p2 agree with the exact probabilities; for n = 100, the 4 leading digits; for n = 1000, the 6 leading digits.
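For illustration, the asymptotic approximation can be evaluated in a few lines of Python. This is a sketch assuming the expansion Pr(K > t) ≈ e^{−2t²}(1 − 2t/(3 sqrt(n))); the helper name is made up for this example:

```python
import math

def ks_pvalue(t, n):
    """Asymptotic observed significance level Pr(K > t) for sample size n,
    using Pr(K > t) ~ exp(-2*t^2) * (1 - 2*t/(3*sqrt(n))).
    The correction term shrinks as n grows; the result is clamped to
    [0, 1] to guard against the approximation drifting outside the
    probability range for small n.
    """
    p = math.exp(-2.0 * t * t) * (1.0 - 2.0 * t / (3.0 * math.sqrt(n)))
    return min(max(p, 0.0), 1.0)
```

A statistic value t near 0 gives an observed significance level near 1 (close agreement between the empirical and hypothesized distribution), while large t drives the level toward 0.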
The observed significance level PValue1 = p1 returned by stats::ksGOFT has to be interpreted in the following way:
Under the null hypothesis, the probability p1 = Pr(K > K1) should not be small. Specifically, p1 = Pr(K > K1) ≥ α should hold for a given significance level α with 0 < α ≪ 1. If this condition is violated, the hypothesis may be rejected at level α.
Thus, if the observed significance level p1 = Pr(K > K1) satisfies p1 < α, the sample leading to the value K1 of the statistic K represents an unlikely event, and the null hypothesis may be rejected at level α.
The corresponding interpretation holds for PValue2 = p2: if p2 = Pr(K > K2) satisfies p2 < α, the null hypothesis may be rejected at level α.
Note that both observed significance levels p1, p2 must be sufficiently large to make the data pass the test. The null hypothesis may be rejected at level α if any of the two values is smaller than α.
If p1 and p2 are both close to 1, this should raise suspicion about the randomness of the data: they indicate a fit that is too good.
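The decision rule described in this section can be summarized as a small sketch (the thresholds and the helper name are illustrative, not part of stats::ksGOFT):

```python
def ks_decision(p1, p2, alpha=0.05, suspicious=0.999):
    """Interpret the two observed significance levels of the KS test.

    Reject the null hypothesis at level alpha if either p-value is
    smaller than alpha; flag a "too good" fit if both p-values are
    implausibly close to 1; otherwise accept.
    """
    if p1 < alpha or p2 < alpha:
        return "reject"
    if p1 > suspicious and p2 > suspicious:
        return "suspicious: fit too good"
    return "accept"
```

Note that both p-values enter the rule: a single small value suffices for rejection.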
Distributions that are not provided by the stats-package can easily be implemented by the user. A user-defined procedure f can implement any cumulative distribution function; stats::ksGOFT calls f(x) with real floating-point arguments x from the data sample. The function f must return a numerical real value between 0 and 1. Cf. Example 3.
The function is sensitive to the environment variable DIGITS which determines the numerical working precision.
We create a sample of 1000 normally distributed random numbers:
r := stats::normalRandom(0, 1, Seed = 123): data := [r() $ i = 1 .. 1000]:
We test whether these data are indeed normally distributed with mean 0 and variance 1. We pass the corresponding cumulative distribution function stats::normalCDF(0, 1) to stats::ksGOFT:
stats::ksGOFT(data, CDF = stats::normalCDF(0, 1))
The result shows that the data can be accepted as a sample of normally distributed numbers: both observed significance levels PValue1 and PValue2 are not small.
Next, we inject some further data into the sample:
data := data . [frandom() $ i = 1..100]: stats::ksGOFT(data, CDF = stats::normalCDF(0, 1))
Now, the data should not be accepted as a sample of normal deviates with mean 0 and variance 1, because the second observed significance level PValue2 = 0.000065.. is very small.
delete r, data:
We create a sample consisting of one string column and two non-string columns:
s := stats::sample( [["1996", 1242, PI - 1/2], ["1997", 1353, PI + 0.3], ["1998", 1142, PI + 0.5], ["1999", 1201, PI - 1], ["2001", 1201, PI]])
"1996" 1242 PI - 1/2 "1997" 1353 PI + 0.3 "1998" 1142 PI + 0.5 "1999" 1201 PI - 1 "2001" 1201 PI
We consider the data in the third column. The mean and the variance of these data are computed:
[m, v] := [stats::mean(s, 3), stats::variance(s, 3)]
We check whether the data of the third column are normally distributed with the mean and variance computed above:
stats::ksGOFT(s, 3, CDF = stats::normalCDF(m, v))
Both observed significance levels PValue1 and PValue2 returned by the test are not small. There is no reason to reject the null hypothesis that the data are normally distributed.
delete s, m, v:
We demonstrate how user-defined distribution functions can be used. The following function represents the cumulative distribution function Pr(X ≤ x) = x^{2} of a variable X supported on the interval [0, 1]. It will be called with floating-point arguments x and must return numerical values between 0 and 1:
f := proc(x)
begin
  if x <= 0 then
    return(0)
  elif x < 1 then
    return(x^2)
  else
    return(1)
  end_if
end_proc:
We test the hypothesis that the following data are f-distributed:
data := [sqrt(frandom()) $ k = 1..10^2]: stats::ksGOFT(data, CDF = f)
At a given significance level of 0.1, say, the hypothesis should not be rejected: both observed significance levels p1 and p2 exceed 0.1.
delete f, data:
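As a deterministic cross-check of this example outside MuPAD, the following Python sketch evaluates the two statistics for the "ideal" sample y_{i} = sqrt((i − 1/2)/n), for which f(y_{i}) = (i − 1/2)/n holds exactly. The code is illustrative; the formulas are the one-sided statistics from the "Details" section, and the variable names are made up:

```python
import math

def cdf_square(x):
    """Cumulative distribution function Pr(X <= x) = x^2 on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x < 1.0:
        return x * x
    return 1.0

n = 100
# "Ideal" f-distributed sample: square roots of the mid-point grid on [0, 1].
data = [math.sqrt((i - 0.5) / n) for i in range(1, n + 1)]
y = sorted(data)
k1 = math.sqrt(n) * max(i / n - cdf_square(v) for i, v in enumerate(y, 1))
k2 = math.sqrt(n) * max(cdf_square(v) - (i - 1) / n for i, v in enumerate(y, 1))
# For this grid both statistics equal 0.5/sqrt(n) = 0.05: a small
# deviation that any reasonable significance level would accept.
print(k1, k2)
```

A genuinely random sample of sqrt-transformed uniform deviates will scatter around this ideal configuration, producing somewhat larger statistic values.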
x_{1}, x_{2}, …: The statistical data (real numerical values).
f: A procedure representing a cumulative distribution function. Typically, one of the distribution functions of the stats-package, such as stats::normalCDF(n, v).
s: A sample of domain type stats::sample.
c: An integer representing a column index of the sample s. This column provides the data x_{1}, x_{2}, etc. There is no need to specify a column number c if the sample has only one column.
List with four equations [PValue1 = p1, StatValue1 = K1, PValue2 = p2, StatValue2 = K2], with floating-point values p1, K1, p2, K2. See the "Details" section below for the interpretation of these values.
D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, p. 48 ff. Addison-Wesley (1998).