# Kolmogorov-Smirnov test help?

arthurk on 20 Jan 2020
Edited: Adam Danz on 21 Jan 2020
I have the test data below. The kstest(x) function compares the distribution of the data against a standard normal distribution (mean 0, std 1). Is it better to simply call kstest(x), or to first correct the data so that its mean and standard deviation are 0 and 1 respectively?
Also, when doing so, do you get a probability of 0.1267 for the uncorrected data and 0.6506 for the corrected data?
I ask because I got significantly different values earlier.
Another question: are these probabilities realistic? When I plot the values in Excel, the graphs look more or less normally distributed, yet they don't pass the 5% significance level.
Thanks
1.481336
-0.15023
2.253639
-3.44891
-2.06993
-0.54504
3.077467
-0.49623
-0.23977
0.098674
0.237035
-5.38399
1.753639
-1.65023
0.644677
1.407635
0.077467
-0.66607
1.981336
2.644677
-0.12763
4.035716
-1.18049
-1.04504
0.614422
1.345996
1.224973
-3.49454
-4.23659
0.223383
0.907635
0.724973

Adam Danz on 20 Jan 2020
Edited: Adam Danz on 20 Jan 2020
"Is it better to simply call the function as kstest(x) or correct the data so that its standard deviation and mean is 1 and 0 respectively"
The one-sample Kolmogorov-Smirnov test tests the null hypothesis that the data comes from a standard normal distribution (mean 0, std 1). If you correct your data so that it does have a mean of 0 and std of 1, what's the point of testing it?
If you want a more general test that your data come from a normal distribution with any mean or std, use the Anderson-Darling test or the Lilliefors test.
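There is also a statistical reason not to standardize and then call kstest(): the mean and std are estimated from the same data, so the data are pulled toward the reference distribution and the test rejects much less often than its nominal 5%. The Lilliefors test is the version of the KS test that accounts for this. Below is a minimal simulation sketch of that effect; the seed, the number of repetitions, and the variable names are arbitrary choices made for illustration.

```matlab
rng(1);                          % fixed seed (arbitrary) for repeatability
nRep = 2000;                     % number of simulated samples (illustrative)
n    = 32;                       % same sample size as in the question
rejKS = false(nRep,1);           % decisions of kstest on standardized data
rejLF = false(nRep,1);           % decisions of lillietest on the raw data

warnState = warning('off','all');    % silence any out-of-range p-value warnings
for k = 1:nRep
    x = 3 + 2*randn(n,1);            % truly normal, but not standard normal
    z = (x - mean(x)) / std(x);      % "corrected" data: mean 0, std 1
    rejKS(k) = kstest(z);            % KS test against N(0,1)
    rejLF(k) = lillietest(x);        % normality test with estimated parameters
end
warning(warnState);                  % restore the previous warning state

fprintf('kstest on standardized data rejects in %.2f%% of samples\n', 100*mean(rejKS));
fprintf('lillietest on raw data rejects in      %.2f%% of samples\n', 100*mean(rejLF));
```

Since every simulated sample really is normal, a well-calibrated 5% test should reject in roughly 5% of them. The standardized kstest() typically rejects far less often than that, which is why a large p-value from it says very little.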
Null hypotheses (from the documentation)
One-sample Kolmogorov-Smirnov test: the data in vector x comes from a standard normal distribution (mean 0, std 1).
Lilliefors test: the data in vector x comes from a distribution in the normal family.
Anderson-Darling test: the data in vector x is from a population with a normal distribution.
If the null hypothesis is rejected (an outcome of 1 for all three tests), the data do not come from those distributions at a 5% significance level.
Note that if there is a failure to reject the null hypothesis (an outcome of 0 for all three tests), that does not indicate that the data do come from those distributions. This is a common misunderstanding when interpreting hypothesis tests.
Here's a demonstration showing the difference between kstest and the other two.
% Create data from a normal distribution with
% mean 0 and std 1.
x0 = randn(1,10000);
% Use that exact same data to create a normal distribution
% with mean 5 and std 2
x1 = x0*2 + 5;
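To make that last point concrete, here is a small sketch that draws samples from a distribution that is definitely not normal (uniform, scaled to mean 0 and std 1) and counts how often kstest() rejects. The sample size, seed, and variable names are illustrative assumptions.

```matlab
rng(2);                      % fixed seed (arbitrary) for repeatability
nRep = 2000;                 % number of simulated samples (illustrative)
n    = 20;                   % small sample size (illustrative)
rej  = false(nRep,1);

for k = 1:nRep
    % Uniform on [-sqrt(3), sqrt(3)]: mean 0, std 1, but not normal
    u = (rand(n,1) - 0.5) * sqrt(12);
    rej(k) = kstest(u);      % KS test against the standard normal
end

fprintf('Rejected in %.1f%% of %d non-normal samples\n', 100*mean(rej), nRep);
```

With samples this small, most of the uniform samples are not rejected even though none of them come from a normal distribution. An outcome of 0 only means the test found no evidence against the null hypothesis.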
% Plot both distributions
clf()
histogram(x0)
hold on
histogram(x1)
Notice how this creates two normal distributions. The blue distribution has a mean of 0 and std of 1 while the reddish distribution has a mean of 5 and std of 2 (approximately).
% Look at the results of the ks-tests
ks0 = kstest(x0) % fail to reject
ks1 = kstest(x1) % reject null hyp
% Look at the results of the Lilliefors test
lt0 = lillietest(x0) % fail to reject
lt1 = lillietest(x1) % fail to reject
% Look at the results from the Anderson-Darling test
ad0 = adtest(x0) % fail to reject
ad1 = adtest(x1) % fail to reject
As you can see, kstest() does not reject the blue distribution as a standard normal distribution, which is expected since it has a mean of 0 and std of 1 (approximately), while it does reject the other distribution. However, neither lillietest() nor adtest() rejects normality for either distribution, which is consistent with both being normal.
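The same three tests can be run side by side on the data from the question. This is a minimal sketch, assuming the 32 values are typed into a vector named x (a name chosen here). The p-values are printed rather than quoted because they were not checked for this post.

```matlab
% Data copied from the question
x = [1.481336 -0.15023 2.253639 -3.44891 -2.06993 -0.54504 3.077467 ...
     -0.49623 -0.23977 0.098674 0.237035 -5.38399 1.753639 -1.65023 ...
     0.644677 1.407635 0.077467 -0.66607 1.981336 2.644677 -0.12763 ...
     4.035716 -1.18049 -1.04504 0.614422 1.345996 1.224973 -3.49454 ...
     -4.23659 0.223383 0.907635 0.724973];

% One-sample KS test: is x from the standard normal N(0,1)?
[hKS, pKS] = kstest(x);

% Normality tests with unspecified mean and std
[hLF, pLF] = lillietest(x);
[hAD, pAD] = adtest(x);

fprintf('kstest:     h = %d, p = %.4f\n', hKS, pKS);
fprintf('lillietest: h = %d, p = %.4f\n', hLF, pLF);
fprintf('adtest:     h = %d, p = %.4f\n', hAD, pAD);
```

kstest() answers "is this standard normal?", while the other two answer "is this normal with some mean and std?", so their decisions do not have to agree.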

Adam Danz on 20 Jan 2020
In case you haven't seen the update in my answer, I copy-pasted the wrong comments next to the normality tests. They are corrected now.
When any of the tests output a value of 1 (true) that means you are rejecting the null hypothesis (that the values come from those distributions).
Since the data in my demo are indeed good fits to normal distributions, the tests fail to reject the null hypothesis, except for kstest() on the reddish distribution, which clearly isn't a standard normal distribution.
arthurk on 21 Jan 2020
Hmmm, thank you for the correction.
Your clarification has contradicted my understanding of the null hypothesis, so I hope you can help me clear it up.
The definition of the null hypothesis I have been using: a null hypothesis is a type of hypothesis used in statistics that proposes that no statistical significance exists in a set of given observations.
Therefore, I would have expected that, when comparing normally distributed data with a crude-looking sample like the ones in your examples, the tests would come back true because there is a relationship between the two, i.e. the null hypothesis is rejected.
Failing to reject would mean otherwise.
Where have I gone wrong?
Adam Danz on 21 Jan 2020
That definition from Investopedia isn't precise enough. Even wiki's definition isn't precise enough: the null hypothesis is a general statement or default position that there is nothing significantly different happening.
To be more precise, the null hypothesis is typically a test statement that there is no difference in what you're testing at some arbitrary significance level.
The key differences between this definition and the others are
1. The null hypothesis is not a general statement. On the contrary, it's specific to the test you're running.
2. It doesn't test that "nothing is significant" or "no significance exists". Instead, it only tests the specific property that the test is designed to test, and it bases significance on an arbitrary alpha value (e.g., alpha = 0.05).
These very common misconceptions are why hundreds of scientists, statisticians, and researchers have supported a movement to stop significance testing.
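To illustrate how the decision depends on the chosen alpha (point 2 above), here is a short sketch that runs kstest() on one sample at several significance levels using its 'Alpha' name-value argument. The sample, the seed, and the list of alpha values are hypothetical choices for illustration.

```matlab
rng(3);                              % fixed seed (arbitrary) for repeatability
x = randn(1,40) + 0.3;               % hypothetical sample, slightly shifted from N(0,1)

[~, p] = kstest(x);                  % the p-value does not depend on alpha
alphas = [0.01 0.05 0.10 0.20];      % significance levels to compare
h = false(size(alphas));

for k = 1:numel(alphas)
    h(k) = kstest(x, 'Alpha', alphas(k));   % decision at this alpha
end

fprintf('p = %.4f\n', p);
for k = 1:numel(alphas)
    fprintf('alpha = %.2f -> reject = %d\n', alphas(k), h(k));
end
```

The data and the p-value are identical in every call. Only the threshold changes, so the same sample can be "significant" at one alpha and not at another.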
Critically, the null hypothesis is whatever the test deems it to be. Look at the MATLAB documentation for kstest: