# What is the kstest2 in MATLAB doing to compute the empirical distribution function?

Darcy Cordell on 22 Nov 2018
Edited: Darcy Cordell on 22 Nov 2018
According to Wikipedia, to compute the two-sample Kolmogorov-Smirnov test, you first compute the empirical cumulative distribution function (ECDF) for both samples and then find the maximum difference between the two ECDFs. MATLAB includes a built-in function called "ecdf", but the built-in kstest2 does not use it. Instead it uses this:
```matlab
%
% Calculate F1(x) and F2(x), the empirical (i.e., sample) CDFs.
%
binEdges = [-inf ; sort([x1;x2]) ; inf];
binCounts1 = histc (x1 , binEdges, 1);
binCounts2 = histc (x2 , binEdges, 1);
sumCounts1 = cumsum(binCounts1)./sum(binCounts1);
sumCounts2 = cumsum(binCounts2)./sum(binCounts2);
sampleCDF1 = sumCounts1(1:end-1);
sampleCDF2 = sumCounts2(1:end-1);
```
where x1 and x2 are your two samples.
Note that this does not give the same result as ecdf(x1) or ecdf(x2). For example, if x1 and x2 both have length n, then sampleCDF1 and sampleCDF2 will have a length of 2n+1, whereas ecdf(x1) and ecdf(x2) both have length n+1.
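To make the length mismatch concrete, here is a minimal sketch that runs the quoted lines on two made-up samples with three distinct values each. The sample values and the variable names n and D are assumptions for illustration only. The last line computes the maximum ECDF difference described above; it is not copied from the kstest2 source. ecdf itself is left out so the snippet runs without the toolbox.

```matlab
% Two made-up column-vector samples (illustration only), n = 3 each.
x1 = [1; 3; 5];
x2 = [2; 4; 6];
n  = numel(x1);

% Same steps as the quoted kstest2 excerpt.
binEdges   = [-inf; sort([x1; x2]); inf];   % 2n+2 edges
binCounts1 = histc(x1, binEdges, 1);
binCounts2 = histc(x2, binEdges, 1);
sumCounts1 = cumsum(binCounts1) ./ sum(binCounts1);
sumCounts2 = cumsum(binCounts2) ./ sum(binCounts2);
sampleCDF1 = sumCounts1(1:end-1);           % 2n+1 values
sampleCDF2 = sumCounts2(1:end-1);           % 2n+1 values

% Both vectors have the same length, so they can be subtracted
% element by element.
fprintf('length(sampleCDF1) = %d (2n+1 = %d)\n', numel(sampleCDF1), 2*n + 1);
D = max(abs(sampleCDF1 - sampleCDF2));      % maximum ECDF difference
fprintf('max |F1 - F2| = %.4f\n', D);
```

For these samples both vectors have 7 entries, one for -inf and one for each of the 6 pooled data points. Two separate ecdf outputs would each be tied to their own sample's points instead.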
It seems really strange to me that the variables used to compute the ECDF are "binCounts1" and "binCounts2", which are just vectors of 1s and 0s. What happened to the data?
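One way to probe this is to compare the cumulative sum of the 0/1 counts against the textbook ECDF definition (the fraction of sample points less than or equal to t), evaluated at each bin edge. The sketch below does that for one made-up sample pair. The names evalPoints, cdfFromCounts and cdfDirect are hypothetical, and the check assumes distinct values.

```matlab
% Made-up samples with distinct values (illustration only).
x1 = [0.2; 0.9; 0.5];
x2 = [0.4; 0.1; 0.7];

binEdges   = [-inf; sort([x1; x2]); inf];
binCounts1 = histc(x1, binEdges, 1);        % 1 where an x1 value sits on an edge

% ECDF as built in the quoted excerpt.
cdfFromCounts = cumsum(binCounts1) ./ sum(binCounts1);
cdfFromCounts = cdfFromCounts(1:end-1);

% ECDF straight from the definition, at -inf and every pooled point.
evalPoints = binEdges(1:end-1);
cdfDirect  = arrayfun(@(t) mean(x1 <= t), evalPoints);

disp([binCounts1(1:end-1), cdfFromCounts, cdfDirect]);
```

In this example the two columns agree. The data values are not used as numbers here; they only determine where the 1s fall among the sorted pooled points.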
Can someone explain what MATLAB is doing here and why they don't use the ecdf() function?