This is the correct MATLAB implementation of the discretization method appearing in the paper "A Discretization Algorithm Based on Class-Attribute Contingency Coefficient" by Tsai et al., 2008.
If you have tried other implementations and did not get the results reported in the paper, it is because those implementations are WRONG and, in some cases, INCOMPLETE.
I tested my code with the data provided in the paper and all of my discretization ranges, CACC values and discretized data are the same as in the paper.
The file 'main.m' contains an example which uses the CACC function for discretizing some data used in the paper.
If you find any bugs in my code please report them so that I can fix them.
Bug #1 squashed! Thanks to Rahul for his comments about Line #156
Well done for spotting that Guangdi Li's implementation of the CACC discretization algorithm was wrong. As Rahul pointed out, discretizing large datasets will take a lot of time with your code. This computational burden is a direct consequence of the way the method was coded. I implemented the method in MATLAB and achieved, on some test examples, a substantial speed-up (about 45 times). The speed can be further improved. At the moment I don't have time to write documentation for it, but if you agree, I can send you my files so you can update your code.
That is what I thought. It also seems that your version is O(M^2), where M is the number of distinct values, because you have nested loops when adding the inner boundaries. I'm not sure how the paper achieves O(m log m).
You need to develop a C/C++ version of the code, otherwise it will take a long time.
This works great on smaller datasets, but have you tried it on larger ones? I'm trying to discretize gene expression data, which has 1.5 million samples and 20000 unique classes.
Yeah it is -1 (n = number of cutting points - 1).
Thanks a lot for your comment, Rahul.
Not sure, but in line 181:
yprime = M*(y-1)/log(length(discscheme));
should it not be
yprime = M*(y-1)/log(length(discscheme)-1);
since you want the number of intervals?
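For reference, the quantity being discussed can be written out as below. This is only a sketch based on the line quoted above and on the definition in the paper as I recall it, so please check it against the paper itself. The symbols q_{ir} (number of samples of class i in interval r), M_{i+} and M_{+r} (the row and column totals), S (number of classes) and n (number of intervals) are taken from that recollection, and the assumption here is that `y` in the code holds the double sum and that `discscheme` stores all boundary points including both end points.

```latex
% Assumed mapping to the code: y = \sum_i \sum_r q_{ir}^2 / (M_{i+} M_{+r}),
% and n = length(discscheme) - 1 when discscheme includes both end points.
y' = \frac{M \left[ \sum_{i=1}^{S} \sum_{r=1}^{n} \frac{q_{ir}^{2}}{M_{i+}\, M_{+r}} - 1 \right]}{\log n},
\qquad
\mathrm{cacc} = \sqrt{\frac{y'}{y' + M}}
```

Under that assumption, a scheme with k boundary points defines n = k - 1 intervals, which is why the denominator would be log(length(discscheme)-1). Note also that a scheme with a single interval gives log(1) = 0, so that case would need separate handling.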
Added my webpage's link