4.7 | 3 ratings | 20 Downloads (last 30 days) | File Size: 6.27 KB | File ID: #41740

Discretization methods: Class-Attribute Contingency Coefficient (CACC - MATLAB)

by Julio Zaragoza

14 May 2013 (Updated 31 Dec 2013)

Correct Implementation of the CACC Discretization Method. http://cs.adelaide.edu.au/~jzaragoza

File Information
Description

This is the correct MATLAB implementation of the discretization method appearing in the paper "A Discretization Algorithm Based on Class-Attribute Contingency Coefficient" by Tsai et al., 2008.

If you tried other implementations and did not get the results reported in the paper, it is because those implementations are WRONG and in some cases INCOMPLETE.

I tested my code with the data provided in the paper and all of my discretization ranges, CACC values and discretized data are the same as in the paper.

The file 'main.m' contains an example which uses the CACC function for discretizing some data used in the paper.

If you find any bugs in my code, please report them so that I can fix them.

Bug #1 squashed! Thanks to Rahul for his comments about Line #156

Required Products: MATLAB
MATLAB release: MATLAB 7.14 (R2012a)
Comments and Ratings (7)
26 Jun 2014 Adrian__

Well done for spotting that Guangdi Li's implementation of the CACC discretization algorithm was wrong. As Rahul pointed out, discretizing large datasets takes a long time with your code. This computational burden is a direct consequence of the way the method was coded. I implemented the method in MATLAB and achieved (on some test examples) a substantial increase in speed (about 45 times). The speed can be further improved. At the moment I don't have time to write documentation for it, but if you agree, I can send you my files so you can update your code.

31 Dec 2013 Rahul

That is what I thought. Also, it seems like your version is O(M^2), where M is the number of distinct values, since you have nested loops when you are adding the inner boundaries. I'm not sure how the paper achieves O(m log m).

31 Dec 2013 Julio Zaragoza

You would need to develop a C/C++ version of the code; otherwise it will take a long time.

30 Dec 2013 Rahul

This works great on smaller datasets, but have you tried it on larger datasets? I'm trying to discretize gene expression data, which has 1.5 million samples and 20000 unique classes.

30 Dec 2013 Julio Zaragoza

Yeah it is -1 (n = number of cutting points - 1).
Thanks a lot for your comment, Rahul.

30 Dec 2013 Rahul

Not sure, but in line 181:
yprime = M*(y-1)/log(length(discscheme));

should it not be
yprime = M*(y-1)/log(length(discscheme)-1);

since you want the number of intervals?
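
For reference, the quantity computed on that line can be written out as below. This is a sketch of the CACC definition as it is usually stated from Tsai et al. (2008). The symbol names come from that statement, not from the code, so they should be checked against the paper.

```latex
cacc = \sqrt{\frac{y'}{y' + M}},
\qquad
y' = \frac{M\left[\left(\displaystyle\sum_{i=1}^{S}\sum_{r=1}^{n}
      \frac{q_{ir}^{2}}{M_{i+}\,M_{+r}}\right) - 1\right]}{\log n}
```

Here M is the total number of samples, S the number of classes, n the number of intervals, q_{ir} the number of samples of class i in interval r, M_{i+} the number of samples of class i, and M_{+r} the number of samples in interval r. Assuming `discscheme` holds the cutting points including both end points, k cutting points define n = k - 1 intervals, which is why the denominator should be log(length(discscheme)-1).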

28 Jun 2013 Will  
Updates
15 May 2013

Improved description

16 May 2013

Added my webpage's link

31 Dec 2013

Improved code
