Does the function ClassificationTree.fit automatically prune?

Niklas Axelsson on 30 Jun 2012
Dear All,
I am currently trying to construct a classification tree for a variable Y using several explanatory variables X. I want to use CART, so I am using the function ClassificationTree.fit(X,Y) in MATLAB.
My variable Y has two categories, 's' and 'n', where 'n' is very rare: only ~5% of the data belongs to this class, so the large majority of the Ys are of class 's'.
When constructing the tree, I get about 8-10 levels, and the terminal nodes contain very few observations. Let the grown tree be denoted tree. If I then do the following: [~,~,~,bestLevel] = cvLoss(tree,'subtrees','all');
I get that bestLevel is the root (!), meaning every future predicted value would be of just one class... Could it be that my predictors in X are bad, or am I doing something very wrong here?
I was also wondering: when constructing the initial tree, does the function ClassificationTree.fit() automatically prune the tree to an "optimal size" before returning it, or does it grow as big a tree as possible and leave it to the user to prune afterwards?
  1 Comment
eli seri on 15 Nov 2017
Do you know which pruning algorithm is being used (Cost-Complexity Pruning, Rule Post-Pruning, Pessimistic Error Pruning, Error-Based Pruning, ...)?


Accepted Answer

Ilya on 30 Jun 2012
I described strategies for learning on imbalanced data in this post:
http://www.mathworks.com/matlabcentral/answers/11549-leraning-classification-with-most-training-samples-in-one-category
The easiest thing to do is set 'prior' to 'uniform'.
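For illustration, here is a minimal sketch of the 'prior' suggestion. It assumes a release where fitctree is available (R2014a or later); in the 2012 release discussed in this thread the equivalent call would be ClassificationTree.fit(X,Y,'prior','uniform'). The data below is synthetic and purely hypothetical, built only to mimic a ~5% rare class.

```matlab
% Hypothetical imbalanced data: roughly 5% of the labels are 'n'.
rng(1);                                % make the example reproducible
N = 2000;
X = randn(N, 3);
isRare = rand(N, 1) < 0.05;
X(isRare, 1) = X(isRare, 1) + 2;       % shift the rare class so it is learnable
Y = repmat({'s'}, N, 1);
Y(isRare) = {'n'};

% Default: class priors are estimated from the class frequencies in Y.
treeEmp = fitctree(X, Y);

% Uniform priors: both classes are treated as equally likely a priori,
% which keeps the rare class from being swamped by the majority class.
treeUni = fitctree(X, Y, 'Prior', 'uniform');

disp(treeEmp.ClassNames');             % class order used by the Prior property
disp(treeEmp.Prior);                   % close to the observed class frequencies
disp(treeUni.Prior);                   % equal prior for each class
```

With uniform priors, misclassifying the rare class costs the tree as much as misclassifying the common one, so the splits that isolate 'n' are more likely to survive cross-validated pruning.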
The optimal pruning level could be equal to the largest pruning level (that is, the tree pruned all the way down to the root) for some data. This is not necessarily an indication that something went wrong.
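To see why the root can come out as the best level, it can help to look at the whole cross-validated loss curve rather than only bestLevel. The sketch below is an assumption-laden illustration, not a diagnosis of the original data: it uses fitctree (R2014a or later) and hypothetical synthetic data. By default cvLoss uses 'TreeSize','se', which picks the smallest tree whose loss is within one standard error of the minimum; 'TreeSize','min' picks the tree with the minimum loss instead.

```matlab
% Hypothetical imbalanced data: roughly 5% of the labels are 'n'.
rng(1);
N = 2000;
X = randn(N, 3);
isRare = rand(N, 1) < 0.05;
X(isRare, 1) = X(isRare, 1) + 2;
Y = repmat({'s'}, N, 1);
Y(isRare) = {'n'};

tree = fitctree(X, Y);

% One cross-validated loss per pruning level. Level 0 is the full tree;
% the last level is the tree pruned down to the root.
[E, SE, Nleaf, bestLevel] = cvLoss(tree, 'Subtrees', 'all');
levels = (0:numel(E) - 1)';
disp([levels, Nleaf(:), E(:), SE(:)]);   % columns: level, leaves, loss, std. error
fprintf('Best level (one-SE rule): %d of %d\n', bestLevel, max(tree.PruneList));

% Same search, but choosing the level with the minimum loss.
[~, ~, ~, bestLevelMin] = cvLoss(tree, 'Subtrees', 'all', 'TreeSize', 'min');
fprintf('Best level (minimum loss): %d\n', bestLevelMin);
```

If the loss curve is nearly flat across levels, the one-SE rule will tend to favor very small trees, including the root.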
Take a look at the 'MergeLeaves' and 'Prune' parameters in the doc for ClassificationTree.fit. The doc for 'Prune' says that ClassificationTree computes the optimal sequence of pruned subtrees. The tree is not pruned; just the optimal sequence is computed. The doc for 'MergeLeaves' says that ClassificationTree merges leaves that originate from the same parent node and that give a sum of risk values greater than or equal to the risk associated with the parent node. That is, ClassificationTree applies a minimal amount of pruning, just for the leaves. If the tree prunes by classification error (default), this amounts to merging sibling leaves that predict the same most popular class.
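The distinction between computing the pruning sequence and actually pruning can be sketched as follows. This is an illustration under assumptions: it uses fitctree and the NumNodes property from newer releases (the 2012 equivalents are ClassificationTree.fit and a differently named node-count property), and the data is hypothetical.

```matlab
% Hypothetical imbalanced data: roughly 5% of the labels are 'n'.
rng(1);
N = 2000;
X = randn(N, 3);
isRare = rand(N, 1) < 0.05;
X(isRare, 1) = X(isRare, 1) + 2;
Y = repmat({'s'}, N, 1);
Y(isRare) = {'n'};

% Defaults: 'MergeLeaves','on' and 'Prune','on'. The returned tree is
% still the large tree; only the pruning sequence (PruneList) is stored.
tree = fitctree(X, Y);

% Without leaf merging, the tree can only be the same size or larger.
treeNoMerge = fitctree(X, Y, 'MergeLeaves', 'off');
fprintf('Nodes with merging: %d, without: %d\n', ...
    tree.NumNodes, treeNoMerge.NumNodes);

% Pruning is a separate, explicit step chosen by the user.
[~, ~, ~, bestLevel] = cvLoss(tree, 'Subtrees', 'all');
pruned = prune(tree, 'Level', bestLevel);
fprintf('Nodes after pruning to level %d: %d\n', bestLevel, pruned.NumNodes);

% Pruning to the largest level leaves only the root node.
rootOnly = prune(tree, 'Level', max(tree.PruneList));
```

The object returned by the fitting function keeps every level available, so different levels can be tried with prune without refitting.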
  5 Comments
Andrew on 3 Oct 2014
Is this really true? It seems to me the "MergeLeaves" functionality is not doing the same thing as Pruning. Nor does specifying pruning as "on" actually prune the tree: it merely computes the PruneList. Somewhat telling is that your proposed solution does not allow the user to set the prune level in any way. I tried this suggestion and did not get significantly smaller trees using it.
So I have not yet found a way to actually get a bag of pruned trees, either at creation time or after the fact by calling "prune". Your comment about this being unnecessary to improve the ensemble accuracy notwithstanding, perhaps a reason one would want this functionality (as do I) is to create a smaller ensemble with similar performance to the original. Note that I am NOT talking about an ensemble with fewer weak learners in this case, but one in which the individual weak learners' sizes can be pruned, not just limited via MinLeaf/MinParent, which is a different approach.
So it would be nice if the prune method were supported, both on an ensemble and on individual CompactClassificationTrees. (One can produce the latter with a PruneList, using 'prune' set to 'on' as you suggested, so why not provide the capability to actually do the pruning?)


More Answers (0)
