MATLAB Answers

Feature Selection in TreeBagger

Hello MathWorks community,
I'm currently working with the TreeBagger class to generate some classification tree ensembles. Now I would like to know how it decides which features are used for splitting the data. For example, if I create an ensemble of 5000 tree stumps, use it to classify a dataset with two features (e.g. VRQL-Value and maximum frequency), and then check which feature was selected for the split in every single tree like this:
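% Collect the predictor used for the root split of every tree
numTrees = numel(Random_Forest_Model.Trees);
cellArray = cell(1, numTrees);
for y = 1:numTrees
    cellArray{y} = Random_Forest_Model.Trees{y}.CutPredictor{1};
end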
In some cases only one feature is selected for all 5000 trees and the other feature is never selected (i.e. cellArray looks like this: {'x2', 'x2', 'x2', ..., 'x2'}). This can also happen with more than two features: only one feature is selected and the others are ignored.
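The per-tree selections can be summarized as counts per feature. This is a minimal sketch, assuming cellArray is the cell array of predictor names built by the loop above; the toy value assigned here is only a placeholder so the snippet runs on its own.

```matlab
% Placeholder data; in practice use the cellArray filled by the loop above
cellArray = {'x2', 'x2', 'x1', 'x2'};

% Map every entry to the index of its unique name, then count per index
[names, ~, idx] = unique(cellArray);
counts = accumarray(idx(:), 1);

for k = 1:numel(names)
    fprintf('%s: %d of %d trees\n', names{k}, counts(k), numel(cellArray));
end
```

A tree whose root node was never split reports an empty predictor name, so an empty entry in names would indicate trees without any split.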
Some details about the dataset that may be important:
-One feature takes values from 1 to 100, the other from about 200 to 1200
-The classes are imbalanced (class 1: 52 entries, class 2: over 300 entries)
-Only the larger class contains NaNs
-Both features contain NaNs
My question now is: how can I get TreeBagger to use all features for classification rather than only one, or, more generally, how can I achieve a more balanced selection of features?

Accepted Answer

Ahmad Obeid on 21 May 2019
The default setting in TreeBagger for the number of features to sample from the original set of features is ceil(sqrt(p)), where p is the total number of features.
Why this number specifically? I don't know...
But why is it important to take a subset of the features and not the whole set? Because if you always take the same features (say, the whole set), you will get highly correlated decision trees in every iteration, and averaging them will not cancel out their inherently high variance.
I believe that the features are sampled uniformly, which means that if you have many trees, all features should be represented approximately equally across the trees.
However, in your case the subset of features has the same size as the original feature set (ceil(sqrt(2)) = 2). Once the set of candidate features is selected, a criterion is used to decide which of them the split is based on. The criterion can be the Gini index or information gain (entropy).
So my guess is that since you always end up with the whole set of features, and the same criterion is used every time to choose among them, you always end up with the same feature, and the other one is excluded.
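The guess above can be checked on synthetic data. The following is a rough sketch, not the asker's setup: the data, the seed, and the tree count of 200 are assumptions made for illustration, and it requires the Statistics and Machine Learning Toolbox. Feature x2 is built to be much more informative than x1, and stumps are grown once with both features as candidates and once with a single randomly sampled candidate.

```matlab
rng(1);                                   % reproducible synthetic data
n = 400;
X = randn(n, 2);                          % columns are named 'x1' and 'x2'
% Hypothetical labels: mostly driven by x2, weakly by x1
Y = categorical(X(:,2) + 0.3*X(:,1) + 0.3*randn(n,1) > 0);

for k = [2 1]                             % candidates considered per split
    B = TreeBagger(200, X, Y, 'Method', 'classification', ...
        'NumPredictorsToSample', k, 'MaxNumSplits', 1);   % stumps
    % Predictor used at the root node of each tree
    rootPred = cellfun(@(t) t.CutPredictor{1}, B.Trees, ...
        'UniformOutput', false);
    fprintf('NumPredictorsToSample = %d: x1 in %d trees, x2 in %d trees\n', ...
        k, sum(strcmp(rootPred, 'x1')), sum(strcmp(rootPred, 'x2')));
end
```

With both features as candidates, the stronger feature should win the split criterion in nearly every tree; with a single sampled candidate, each tree can only split on whichever feature was drawn, so both features should appear.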
  1 Comment
Patrick Schlegel on 23 May 2019
Thank you for your answer.
I have investigated this further in the meantime, and it turns out that I have one very "mighty" feature that is selected in (almost) all cases when the random forest looks for the best feature to split the data (this is also true if I include more than two features). So it is as you guessed, but I will still try to supplement your explanation with what I have found out since then.
The 'NumPredictorsToSample' option determines how many features the random forest chooses from (see also https://de.mathworks.com/help/stats/treebagger.html first table, entry 'NumPredictorsToSample'). For each node of each tree, the best feature is chosen from that number of randomly selected features. If I have e.g. 15 features and set " 'NumPredictorsToSample', 3 ", the random forest will, as far as I understand, look at e.g. features 3, 7 and 9 for the first node of the first tree and choose the best of them to split that node. Then it might look at features 2, 15 and again 9, or any other combination of three features, split the next node, and so on.
My problem was that I had set NumPredictorsToSample too high, so everything was decided by the best feature alone. With a lower NumPredictorsToSample, however, the out-of-bag error of the forest is considerably lower (so the "best" feature alone is not enough to achieve the best classification).
I hope I got it right and explained it well enough that someone stumbling over this problem in the future will find their answer here.
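The effect of NumPredictorsToSample on the out-of-bag error can be compared directly. This is a sketch under stated assumptions: the 15-feature dataset, the label rule, the tree count, and the candidate values 1, 3 and 15 are all made up for illustration, so the resulting ranking need not match the one observed on the real data. It requires the Statistics and Machine Learning Toolbox.

```matlab
rng(1);                                   % reproducible synthetic data
n = 500;
p = 15;
X = randn(n, p);
% Hypothetical labels: feature 1 is strong, features 2 to 4 add weaker signal
Y = categorical(2*X(:,1) + X(:,2) + X(:,3) + X(:,4) + randn(n,1) > 0);

kValues = [1 3 p];                        % candidates considered per split
oobErr = zeros(size(kValues));
for i = 1:numel(kValues)
    B = TreeBagger(200, X, Y, 'Method', 'classification', ...
        'OOBPrediction', 'on', ...        % keep out-of-bag info for oobError
        'NumPredictorsToSample', kValues(i));
    % Misclassification rate of the whole ensemble on out-of-bag observations
    oobErr(i) = oobError(B, 'Mode', 'ensemble');
    fprintf('NumPredictorsToSample = %2d: OOB error = %.3f\n', ...
        kValues(i), oobErr(i));
end
```

Choosing the value with the lowest out-of-bag error is one simple way to tune this option without a separate validation set.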
