Thank you, Michael, for the clarification on the use of the predict vs. oobPredict functions. I am aware of the benefits of the treebagger in automatically splitting the data; I just wanted to be able to see the individual bits.
Your point about the histogram is well taken, but I do think histograms have some value in seeing the data distribution before and after, even if they don't provide insight into the performance of the bagger. The various methods you highlighted in your code are great for this.
I am working with real credit data, so being a visual person, I like to see what the ratings distribution looks like at the beginning/end of the process. I should think it would be curious if the classified data had a completely different distribution.
Thanks again for your demo and webinar. I'm finding them incredibly helpful.
Hi Michael,
I am trying to add lines to the code to plot histograms of ratings for three sets of observations.
I would like to see three different histograms as a result of calling the treebagger: ground truth ratings, training ratings, predicted ratings.
Would you be able to confirm that I have implemented the code correctly, and advise whether the second histogram is possible? I have added my own data file with different ratings, but I am using the same ideas.
1) Ground Truth Ratings
hist(Y)
2) Training Ratings
Can't seem to find the matrix that stores these ratings.
3) Predicted Ratings: Out-of-bag predictions made within the treebagger routine.
The documentation says that the routine automatically partitions the data into a training set and a set to be predicted.
Y_Pred = oobPredict(b);
Y_Pred_Num = ordinal(Y_Pred, [], {'AAA' 'AA+' 'AA' 'AA-' 'A+' 'A' 'A-' ...
    'BBB+' 'BBB' 'BBB-' 'BB+' 'BB' 'BB-' 'B+' 'B' 'B-' ...
    'CCC+' 'CCC' 'CCC-' 'CC' 'C' 'D'});
figure(4);
hist(Y_Pred_Num);
xlabel('Ratings');
ylabel('Out of Bag Occurrences');
title('Out of Bag Prediction Results');
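On the second histogram: bagging draws a separate bootstrap sample for every tree, which may be why there is no single matrix of training ratings to find. Here is a minimal sketch of one tree's draw in base MATLAB. It is an illustration only, not TreeBagger's internal code, and the rating list and data are made up. If your release exposes an OOBIndices property on the TreeBagger object (an assumption to check against your documentation), ~b.OOBIndices(:,t) should mark the in-bag rows for tree t.

```matlab
n = 3000;                            % number of observations, as in the thread
levels = {'AAA', 'AA', 'A', 'BBB'};  % shortened, hypothetical rating list
Ynum = randi(numel(levels), n, 1);   % made-up ratings coded 1..4

idx = randi(n, n, 1);                % bootstrap: n draws with replacement
inBag = false(n, 1);
inBag(idx) = true;                   % rows this one tree would train on
oob = ~inBag;                        % rows left out of this tree

fprintf('in-bag: %d unique rows, out-of-bag: %d rows\n', sum(inBag), sum(oob));

% Rating counts for this tree's training draw (repeated rows counted each time)
trainCounts = accumarray(Ynum(idx), 1, [numel(levels) 1]);
disp(trainCounts');                  % could be passed to bar() for a histogram
```

Each tree gets a different draw, so a "training ratings" histogram is only meaningful per tree, or for the full Y that all the draws come from.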
Thank you for the tip on curly braces. It makes sense now. It's very interesting to be able to graphically see the various trees.
On a side note, perhaps I am overlooking some features of the standard MATLAB window for viewing the decision tree leaves, but it would be extremely helpful to be able to:
a) read all of the node/leaf labels. Some of them currently get overwritten.
b) cut/paste the diagram so it could be manipulated/printed in PowerPoint.
c) write in the window to make comments.
This is a great demo! I am interested in "viewing" some of the individual decision trees generated by the treebagger function. When I look at
b.Trees(1)
ans =
[1x1 classregtree]
the fact that this is associated with the classregtree leads me to believe that I should be able to call the "view" command on this variable. The Iris Data examples show this to be the case. For example:
a = b.Trees(1)
view(a)
But I get an error statement:
??? Error using ==> view at 37
Invalid input arguments
Could you please advise if there is a way to plot individual decision trees as a result of the treebagger function?
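The b.Trees(1) output above shows a 1x1 cell wrapping a classregtree, which suggests b.Trees is a cell array. Parentheses on a cell array return another cell, so view most likely received a cell rather than a tree. A small sketch of the difference, using a placeholder cell array in place of b.Trees (the contents are hypothetical):

```matlab
% Placeholder cell array standing in for b.Trees (contents are hypothetical).
trees = {magic(3), magic(4)};

a = trees(1);     % parentheses: a 1x1 cell wrapping the first element
t = trees{1};     % curly braces: the first element itself

disp(class(a))    % cell
disp(class(t))    % double

% By the same rule, with a TreeBagger object b:
%   view(b.Trees{1})   % passes the tree object, not a cell, to view
```

With curly braces the classregtree itself is passed, so the tree viewer should open as it does in the Iris examples.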
I can't import the database file. I think my MATLAB can only import through JDBC. I use Mac OS, by the way.
I would appreciate any help. I am new at this.
Michelle-
These results are entirely consistent with how classification trees work. Simply rescaling each of the inputs by multiplying them with different coefficients should have no effect on the tree.
For exactly why this is, I'd recommend Breiman's book (which is referenced in the doc), but the short answer is that trees sort each predictor's observations and try a candidate split within each of the gaps. The tree will then select the split that gives the "best" splitting criterion (and that's an entirely different discussion). Scaling the predictor only serves to scale this process, but it doesn't fundamentally change the results.
As an example: suppose we have a simple set of observations where the predictor has been measured at 1, 2, 4, and 10. The tree will try splits at 1.5, 3, and 7. Let's say that the "best" split is at 7.
Now we go ahead and rescale this input: multiply it by 100 or some other coefficient. Now the tree tries splits at 150, 300, and 700, and it will still select the split at 700. Rescaling doesn't change anything.
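The example above can be checked with a few lines of base MATLAB. This is only a sketch of the idea, not the toolbox's actual split search: the class labels are made up so that the best split falls between 4 and 10, and Gini impurity is assumed as the splitting criterion.

```matlab
% Predictor values from the example above; labels are hypothetical.
x = [1; 2; 4; 10];
y = [0; 0; 0; 1];

coeffs = [1 100 0.42];              % positive rescaling factors to try
bestCuts = zeros(size(coeffs));
bestLeft = cell(size(coeffs));

for k = 1:numel(coeffs)
    xk = coeffs(k) * x;
    xs = sort(xk);
    cuts = (xs(1:end-1) + xs(2:end)) / 2;   % midpoint of each gap
    score = zeros(size(cuts));
    for j = 1:numel(cuts)
        left = xk < cuts(j);
        pL = mean(y(left));                 % class-1 fraction on each side
        pR = mean(y(~left));
        % size-weighted Gini impurity of the two child nodes
        score(j) = mean(left) * 2*pL*(1-pL) + mean(~left) * 2*pR*(1-pR);
    end
    [~, jBest] = min(score);
    bestCuts(k) = cuts(jBest);
    bestLeft{k} = find(xk < cuts(jBest))';
    fprintf('c = %6.2f  best cut = %7.2f  left rows = %s\n', ...
            coeffs(k), bestCuts(k), mat2str(bestLeft{k}));
end
```

The cut value moves with the coefficient (7, 700, 2.94), but the same rows land on each side every time, so the tree and its out-of-bag error are unchanged. A coefficient of exactly zero is a different case: the column becomes constant and can no longer be split on at all, though any nonzero value, however small, keeps the ordering intact.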
Now, if we were to cleverly create _new_ predictors out of a well-chosen combination (linear or otherwise) of our existing predictors, then that certainly would change the tree's performance. For instance, make a 6th predictor in your X from Altman's coefficients times your original X; then you might get some interesting results.
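One way the 6th predictor might be built is sketched below. The X here is random stand-in data so the snippet runs on its own; substitute the real 3000-by-5 matrix. Treating the coefficients as a weighted sum across the five columns is an assumption about what "Altman's coefficients times your original X" means.

```matlab
% Hypothetical stand-in data: 3000 rows, 5 accounting ratios.
X = rand(3000, 5);
coeff = [0.717 0.847 3.107 0.42 0.998];   % Altman's coefficients from the thread

z = X * coeff';     % one weighted sum per row (3000-by-1)
X6 = [X, z];        % original 5 predictors plus the combined score

disp(size(X6))      % 3000 6

% X6 would then be passed to TreeBagger in place of X, with the same
% options used before.
```

Unlike rescaling a single column, the combined column orders the observations differently from any of the original five, so the trees gain candidate splits they did not have before.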
Hi Michael,
I've noticed a very curious result of the treebagger, and was wondering if you have had experience with this.
I send a matrix X which has 5 columns of a variety of accounting data. I also send Y, which is a vector of credit ratings. I have about 3000 rows of data.
I understand that academic researchers in bankruptcy prediction have done extensive work to determine optimal coefficients for predicting bankruptcy. I thought it would be interesting to see what impact the coefficients have on the bagging results, so I created a vector, coeff, and multiplied each row of X by the parameters in coeff.
For example, Altman's coefficients are {0.717 0.847 3.107 0.42 0.998}.
Curiously, varying the coefficients has NO effect on the oobErrors. I've run exhaustive loops to vary all of the coefficients to track this down.
It seems like in the case of the limit where the coefficient goes to zero, there should be an effect.
Thanks for any insights.