An alternative to the MATLAB TreeBagger class, written in C++ and MATLAB.
Creates an ensemble of CART trees (a Random Forest). The code includes an implementation of CART trees that is
considerably faster to train than MATLAB's classregtree.
Could anyone give an example of how to use this function, i.e. the input parameters? I built it successfully, so if anyone could please advise. Thanks.
These are the errors and warnings I get when compiling 'mx_compile_cartree':
mx_compile_cartree
GBCC.cpp
GBCC.cpp(24) : error C2563: mismatch in formal parameter list
GBCC.cpp(24) : error C2568: '=' : unable to resolve function overload
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\math.h(567): could be 'long double log(long double)'
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\math.h(519): or 'float log(float)'
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\math.h(121): or 'double log(double)'
GBCC.cpp(24) : error C2143: syntax error : missing ';' before 'constant'
GBCC.cpp(24) : error C2064: term does not evaluate to a function taking 1 arguments
GBCC.cpp(25) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(43) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(108) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(115) : error C3861: 'log2': identifier not found
GBCC.cpp(115) : error C3861: 'log2': identifier not found
GBCC.cpp(127) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(151) : error C3861: 'log2': identifier not found
GBCC.cpp(151) : error C3861: 'log2': identifier not found
GBCC.cpp(152) : error C3861: 'log2': identifier not found
GBCC.cpp(152) : error C3861: 'log2': identifier not found
C:\PROGRA~1\MATLAB\R2012B\BIN\MEX.PL: Error: Compile of 'GBCC.cpp' failed.
Error using mex (line 206)
Unable to complete successfully.
Error in mx_compile_cartree (line 8)
mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp
I am using MATLAB R2012b and Visual C++ 2008.
These are the errors and warnings I get when compiling 'mx_compile_cartree':
lcc preprocessor warning: .\node_cuts.h:8 best_cut_node.cpp:2 No newline at end of file
lcc preprocessor warning: best_cut_node.cpp:60 No newline at end of file
Error best_cut_node.cpp: .\node_cuts.h: 2 redeclaration of `GBCC' previously declared at .\node_cuts.h 1
Error best_cut_node.cpp: .\node_cuts.h: 5 redeclaration of `GBCR' previously declared at .\node_cuts.h 4
Error best_cut_node.cpp: .\node_cuts.h: 8 redeclaration of `GBCP' previously declared at .\node_cuts.h 7
Error best_cut_node.cpp: 35 type error in argument 5 to `GBCC'; found `int' expected `pointer to double'
Error best_cut_node.cpp: 35 type error in argument 7 to `GBCC'; found `pointer to double' expected `int'
Error best_cut_node.cpp: 35 insufficient number of arguments to `GBCC'
Error best_cut_node.cpp: 38 type error in argument 5 to `GBCP'; found `int' expected `pointer to double'
Error best_cut_node.cpp: 38 type error in argument 7 to `GBCP'; found `pointer to double' expected `int'
Error best_cut_node.cpp: 38 insufficient number of arguments to `GBCP'
Error best_cut_node.cpp: 41 type error in argument 5 to `GBCR'; found `int' expected `pointer to double'
Error best_cut_node.cpp: 41 type error in argument 6 to `GBCR'; found `pointer to double' expected `int'
Error best_cut_node.cpp: 41 insufficient number of arguments to `GBCR'
Error best_cut_node.cpp: 59 undeclared identifier `delete'
Error best_cut_node.cpp: 59 illegal expression
Error best_cut_node.cpp: 59 syntax error; found `method' expecting `]'
Error best_cut_node.cpp: 59 type error: pointer expected
Warning best_cut_node.cpp: 59 Statement has no effect
Error best_cut_node.cpp: 59 syntax error; found `method' expecting `;'
Warning best_cut_node.cpp: 59 Statement has no effect
Warning best_cut_node.cpp: 59 possible usage of delete before definition
17 errors, 5 warnings
C:\PROGRA~1\MATLAB\R2012B\BIN\MEX.PL: Error: Compile of 'best_cut_node.cpp' failed.
Error using mex (line 206)
Unable to complete successfully.
Hi Leo, thanks for your help, but now I have another error!
??? Undefined function or method 'best_cut_node' for input arguments of type 'char'.
Error in ==> cartree at 84
[bestCutVar bestCutValue] = ...
Error in ==> Stochastic_Bosque at 48
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...
It seems that mx_compile_cartree.m failed to compile. Specifically, this command failed: mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp. Why is that? Thanks again.
leo
I am pasting the code so you can give me advice. Thank you.
load diabetes
Data = diabetes.x;
Labels = diabetes.y;
Random_Forest = Stochastic_Bosque(Data,Labels);
I get this error:
??? Undefined function or method 'cartree' for input arguments of type 'double'.
Error in ==> Stochastic_Bosque at 48
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...
Could you kindly tell me why? Thanks very much.
Hi Leo, I am afraid this package cannot handle categorical features. How could I update the code to handle datasets with categorical features?
Kindly guide me.
Thanks.
Yes, the elements of the vector "nodeCutVar" are feature indexes. You can retrieve the tree structure from the field RETree.childnode. For a node i, the indexes of the left and right children are RETree.childnode(i) and RETree.childnode(i) + 1, respectively.
Hi Leo,
I am new to MATLAB and would like to explore random forests more, but I don't understand most of it.
function Random_Forest = Stochastic_Bosque(Data,Labels,varargin)
Data refers to my data.
What are Labels and varargin for?
Very good software, thank you for your effort!
I was wondering whether sampling with replacement is really implemented in this method, as the documentation says.
In lines 43-44 you do:
TDindx = round(numel(Labels)*rand(n,1)+.5);
(Note: why not use the 'randi' function?)
This gives you 'n' indexes, and THEN you call:
TDindx = unique(TDindx);
removing all the duplicates!
Is that correct?
Hi,
why do you call
TDindx = unique(TDindx);
when creating the forest?
I was under the impression that bagging would improve the generalization ability of the model, but through the call to unique we are discarding all repeated instances. Why did you choose not to use bagging, but rather subsets of the original data?
Impressive.
Hi all,
Some people above said that the package fails to compile on Windows. I found that it is probably because log2() in GBCC.cpp is not a standard C function (MSVC 2008 does not provide it). A feasible solution is to replace log2(n) with log((double)n)/log(2.0).
Thanks again for the code sharing.
Sorry for the late reply. Unfortunately I don't have a Windows machine to try this out. I am surprised it won't compile under Windows, though; on Ubuntu it compiles with gcc.
Anyway, if you paste the errors here, maybe I can help out a bit more.
Hi Leo, thanks for sharing this code. I am having difficulty mexing the cpp files. Do I need a special compiler? When I get to this line:
mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp
I receive many errors in node_cuts.h, and it finally says: Compile of 'best_cut_node.cpp' failed.
I'm using R2007b on win32.
Could you help?
Thanks,
kourosh
Hi - the zip file I can download now has the same creation date as the old one; it still requires all the changes I made in order to run, and performs the same as well. Please advise.
Hi - will you be posting the new code? I would like to try it out on regression. Right now, when I use the old code on the classic Boston Housing data set, I get all NaNs. I would like to see whether this problem disappears in the code you fixed.
You would have to show me exactly what line 99 is in your code. In my code I do not get any errors. I suspect you have altered the code and inadvertently added or omitted a parenthesis.
Hi Leo,
I am following yours and Mohammad's threads, and I am getting the following errors after making the amendments you indicated.
>> Random_Forest = Stochastic_Bosque(data,label);
??? Error: File: cartree.m Line: 99 Column: 27
Expression or statement is incorrect--possibly unbalanced (, {, or [.
Error in ==> Stochastic_Bosque at 45
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...
Thanks a lot for all your help. I found the bug that was causing the difference in performance (accuracy-wise): a part of the code (erroneously) implicitly assumed that feature values were distinct.
It is now fixed, and the results on the Glass dataset are equivalent to the results you quote for the google code.
Regarding speed, the code seems to run considerably faster on my PC, but nowhere near as fast as the google code, which is to be expected, as the google code is written almost entirely in C/C++.
I have also removed the dependencies from the statistics toolbox using your suggestions (thanks!).
The dependence on internal.stats.getargs has also been removed.
Hi - I have followed your suggestion to compare the results of your code versus the "google code". This google code is at
http://code.google.com/p/randomforest-matlab/
This is a Matlab (and Standalone application) port for the excellent machine learning algorithm `Random Forests' - By Leo Breiman et al. from the R-source by Andy Liaw et al. http://cran.r-project.org/web/packages/randomForest/index.html ( Fortran original by Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener.) Current code version is based on 4.5-29 from source of randomForest package by Abhishek Jaiantilal.
Against the "glass" data set here are the statistics for 10 and 100 trees, withholding the 35% of the data as you had done.
For RandomBosque, the results were:
Elapsed time for 1000 runs: 1648.743 seconds
Average number correct with 35% samples held out: 0.636 for 10 trees 0.684 for 100 trees
Standard deviation correct with 35% samples held out: 0.076 for 10 trees 0.070 for 100 trees
For class_RFtrain and classRFpredict, the results were:
Elapsed time for 1000 runs: 88.021 seconds
Average number correct with 35% samples held out: 0.722 for 10 trees 0.758 for 100 trees
Standard deviation correct with 35% samples held out: 0.051 for 10 trees 0.048 for 100 trees
I use a MACAIR with MATLAB 2011a and OS 10.6.7.
I was surprised at the runtime differences and the differences in the statistics.
My calls to Randombosque look as follows:
tic;
correct = zeros(1000,2);
for i = 1:length(correct);
M = length(Labels);
m = round(.65*M);
intraining = randperm(M);
intraining = sort(intraining(1:m));
notintraining = setdiff([1:M],intraining);
if rem(i,25) == 1
fprintf('Iteration: %3.0f\n',i);
end
end
toc;
fprintf('Elapsed time for %3.0f runs: %5.3f seconds\n',length(correct),toc)
fprintf('Average number correct with 35%% samples held out: %5.3f for 10 trees %5.3f for 100 trees \n',mean(correct));
fprintf('Standard deviation correct with 35%% samples held out: %5.3f for 10 trees %5.3f for 100 trees\n',std(correct));
Unfortunately it is quite hard to figure out what the problem is without more specific feedback. On top of that, the getargs function is not my code, so I am not that familiar with how it works (or how it can fail).
Perhaps you could remove that line of code altogether and hard-code the parameters. For example, in the case of the cartree function, make it:
Unfortunately I don't have the google code installed to compare, but I ran comparisons with MATLAB's TreeBagger (glass data, 140/74 split, 10 trees) and got similar results for the two methods (my code seems to give better results, though I am not sure why).
Thanks for all the feedback. You make some very good suggestions, which I will try to incorporate soon, especially concerning the randsample dependency (which hadn't crossed my mind).
For the datasets you are testing on: I tested on Glass with 10 trees and got ~72% accuracy on a 140/74 split. Could you report which splits and numbers of trees you are using, and what accuracies you get with this code and the "google" code?
I have been running my modified code and comparing the results with the version on
http://code.google.com/p/randomforest-matlab/
The results of the present package that I modified as above, against the google code do not agree well. I am using classical datasets such as glass (classification) and boston housing (regression), and the google code has a much higher degree of accuracy. I would be grateful if anyone could share their experience on using these classical data sets to see whether they see the same result in their implementations. The boston data set is at http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html and the glass data set is at http://archive.ics.uci.edu/ml/datasets/Glass+Identification
This package was extremely useful. I should say to all that I am just a new student in this field and my comments reflect my interest in learning more, having a toolbox that is accessible, and one that actually works without days of effort. I must say that I have managed to get the other RandomForest implementations (Google code etc...) up and running but only with considerable difficulty owing to mex compilation issues. I did not have this particular difficulty with this package and as a result was delighted.
This package could be improved if it were accompanied by a demonstration file, some instruction on how to build the package and link the paths, and had eliminated the dependency on the randsample statistic toolbox routine, which some users do not have.
After modification of a few lines, the calls to randsample can be replaced, I believe. For instance the call in Random_Bosque:
There may be limitations to using these substitutions when M is large, but I was very pleased with the speed of the entire package.
The author's suggestions to replace the internal.stats.getargs with calls to getargs were entirely successful.
On a MAC, the cpp programs mex'd without difficulty. I found it expedient to simply move the mx_eval_cartree.mexmaci64 and the best_cut_node.mexmaci64 and the weighted_hist.m files to the folder containing Stochastic_Bosque.m than to adjust paths.
I used the irisdata as a demonstration. It is short and uncomplicated. It is available from http://en.wikipedia.org/wiki/Iris_flower_data_set
Just copy the data out and place it into an mfile. I put the data into a matrix called Data. To try out the Stochastic Bosque routines, I then wrote
As I am a beginner, and was operating without a license on the author's source code! I thought it useful to subsample the iris data set so that I would have a test set against which to examine the performance of the Random_Forest. While this was unnecessary from a theoretical standpoint, I thought it was worthwhile from the standpoint of checking that my modifications to the source were not ruinous.
and I was pleasantly surprised to see that the correctly-classified measure compared favorably with the original.
I should have also liked to see some proximity measures and permutation importance measures present, I speculate that perhaps these were eliminated to produce a package that ran swiftly. At any rate, I shall try to make these myself, because it seems to me that I can write a wrapper and call the Stochastic_Bosque to make my own calculations. If the author would care to offer any further suggestions or caveats, I would like to hear them because I think that his work is useful and can be extended.
The line 45 you refer to has to do with the subsampling of data samples not the features. Each tree is trained using a different subset of the training data.