File Exchange


Random Forest

Version 1.7 (16.1 KB)

Creates an ensemble of CART trees, similar to MATLAB's TreeBagger class.

4.52 (30 ratings)

107 Downloads


An alternative to MATLAB's TreeBagger class, written in C++ and MATLAB.

Creates an ensemble of CART trees (a random forest). The code includes an implementation of CART trees that is considerably faster to train than MATLAB's classregtree.

Compiled and tested on 64-bit Ubuntu.

Comments and Ratings (94)

@Yogesh and any other users experiencing the error "Undefined function or variable 'best_cut_node'."
Make sure you compile the code first with "Stochastic_Bosque/cartree/mx_files/mx_compile_cartree".

Appreciate the choice of names in your code (bosque in stochastic bosque)

@Austin Jordan
node_cut_var: for a particular node, the feature (variable) on which the node splits.
node_cut_value: the value of that feature at which the split is made.
childnode: the indices of the child nodes of the given parent node. In case you need documentation for cartree.m, you can reach me at sunit140995@gmail.com.

Hi Leo , suppose data is of the form
Data Labels
1 1 1
1 2 1
1 3 2
1 4 2
then shouldn't bcvar and bcval have been 2 and 2.5, respectively? I feel you have not incorporated this fact in your program.

Hello, when I ran the Random Forest function file with my data set, I got this error:
Undefined function or variable 'best_cut_node'.

Error in cartree (line 84)
[bestCutVar bestCutValue] = ...

Error in Stochastic_Bosque (line 48)
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...
Please help with this error.

Austin Jordan

For those trying to visualize this data, if forest = Stochastic_Bosque(data,labels) and children = forest(1).childnode...

tree = forest(1); % the first tree in the forest

labels = tree.nodelabel;
children = tree.childnode;
nodes = zeros(1,size(children,1));

for i = 1:size(children,1)
    if children(i) ~= 0
        nodes(children(i)) = i;
        nodes(children(i)+1) = i;
    end
end

treeplot(nodes)
hold on

[x,y] = treelayout(nodes);

for i = 1:length(x)
    if tree.nodelabel(i) ~= 0
        text(x(i),y(i),num2str(labels(i)),'HorizontalAlignment','center')
    end
end

hold off

Austin Jordan

Can someone explain what the variables are? nodeCutVar, nodeCutVal, childnode

How might I visualize this data?

Chris Lu

gjwolf


try

Justin Igwe

Hello, I just downloaded this code. Can someone please assist in guiding me on how to use it? I cannot find the description.

@Akshay
This should help:
http://stackoverflow.com/questions/758001/log2-not-found-in-my-math-h
(insert a log2 function in GBCC.cpp)

Hi, when I tried compiling the required mex files, the following error came up. Can someone help me address this issue?

mx_compile_cartree
Building with 'Microsoft Windows SDK 7.1 (C++)'.

Error using mex

GBCC.cpp
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(24) : error C3861: 'log2': identifier not found
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(25) : warning C4244: '=' : conversion from 'double'
to 'int', possible loss of data
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(43) : warning C4244: '=' : conversion from 'double'
to 'int', possible loss of data
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(108) : warning C4244: '=' : conversion from 'double'
to 'int', possible loss of data
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(115) : error C3861: 'log2': identifier not found
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(115) : error C3861: 'log2': identifier not found
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(127) : warning C4244: '=' : conversion from 'double'
to 'int', possible loss of data
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(151) : error C3861: 'log2': identifier not found
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(151) : error C3861: 'log2': identifier not found
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(152) : error C3861: 'log2': identifier not found
C:\Users\Preejith\Downloads\dtw\Stochastic_Bosque\cartree\mx_files\GBCC.cpp(152) : error C3861: 'log2': identifier not found

Error in mx_compile_cartree (line 8)
mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp

Hello, I tried the code, but it gives me an error from the varargin in Random Forest = (..,..,varargin), as follows:
Attempt to execute SCRIPT varargin as a function:
C:\Program Files\MATLAB\R2016a\toolbox\matlab\lang\varargin.m
When I remove the varargin argument, it gives the same results as outputs.
Please advise.

Lioo lys

EEElearner

I have successfully created the mex files. What next? How can I make use of the other programs shared in this file?

mk


What algorithm is TreeBagger based on? And is random forest the algorithm behind TreeBagger?

mk


Alean


Undefined variable "internal" or class "internal.stats.getargs".

Error in eval_Stochastic_Bosque (line 12)
[eid,emsg,oobe_flag] = internal.stats.getargs(okargs,defaults,varargin{:});

anne_frank

I am new to MATLAB and I'm not able to run this. Could someone please help me with the function call?

ashok kn

mx_compile_cartree
Error: Could not detect a compiler on local system
which can compile the specified input file(s)

Works great, although I would like to have greater control over the creation of the decision trees. For example, I would like to preset the maximum depth.

I also ran into trouble with internal.stats.getargs and had to change it to just "getargs".

biao qi

biao qi

??? Error: File: cartree.m Line: 50 Column: 26
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> Stochastic_Bosque at 48
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...
MATLAB reports the above error; can you help me?

KALYAN KUMAR

I am getting this error; can you please sort out this issue?

Undefined variable "internal" or class "internal.stats.getargs".

Error in eval_Stochastic_Bosque (line 12)
[eid,emsg,oobe_flag] = internal.stats.getargs(okargs,defaults,varargin{:});

Hi everybody,
I would like to thank Leo for this code.
Does anyone know how to use this code for one-class RF?

Thanks

Hi Leo
Thank you very much for your great .m code.
I have a question concerning the 'weights' input parameter of Stochastic_Bosque.m.
Suppose my 'Data' input parameter consists of three examples of label '1' and three examples of label '2'.
If I want
- the weight of label '1' instances to be 6, and
- the weight of label '2' instances to be 3,
should weights = [6;6;6;3;3;3]?

Data = [....;....;....;....;....;....]
labels = [1;1;1;2;2;2];
so > weights = [6;6;6;3;3;3]?

Regards
Olivier

vimal


Hi Leo,
when I use your code, I get the following error:
Undefined function 'best_cut_node' for input arguments of type 'char'.

Error in cartree (line 84)
[bestCutVar bestCutValue] = ...

Error in Stochastic_Bosque (line 53)
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ....

Please tell me how to debug this.

Hira Imtiaz

Hi Leo,
I am a student and have to implement the random forest algorithm on ECG signal feature vectors. I am extracting features in the form of different peaks in the signal by this method: http://www.codeproject.com/Articles/309938/ECG-Feature-Extraction-with-Wavelet-Transform-and
Can you please tell me how I can apply your random forest code to the above results?
This algorithm is very new to me; could you please help?
Regards

Zhiming

Hi Leo,
My problem has already been solved: the code cannot run with data of type 'uint8'.
Thanks.

Zhiming

Hi Leo,
I compiled successfully with Visual C++, but I cannot run the code. Some of the error information is listed below:
----------------------------------------------
Segmentation violation detected at Wed Jan 07 09:43:26 2015
----------------------------------------------
Fault Count: 1

Register State:
EAX = 00007d23 EBX = 00001736
ECX = 0bc71998 EDX = 0bc60000
ESI = ffffdccd EDI = 0bc7d348
EBP = 00c2c0cc ESP = 00c2c0c0
EIP = 0ba623f8 FLG = 00010202

Stack Trace:
[0] best_cut_node.mexw32:0x0ba623f8(0x0bc71998, 0x0bc60048, 5942, 5999)
[1] best_cut_node.mexw32:0x0ba6278c(6000, 28, 0x45e97450, 0x45f99450)
[2] best_cut_node.mexw32:0x0ba61182(2, 0x04dacdd0, 6, 1)
......

Could you help me? Thanks!

joy barbosa

Thanks, Leo, for sharing this code!

Following the solutions provided in the comments here, I was able to run your code. But could you help me figure out how to obtain probability estimates using this code?

Hussein

Could anyone give an example of how to use this function (I mean the input parameters)? I built it successfully, so if anyone could please advise. Thanks.

fairy


Fatemeh Saki

Hi everybody,
Does anyone know how I can visualize the built tree after training?
Thanks

Fatemeh

Gary Tsui

Try this: in GBCC.cpp, at line 3, add

#define log2(x) ( (1.0/log(2.0)) * log( (double)(x) ) ) // use double constants

That's what I did; can anyone else help verify?

Fatemeh Saki

Hi Leo,
I cannot run the code. An error occurs while compiling the mx_compile file.
Would you please help me with that?

fairy


I have found the reason: the Data matrix is integer-typed, not double.

fairy


Thanks Leo!

With your help, I have compiled successfully!

By the way, line 24 should be changed to saved_logs[j] = log((double)(j+1))/log(2.0);
.....

Thanks Leo again.

Leo


Hi fairy,

lcc is not a C++ compiler. Using the Visual Studio compiler, I think the following should do the trick.

in GBCC.cpp change line 24 to

saved_logs[j] = log(j+1)/log(2);

line 115 to

if (diff_labels[nl]>0) bh-=diff_labels[nl]*(log(diff_labels[nl])/log(2)-log(sum_W)/log(2));

line 151 to

if(diff_labels_l[nl]>0) ch-=(diff_labels_l[nl])*(log(diff_labels_l[nl])/log(2)-log(sum_l)/log(2));

and line 152 to

if(diff_labels_r[nl]>0) ch-=(diff_labels_r[nl])*(log(diff_labels_r[nl])/log(2)-log(sum_W-sum_l)/log(2));

Hope this solves it.

Leo

fairy


Hi Leo

These are the errors and warnings I get when I compile 'mx_compile_cartree':
mx_compile_cartree
GBCC.cpp
GBCC.cpp(24) : error C2563: mismatch in formal parameter list
GBCC.cpp(24) : error C2568: '=' : unable to resolve function overload
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\math.h(567): could be 'long double log(long double)'
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\math.h(519): or 'float log(float)'
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\math.h(121): or 'double log(double)'
GBCC.cpp(24) : error C2143: syntax error : missing ';' before 'constant'
GBCC.cpp(24) : error C2064: term does not evaluate to a function taking 1 arguments
GBCC.cpp(25) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(43) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(108) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(115) : error C3861: 'log2': identifier not found
GBCC.cpp(115) : error C3861: 'log2': identifier not found
GBCC.cpp(127) : warning C4244: '=' : conversion from 'double' to 'int', possible loss of data
GBCC.cpp(151) : error C3861: 'log2': identifier not found
GBCC.cpp(151) : error C3861: 'log2': identifier not found
GBCC.cpp(152) : error C3861: 'log2': identifier not found
GBCC.cpp(152) : error C3861: 'log2': identifier not found

C:\PROGRA~1\MATLAB\R2012B\BIN\MEX.PL: Error: Compile of 'GBCC.cpp' failed.

Error using mex (line 206)
Unable to complete successfully.

Error in mx_compile_cartree (line 8)
mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp
I use MATLAB R2012b and Visual C++ 2008.

fairy


Hi Leo

These are the errors and warnings I get when I compile 'mx_compile_cartree':

lcc preprocessor warning: .\node_cuts.h:8 best_cut_node.cpp:2 No newline at end of file
lcc preprocessor warning: best_cut_node.cpp:60 No newline at end of file
Error best_cut_node.cpp: .\node_cuts.h: 2 redeclaration of `GBCC' previously declared at .\node_cuts.h 1
Error best_cut_node.cpp: .\node_cuts.h: 5 redeclaration of `GBCR' previously declared at .\node_cuts.h 4
Error best_cut_node.cpp: .\node_cuts.h: 8 redeclaration of `GBCP' previously declared at .\node_cuts.h 7
Error best_cut_node.cpp: 35 type error in argument 5 to `GBCC'; found `int' expected `pointer to double'
Error best_cut_node.cpp: 35 type error in argument 7 to `GBCC'; found `pointer to double' expected `int'
Error best_cut_node.cpp: 35 insufficient number of arguments to `GBCC'
Error best_cut_node.cpp: 38 type error in argument 5 to `GBCP'; found `int' expected `pointer to double'
Error best_cut_node.cpp: 38 type error in argument 7 to `GBCP'; found `pointer to double' expected `int'
Error best_cut_node.cpp: 38 insufficient number of arguments to `GBCP'
Error best_cut_node.cpp: 41 type error in argument 5 to `GBCR'; found `int' expected `pointer to double'
Error best_cut_node.cpp: 41 type error in argument 6 to `GBCR'; found `pointer to double' expected `int'
Error best_cut_node.cpp: 41 insufficient number of arguments to `GBCR'
Error best_cut_node.cpp: 59 undeclared identifier `delete'
Error best_cut_node.cpp: 59 illegal expression
Error best_cut_node.cpp: 59 syntax error; found `method' expecting `]'
Error best_cut_node.cpp: 59 type error: pointer expected
Warning best_cut_node.cpp: 59 Statement has no effect
Error best_cut_node.cpp: 59 syntax error; found `method' expecting `;'
Warning best_cut_node.cpp: 59 Statement has no effect
Warning best_cut_node.cpp: 59 possible usage of delete before definition
17 errors, 5 warnings

C:\PROGRA~1\MATLAB\R2012B\BIN\MEX.PL: Error: Compile of 'best_cut_node.cpp' failed.

Error using mex (line 206)
Unable to complete successfully.

Error in mx_compile_cartree (line 8)
mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp

I use MATLAB R2012b and Visual C++ 2008.

Leo


Hi fairy,

Could you copy paste the exact error message you get when running mx_compile_cartree.m ?

Leo

fairy


Hi Leo, thanks for your help, but now I have another error!
??? Undefined function or method 'best_cut_node' for input arguments of type 'char'.

Error in ==> cartree at 84
[bestCutVar bestCutValue] = ...

Error in ==> Stochastic_Bosque at 48
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...

It seems mx_compile_cartree.m failed to compile; specifically, this command failed: mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp. Why? Thanks again.

zeel


How do I use this function? I mean, what are the parameters that I have to give it?

Leo


Hi fairy,

It would seem that the function is not in matlab's search path. You can run

addpath(genpath(cd))

Leo

fairy


Leo,
I paste the code here so you can give me advice. Thank you.
load diabetes
Data = diabetes.x;
Labels = diabetes.y;
Random_Forest = Stochastic_Bosque(Data,Labels);

I get this error:
??? Undefined function or method 'cartree' for input arguments of type 'double'.

Error in ==> Stochastic_Bosque at 48
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...
Could you kindly tell me why? Thanks very much.

LE


Hi Leo, I am afraid this package cannot handle categorical features. How could I update the code to handle datasets with categorical features?
Kindly guide me.
Thanks.

Leo


Hi qing,

Yes the elements of the vector "nodeCutVar" are feature indexes. You can retrieve the tree structure from the field RETree.childnode. For a node i the indexes of the child nodes are RETree.childnode(i) and RETree.childnode(i) + 1, for the left and right child.

Hope this helps.

Leo

qing


Hi leo,

Are the elements of the vector "nodeCutVar" feature indexes? But, how can I see the tree structure? I mean the relationship between features. Thanks!

Linh Dang

Could anybody give an example of how to run these files? Really appreciated.

Marios


Michael

Quick, clean and easy to use.
A useful submission.

mai


Hi Leo,
I'm new to MATLAB and would like to explore random forests, but I don't understand most of this.
function Random_Forest = Stochastic_Bosque(Data,Labels,varargin)
Data refers to my data.
What are Labels and varargin for?

Michael

Excellent work, code is well documented and clear, plus runtime is reasonable.

Adding a Readme file with description of the data format, and a demo.m would be very helpful.

Thanks for sharing.

Leo


Hi Matteo,

Sorry for the late reply; I did not receive a notification email.

Anyway you are correct, that is a bug in the code. It was pointed out by c. a few comments up.

The code has now been updated to remove that line.

Thanks for the feedback and rating.

Matteo


Very good software, thank you for your effort!
I was wondering whether sampling with replacement is really implemented in this method, as the documentation says.
When you do (lines 43-44):
TDindx = round(numel(Labels)*rand(n,1)+.5);
(NOTE: why not use the 'randi' function?)
you get 'n' indexes, and THEN you do:
TDindx = unique(TDindx);
removing all the duplicates!
Is this correct?

Jeff


Hi Leo, based on your experience, if this program were converted into pure C/C++, would that help improve the processing speed on a PC?

I noticed that the f_output vector sometimes swaps dimensions in eval_Stochastic_Bosque(). Quick fix:

Add this at line 34 in eval_Stochastic_Bosque():
if (size(Data,1) ~= size(f_output,1))
    f_output = f_output';
    f_votes = f_votes';
end

Leo


Hi C.

Thanks for pointing that out. I believe you are correct, that line should be commented out.

c.


Hi,
why do you call
TDindx = unique(TDindx);
when creating the forest?
I was under the impression that the use of bagging would improve the generalization abilities of the model, but through the call to unique we are getting rid of all multiple instances. Why did you choose not to use bagging, but rather subsets of the original data?

Ming


Impressive.
Hi all,
Some people above said that the package failed to compile on Windows. I found that this is probably because log2() in GBCC.cpp is not a standard C function. A feasible solution is to replace log2(n) with log((double)n)/log((double)2).
Thanks again for sharing the code.

Ming


Afzan

Thank you.

Leo


Hi Afzan,

It is in :

/Stochastic_Bosque/cartree/mx_files

It's a C++ file

Leo

Afzan

Leo, where is the best_cut_node function (cartree, line 45)? I can't find it; or is it a compatibility problem again?

Leo


Hi Kourosh,

Sorry for the late reply. Unfortunately I don't have a Windows machine to try this out. I am surprised it won't compile under Windows though. On Ubuntu it compiles using gcc.

Anyway if you paste the errors here maybe I can help out a bit more.

Hi Leo, thanks for sharing this code. I have difficulty mexing the cpp files. Do I need a special compiler? When I get to this line:
mex -O best_cut_node.cpp GBCR.cpp GBCP.cpp GBCC.cpp
I receive many errors in node_cuts.h, and it finally says: Compile of 'best_cut_node.cpp' failed.
I'm using R2007b on win32.
Could you help?
Thanks,
Kourosh

Leo


Hey,

sorry, the update was pending approval. It should be ok now.

AMB


Hi, the zip file I can download now has the same creation date as the old one; it requires all the changes I made in order to run, and performs the same as well. Please advise.

Leo


Hey,

I have already updated the code. You can re-download it.

AMB


Hi, will you be posting the new code? I would like to try it out on regression. Right now, when I use the old code on the classic Boston Housing data set, I get all NaNs. I would like to see whether this problem disappears in the code you fixed.

Leo


Hi Shujjat,

You would have to show me exactly what line 99 is in your code. In my code I do not get any errors. I suspect you have altered the code and inadvertently added or omitted a parenthesis.

Leo

Shujjat

Hi Leo,
I am following yours and Mohammad's threads, and I am getting the following errors after making the amendments you indicated.
>> Random_Forest = Stochastic_Bosque(data,label);
??? Error: File: cartree.m Line: 99 Column: 27
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> Stochastic_Bosque at 45
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...

Can you please help me?
Cheers

Leo


Hey AMB,

Thanks a lot for all your help. I found the bug that was causing the difference in performance (accuracy-wise): part of the code (erroneously) implicitly assumed that feature values were distinct.

It is now fixed. And results on the Glass dataset are equivalent to the results you quote for the google code.

Regarding speed, the code seems to run considerably faster on my PC, but nowhere near as fast as the google code, which is to be expected, as the google code is written almost entirely in C/C++.

I have also removed the dependencies from the statistics toolbox using your suggestions (thanks!).

Dependence on internal.stats.getargs has also been removed

AMB


Hi - I have followed your suggestion to compare the results of your code versus the "google code". This google code is at

http://code.google.com/p/randomforest-matlab/

This is a Matlab (and Standalone application) port for the excellent machine learning algorithm `Random Forests' - By Leo Breiman et al. from the R-source by Andy Liaw et al. http://cran.r-project.org/web/packages/randomForest/index.html ( Fortran original by Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener.) Current code version is based on 4.5-29 from source of randomForest package by Abhishek Jaiantilal.

Against the "glass" data set here are the statistics for 10 and 100 trees, withholding the 35% of the data as you had done.

For RandomBosque, the results were:

Elapsed time for 1000 runs: 1648.743 seconds
Average number correct with 35% samples held out: 0.636 for 10 trees 0.684 for 100 trees
Standard deviation correct with 35% samples held out: 0.076 for 10 trees 0.070 for 100 trees

For class_RFtrain and classRFpredict, the results were:

Elapsed time for 1000 runs: 88.021 seconds
Average number correct with 35% samples held out: 0.722 for 10 trees 0.758 for 100 trees
Standard deviation correct with 35% samples held out: 0.051 for 10 trees 0.048 for 100 trees

I use a MacBook Air with MATLAB 2011a and OS X 10.6.7.
I was surprised at the differences in runtime and in the statistics.

My calls to Randombosque look as follows:

tic;
correct = zeros(1000,2);
for i = 1:length(correct)

    M = length(Labels);
    m = round(.65*M);
    intraining = randperm(M);
    intraining = sort(intraining(1:m));
    notintraining = setdiff([1:M],intraining);

    Random_Forest = Stochastic_Bosque(Data(intraining,:),Labels(:,intraining),'ntrees',10);
    [f_output f_votes] = eval_Stochastic_Bosque(Data(notintraining,:),Random_Forest);
    error = Labels(:,notintraining)'-f_output;
    correctlyclassified = numel(find(error == 0))/numel(error);
    correct(i,1) = correctlyclassified;

    Random_Forest = Stochastic_Bosque(Data(intraining,:),Labels(:,intraining),'ntrees',100);
    [f_output f_votes] = eval_Stochastic_Bosque(Data(notintraining,:),Random_Forest);
    error = Labels(:,notintraining)'-f_output;
    correctlyclassified = numel(find(error == 0))/numel(error);
    correct(i,2) = correctlyclassified;

    if rem(i,25) == 1
        fprintf('Iteration: %3.0f\n',i);
    end
end
toc;

fprintf('Elapsed time for %3.0f runs: %5.3f seconds\n',length(correct),toc)
fprintf('Average number correct with 35%% samples held out: %5.3f for 10 trees %5.3f for 100 trees \n',mean(correct));
fprintf('Standard deviation correct with 35%% samples held out: %5.3f for 10 trees %5.3f for 100 trees\n',std(correct));

Leo


Waleed Hi,

Unfortunately it is quite hard to figure out what the problem is without more specific feedback. On top of this, the getargs function is not my code so I am not that familiar with how it works (or how it can fail).

Perhaps what you could do is remove that line of code all together and hard code the parameters. For example in the case of the cartree function make it :

function RETree = cartree(Data,Labels)

and then replace the call to getargs by :

minparent=2;
minleaf=1;
m=size(Data,2);
method= 'c';
W= [];

Alternatively you could make the call to cartree :

function RETree = cartree(Data,Labels,minparent,minleaf,m,method,W)

remove the call to getargs and just make sure you pass values for all the parameters whenever you call cartree.

If you want to look into more fancy options for passing parameters, you might find this thread useful :

http://stackoverflow.com/questions/2775263/how-to-deal-with-name-value-pairs-of-function-arguments-in-matlab

Leo


Hi AMB,

Unfortunately I don't have the google code installed to compare, but I ran comparisons with MATLAB's TreeBagger (glass data, 140/74 split, 10 trees) and got similar results for the two methods (my code seems to give better results, though I am not sure why).

Leo


Hi AMB,

Thanks for all the feedback. You make some very good suggestions which I will try to incorporate soon, especially concerning the randsample dependency (which hadn't crossed my mind).

For the datasets you are testing on: I tested on Glass with 10 trees and got ~72% accuracy on a 140/74 split. Could you report what splits and number of trees you are using, and what accuracies you are getting with this code and the "google" code?

Thanks!

AMB


I have been running my modified code and comparing the results with the version on

http://code.google.com/p/randomforest-matlab/

The results of the present package, which I modified as above, do not agree well with the google code. I am using classical datasets such as glass (classification) and boston housing (regression), and the google code has a much higher degree of accuracy. I would be grateful if anyone could share their experience using these classical data sets, to see whether they see the same result in their implementations. The boston data set is at http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html and the glass data set is at http://archive.ics.uci.edu/ml/datasets/Glass+Identification

AMB


This package was extremely useful. I should say to all that I am just a new student in this field and my comments reflect my interest in learning more, having a toolbox that is accessible, and one that actually works without days of effort. I must say that I have managed to get the other RandomForest implementations (Google code etc...) up and running but only with considerable difficulty owing to mex compilation issues. I did not have this particular difficulty with this package and as a result was delighted.

This package could be improved if it were accompanied by a demonstration file, some instruction on how to build the package and link the paths, and had eliminated the dependency on the randsample statistic toolbox routine, which some users do not have.

After modification of a few lines, the calls to randsample can be replaced, I believe. For instance the call in Random_Bosque:

TDindx = randsample(numel(Labels),n,true);

could I think be replaced with

TDindx = round(numel(Labels)*rand(n,1)+.5);
TDindx = unique(TDindx);

and the call in cartree

node_var = sort(randsample(M,m,0));

could be replaced with

node_var = randperm(M);
node_var = sort(node_var(1:m));

There may be limitations to using these substitutions when M is large, but I was very pleased with the speed of the entire package.

The author's suggestions to replace the internal.stats.getargs with calls to getargs were entirely successful.

On a Mac, the cpp programs mex'd without difficulty. I found it expedient to simply move the mx_eval_cartree.mexmaci64, best_cut_node.mexmaci64, and weighted_hist.m files to the folder containing Stochastic_Bosque.m rather than to adjust paths.

I used the irisdata as a demonstration. It is short and uncomplicated. It is available from http://en.wikipedia.org/wiki/Iris_flower_data_set

Just copy the data out and place it into an mfile. I put the data into a matrix called Data. To try out the Stochastic Bosque routines, I then wrote

Labels = Data(:,5)';
Data = Data(:,1:4);

and then invoked the package by the calls:

Random_Forest = Stochastic_Bosque(Data,Labels,'ntrees',50);
[f_output f_votes]= eval_Stochastic_Bosque(Data,Random_Forest);
error = Labels'-f_output;
correctlyclassified = numel(find(error == 0))/numel(error)

As I am a beginner, and was operating without a license on the author's source code! I thought it useful to subsample the iris data set so that I would have a test set against which to examine the performance of the Random_Forest. While this was unnecessary from a theoretical standpoint, I thought it was worthwhile from the standpoint of checking that my modifications to the source were not ruinous.

The resulting test code looks like

M = length(Labels);
m = round(.5*M);
intraining = randperm(M);
intraining = sort(intraining(1:m));
notintraining = setdiff([1:M],intraining);
Random_Forest = Stochastic_Bosque(Data(intraining,:),Labels(:,intraining),'ntrees',10);
[f_output f_votes] = eval_Stochastic_Bosque(Data(notintraining,:),Random_Forest);
error = Labels(:,notintraining)'-f_output;
correctlyclassified = numel(find(error == 0))/numel(error)

and I was pleased to see that the correctlyclassified measure compared favorably with the original.

I should have also liked to see some proximity measures and permutation importance measures present, I speculate that perhaps these were eliminated to produce a package that ran swiftly. At any rate, I shall try to make these myself, because it seems to me that I can write a wrapper and call the Stochastic_Bosque to make my own calculations. If the author would care to offer any further suggestions or caveats, I would like to hear them because I think that his work is useful and can be extended.

Waleed Yousef

Thanks, but what about the error message?

Leo


Hi,

The line 45 you refer to has to do with the subsampling of data samples not the features. Each tree is trained using a different subset of the training data.

Waleed Yousef

I received the same errors as Mohammed above. I corrected them as you advised. I now receive this error:

Error in ==> getargs at 48
emsg = '';

??? Output argument "varargout{7}" (and maybe others) not assigned during call to "C:\MyDocuments\MATLAB\tmp\getargs.m>getargs".

Waleed Yousef

So, what about line 45 in Stochastic_Bosque, where you write: cartree(Data(TDindx,:), ...

Doesn't this mean that you enforce a subset of the features on the whole tree?

Leo


Hi,

Random feature selection for the cartrees is done in line 74 :

node_var=sort(randsample(M,m,0));

which is inside the tree construction loop. So it is done separately for each node. Is this what you were referring to or was it another line of code?

Waleed Yousef

Leo, I just skimmed your code. I think you do random selection of features per tree, not for each node in the tree as it should be. Am I right?

Leo


Hi Mohammad,

It seems to be another version incompatibility.

Replace :

[unique_labels,~,Labels]= unique(Labels);

with

[unique_labels,dummy,Labels]= unique(Labels);

and it should work.

Leo

Thanks Leo

I got another error using your codes:

??? Error: File: cartree.m Line: 50 Column: 25
Expression or statement is incorrect--possibly unbalanced (, {, or [.

Error in ==> Stochastic_Bosque at 46
Random_ForestT = cartree(Data(TDindx,:),Labels(TDindx), ...

The 50th line is:
[unique_labels,~,Labels]= unique(Labels);
It seems odd, at least to me.

Besides, I want to know: is your code based on the random subspace method? If so, what percentage of features is used to create the feature subsets?

Leo


Default number of features sampled at each node is

round(sqrt(size(Data,2)))

where size(Data,2) is the dimensionality of the data.

You can set this parameter via the
'nvartosample' parameter.


Leo


Hi Mohammed,

internal.stats.getargs is an internal MATLAB command which I assume is not available in your version. You can download the following:

http://www.mathworks.com/matlabcentral/fileexchange/24082-getargs-m

and simply replace that line by :

[eid,emsg,minparent,minleaf,m,nTrees,n,method,oobe,W] =
getargs(okargs,defaults,varargin{:});

(and similarly in the cartree function :

[eid,emsg,minparent,minleaf,m,method,W] = getargs(okargs,defaults,varargin{:});

)

Hi Leo

When I run this command:
Random_Forest = Stochastic_Bosque(Patterns,Targets);

I get this error:

??? Undefined variable "internal" or class "internal.stats.getargs".

Error in ==> Stochastic_Bosque at 39
[eid,emsg,minparent,minleaf,m,nTrees,n,method,oobe,W] =
internal.stats.getargs(okargs,defaults,varargin{:});

Why?!!

Updates

1.7

Fixed bug (see comment by c.)

1.6

Removed implicit assumption of distinct feature values; removed Statistics Toolbox dependency; removed internal command dependency

MATLAB Release
MATLAB 7.9 (R2009b)
Acknowledgements

Inspired by: getargs.m
