When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance are often different with surrogate splits. Eliminating unimportant predictors can save time and memory for predictions, and can make predictions easier to understand.
This example shows the effects of surrogate splits for predictions for data containing missing entries in the test set.
Load sample data. Partition it into a training and test set.
load ionosphere; rng(10) % for reproducibility cv = cvpartition(Y,'Holdout',0.3); Xtrain = X(training(cv),:); Ytrain = Y(training(cv)); Xtest = X(test(cv),:); Ytest = Y(test(cv));
Bag decision trees with and without surrogate splits.
b = fitcensemble(Xtrain,Ytrain,'Method','Bag','NumLearningCycles',50); templS = templateTree('Surrogate','On'); bs = fitcensemble(Xtrain,Ytrain,'Method','Bag','NumLearningCycles',50,... 'Learners',templS);
Suppose half of the values in the test set are missing.
Xtest(rand(size(Xtest))>0.5) = NaN;
Test accuracy with and without surrogate splits.
figure; plot(loss(b,Xtest,Ytest,'Mode','Cumulative')); hold on; plot(loss(bs,Xtest,Ytest,'Mode','Cumulative'),'r--'); legend('Regular trees','Trees with surrogate splits'); xlabel('Number of trees'); ylabel('Test classification error');
Check the statistical significance of the difference in results with the McNemar test. Convert the labels to a nominal data type to make it easier to check for equality.
Yfit = nominal(predict(b,Xtest)); YfitS = nominal(predict(bs,Xtest)); N10 = sum(Yfit==nominal(Ytest) & YfitS~=nominal(Ytest)); N01 = sum(Yfit~=nominal(Ytest) & YfitS==nominal(Ytest)); mcnemar = (abs(N10-N01) - 1)^2/(N10+N01); pval = 1 - chi2cdf(mcnemar,1)
pval = 1.7683e-04
The extremely low p-value indicates that the ensemble with surrogate splits is better in a statistically significant manner.