Surrogate Splits

When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance are often different with surrogate splits. Eliminating unimportant predictors can save time and memory for predictions, and can make predictions easier to understand.

This example shows how surrogate splits affect prediction accuracy when the test set contains missing entries.

Load sample data. Partition it into training and test sets.

load ionosphere;

rng(10) % for reproducibility
cv = cvpartition(Y,'Holdout',0.3);
Xtrain = X(training(cv),:);
Ytrain = Y(training(cv));
Xtest = X(test(cv),:);
Ytest = Y(test(cv));

Bag decision trees with and without surrogate splits.

b = fitcensemble(Xtrain,Ytrain,'Method','Bag','NumLearningCycles',50);

templS = templateTree('Surrogate','On');
bs = fitcensemble(Xtrain,Ytrain,'Method','Bag','NumLearningCycles',50,...
    'Learners',templS);
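
The claim that predictor importance estimates often differ with surrogate splits can be checked with a short sketch such as the following. It is an illustration rather than part of the original example: the variable names (ensPlain, ensSurr, impPlain, impSurr) and the choice of 20 trees are assumptions, and it trains on the full ionosphere data. Run it separately from the main example, because training draws from the random number stream and would change the results that follow.

```matlab
% Standalone sketch: compare predictor importance with and without
% surrogate splits. Names and the number of trees are illustrative.
load ionosphere                              % provides X (351-by-34) and Y

tSurr = templateTree('Surrogate','On');      % trees that store surrogate splits
ensPlain = fitcensemble(X,Y,'Method','Bag','NumLearningCycles',20);
ensSurr  = fitcensemble(X,Y,'Method','Bag','NumLearningCycles',20, ...
    'Learners',tSurr);

impPlain = predictorImportance(ensPlain);    % one estimate per predictor
impSurr  = predictorImportance(ensSurr);

figure;
bar([impPlain; impSurr]');                   % grouped bars, one group per predictor
legend('Regular trees','Trees with surrogate splits');
xlabel('Predictor index');
ylabel('Predictor importance estimate');
```

Predictors whose bars are near zero in both ensembles are candidates for removal, which is the time and memory saving mentioned at the start of this topic.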

Suppose about half of the values in the test set are missing. Simulate this by replacing each entry with NaN with probability 0.5.

Xtest(rand(size(Xtest))>0.5) = NaN;
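
Because each entry is masked independently, the fraction of missing values is close to one half but not exactly one half. The following standalone sketch applies the same masking expression to a hypothetical matrix A (a stand-in for Xtest, not part of the original example) and measures the resulting fraction.

```matlab
% Standalone sketch: the masking expression removes roughly half the entries.
rng(1)                            % arbitrary seed, for this sketch only
A = ones(1000,20);                % hypothetical stand-in for Xtest
A(rand(size(A)) > 0.5) = NaN;     % same masking expression as in the example
fracMissing = mean(isnan(A(:)));  % fraction of entries that are now NaN
fprintf('Fraction missing: %.3f\n',fracMissing);
```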

Test accuracy with and without surrogate splits.

figure;
plot(loss(b,Xtest,Ytest,'Mode','Cumulative'));
hold on;
plot(loss(bs,Xtest,Ytest,'Mode','Cumulative'),'r--');
legend('Regular trees','Trees with surrogate splits');
xlabel('Number of trees');
ylabel('Test classification error');

Check the statistical significance of the difference in results with the McNemar test. Convert the labels to a nominal data type to make it easier to check for equality.

Yfit = nominal(predict(b,Xtest));
YfitS = nominal(predict(bs,Xtest));
N10 = sum(Yfit==nominal(Ytest) & YfitS~=nominal(Ytest));
N01 = sum(Yfit~=nominal(Ytest) & YfitS==nominal(Ytest));
mcnemar = (abs(N10-N01) - 1)^2/(N10+N01);
pval = 1 - chi2cdf(mcnemar,1)
pval =

   1.7683e-04

The p-value is far below conventional significance levels such as 0.05, so the difference in test accuracy between the two ensembles is statistically significant, in favor of the ensemble with surrogate splits.
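
The statistic computed above is the McNemar statistic with continuity correction, (|N10 - N01| - 1)^2/(N10 + N01), compared against a chi-square distribution with one degree of freedom. The following sketch reproduces the calculation on hypothetical label vectors (Y, YHat1, and YHat2 are made up for illustration, not taken from the ionosphere data) so that the counts are easy to verify by hand. It also calls testcholdout, a Statistics and Machine Learning Toolbox function that compares two sets of predicted labels; by default it runs a mid-p-value variant of the McNemar test, so its p-value is not expected to match the manual one exactly.

```matlab
% Standalone sketch with hypothetical labels: 100 observations, two classifiers.
Y = [ones(50,1); zeros(50,1)];    % true labels
YHat1 = Y;  YHat1(1:25)  = 0;     % classifier 1 is wrong on 25 observations
YHat2 = Y;  YHat2(26:30) = 0;     % classifier 2 is wrong on 5 other observations

N10 = sum(YHat1==Y & YHat2~=Y);   % classifier 1 right, classifier 2 wrong
N01 = sum(YHat1~=Y & YHat2==Y);   % classifier 1 wrong, classifier 2 right

stat = (abs(N10-N01) - 1)^2/(N10+N01);   % continuity-corrected McNemar statistic
pval = chi2cdf(stat,1,'upper');          % upper tail, avoids computing 1 - cdf

% Built-in comparison of the two classifiers on the same hold-out labels
[h,pMid] = testcholdout(YHat1,YHat2,Y);

fprintf('N10 = %d, N01 = %d, statistic = %.4f\n',N10,N01,stat);
```

Using the 'upper' option of chi2cdf instead of 1 - chi2cdf(...) is a design choice for numerical accuracy when the p-value is very small. Both forms agree at the magnitudes seen in this example.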
