Main Content

ClassificationPartitionedModel

Cross-validated classification model

Description

ClassificationPartitionedModel is a set of classification models trained on cross-validated folds. You can estimate the quality of the classification by using one or more kfold functions: kfoldPredict, kfoldLoss, kfoldMargin, kfoldEdge, and kfoldfun.

Every kfold function uses models trained on training-fold (in-fold) observations to predict the response for validation-fold (out-of-fold) observations. For example, when you use kfoldPredict with a k-fold cross-validated model, the software estimates a response for every observation using the model trained without that observation. For more information, see Partitioned Models.

Creation

You can create a ClassificationPartitionedModel object in two ways:

  • Create a cross-validated model from a full classification model object by using the crossval object function.

  • Create a cross-validated model by using the function fitcdiscr, fitcknn, fitcnb, fitcsvm, or fitctree, and specifying one of the name-value arguments CrossVal, KFold, Holdout, Leaveout, or CVPartition.
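
For example, both of the following approaches produce a cross-validated classification tree. This is a minimal sketch using the fisheriris sample data; the variable names are illustrative.

```matlab
load fisheriris   % provides predictor matrix meas and labels species

% Approach 1: train a full model, then cross-validate it.
Mdl = fitctree(meas,species);   % full classification tree
CVMdl1 = crossval(Mdl);         % 10-fold cross-validation by default

% Approach 2: request cross-validation when fitting.
CVMdl2 = fitctree(meas,species,KFold=5);   % 5-fold cross-validated model
```

Both CVMdl1 and CVMdl2 are ClassificationPartitionedModel objects.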

Properties


Cross-Validation Properties

CrossValidatedModel

This property is read-only.

Name of the cross-validated model, returned as a character vector.

Data Types: char

KFold

This property is read-only.

Number of folds in the cross-validated model, returned as a positive integer.

Data Types: double

ModelParameters

This property is read-only.

Parameters of the cross-validated model, returned as an object.

Partition

This property is read-only.

Partition used in the cross-validation, returned as a cvpartition object.

Trained

This property is read-only.

Trained learners, returned as a cell array of compact classification models. For more information, see Partitioned Models.

Data Types: cell
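
For example, you can retrieve the compact model trained on one fold and use it directly. This is a minimal sketch; CVMdl and Xnew are assumed names for an existing cross-validated model and a matrix of new observations.

```matlab
% Hypothetical names: CVMdl is a ClassificationPartitionedModel,
% Xnew is a matrix of new observations with the same predictors.
firstFoldMdl = CVMdl.Trained{1};       % compact model from the first fold
labels = predict(firstFoldMdl,Xnew);   % predict with that fold's model
```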

Other Classification Properties

BinEdges

This property is read-only.

Bin edges for numeric predictors, returned as a cell array of p numeric vectors, where p is the number of predictors. Each vector includes the bin edges for a numeric predictor. The element in the cell array for a categorical predictor is empty because the software does not bin categorical predictors.

The software bins numeric predictors only if you specify the NumBins name-value argument as a positive integer scalar when training a model with tree learners. The BinEdges property is empty if the NumBins value is empty (default).

You can reproduce the binned predictor data Xbinned by using the BinEdges property of the trained model mdl.

X = mdl.X; % Predictor data
Xbinned = zeros(size(X));
edges = mdl.BinEdges;
% Find indices of binned predictors.
idxNumeric = find(~cellfun(@isempty,edges));
if iscolumn(idxNumeric)
    idxNumeric = idxNumeric';
end
for j = idxNumeric 
    x = X(:,j);
    % Convert x to array if x is a table.
    if istable(x) 
        x = table2array(x);
    end
    % Group x into bins by using the discretize function.
    xbinned = discretize(x,[-inf; edges{j}; inf]); 
    Xbinned(:,j) = xbinned;
end
Xbinned contains the bin indices, ranging from 1 to the number of bins, for the numeric predictors. Xbinned values are 0 for categorical predictors. If X contains NaNs, then the corresponding Xbinned values are NaNs.

Data Types: cell

CategoricalPredictors

This property is read-only.

Categorical predictor indices, returned as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).

Data Types: single | double

ClassNames

This property is read-only.

Unique class labels used in training, returned as a categorical or character array, logical or numeric vector, or cell array of character vectors. ClassNames has the same data type as the class labels Y. (The software treats string arrays as cell arrays of character vectors.) ClassNames also determines the class order.

Data Types: categorical | char | logical | single | double | cell

Cost

Misclassification costs, specified as a square numeric matrix. Cost has K rows and columns, where K is the number of classes.

Cost(i,j) is the cost of classifying a point into class j if its true class is i. The order of the rows and columns of Cost corresponds to the order of the classes in ClassNames.

If the model is a cross-validated ClassificationDiscriminant, ClassificationKNN, or ClassificationNaiveBayes model, then you can change its cost matrix using dot notation. For example, for a cross-validated model CVMdl and a cost matrix costMatrix, you can specify:

CVMdl.Cost = costMatrix;

Data Types: double
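
As an illustration, the following sketch builds an asymmetric cost matrix for a two-class problem with ClassNames {'b','g'}, making it five times as costly to misclassify a true 'b' as to misclassify a true 'g'. CVMdl is an assumed name for an existing cross-validated k-NN, discriminant, or naive Bayes model.

```matlab
% Rows are true classes, columns are predicted classes, in ClassNames order.
% costMatrix(1,2) is the cost of predicting 'g' when the true class is 'b'.
costMatrix = [0 5
              1 0];
CVMdl.Cost = costMatrix;   % dot-notation assignment, as described above
```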

NumObservations

This property is read-only.

Number of observations in the training data, returned as a positive integer. NumObservations can be less than the number of rows of input data when there are missing values in the input data or response data.

Data Types: double

PredictorNames

This property is read-only.

Predictor names in order of their appearance in the predictor data X, returned as a cell array of character vectors. The length of PredictorNames is equal to the number of columns in X.

Data Types: cell

Prior

Prior probabilities for each class, specified as a numeric vector. The order of the elements of Prior corresponds to the order of the classes in ClassNames.

If the model is a cross-validated ClassificationDiscriminant or ClassificationNaiveBayes model, then you can change its vector of priors using dot notation. For example, for a cross-validated model CVMdl and a vector of prior probabilities priorVector, you can specify:

CVMdl.Prior = priorVector;

Data Types: double

ResponseName

This property is read-only.

Name of the response variable, returned as a character vector.

Data Types: char

ScoreTransform

Score transformation function, specified as a character vector, string scalar, or function handle. ScoreTransform represents a built-in transformation function or a function handle for transforming predicted classification scores.

To change the score transformation function to function, for example, use dot notation.

  • For a built-in function, enter a character vector or string scalar.

    Mdl.ScoreTransform = "function";

    This table lists the values for the available built-in functions.

    Value                   Description
    "doublelogit"           1/(1 + e^(–2x))
    "invlogit"              log(x / (1 – x))
    "ismax"                 Sets the score for the class with the largest score to 1, and sets the scores for all other classes to 0
    "logit"                 1/(1 + e^(–x))
    "none" or "identity"    x (no transformation)
    "sign"                  –1 for x < 0, 0 for x = 0, and 1 for x > 0
    "symmetric"             2x – 1
    "symmetricismax"        Sets the score for the class with the largest score to 1, and sets the scores for all other classes to –1
    "symmetriclogit"        2/(1 + e^(–x)) – 1

  • For a MATLAB® function or a function that you define, enter its function handle.

    Mdl.ScoreTransform = @function;

    function must accept a matrix (the original scores) and return a matrix of the same size (the transformed scores).

Data Types: char | string | function_handle
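
As a sketch of the function-handle form, the following assigns a custom row-wise softmax transform, which rescales each observation's scores to positive values that sum to 1. The name softmaxRows is illustrative, not a built-in transform.

```matlab
% Custom transform: accepts a score matrix and returns a matrix of the same size.
softmaxRows = @(S) exp(S) ./ sum(exp(S),2);
Mdl.ScoreTransform = softmaxRows;
```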

W

This property is read-only.

Scaled weights in the model, returned as a numeric vector. W has length n, the number of rows in the training data.

Data Types: double

X

This property is read-only.

Predictor values, returned as a real matrix or table. Each column of X represents one variable (predictor), and each row represents one observation.

Data Types: double | table

Y

This property is read-only.

Class labels corresponding to the observations in X, returned as a categorical array, cell array of character vectors, character array, logical vector, or numeric vector. Each row of Y represents the classification of the corresponding row of X.

Data Types: single | double | logical | char | string | cell | categorical

Object Functions

gather - Gather properties of Statistics and Machine Learning Toolbox object from GPU
kfoldEdge - Classification edge for cross-validated classification model
kfoldLoss - Classification loss for cross-validated classification model
kfoldMargin - Classification margins for cross-validated classification model
kfoldPredict - Classify observations in cross-validated classification model
kfoldfun - Cross-validate function for classification

Examples


Evaluate the 10-fold cross-validation error for a classification tree model.

Load Fisher's iris data set.

load fisheriris

Train a classification tree using default options.

Mdl = fitctree(meas,species);

Cross-validate the classification tree model.

CVMdl = crossval(Mdl);

Estimate the 10-fold cross-validation loss.

L = kfoldLoss(CVMdl)
L = 
0.0533

Estimate positive class posterior probabilities for the test set of an SVM algorithm.

Load the ionosphere data set.

load ionosphere

Train an SVM classifier. Specify a 20% holdout sample. Standardize the predictors and specify the class order.

rng(1) % For reproducibility
CVSVMModel = fitcsvm(X,Y,Holdout=0.2,Standardize=true, ...
    ClassNames={'b','g'});

CVSVMModel is a trained ClassificationPartitionedModel cross-validated classifier.

Estimate the optimal score function for mapping observation scores to posterior probabilities of an observation being classified as g.

ScoreCVSVMModel = fitSVMPosterior(CVSVMModel);

ScoreCVSVMModel is a trained ClassificationPartitionedModel cross-validated classifier containing the optimal score transformation function estimated from the training data.

Estimate the out-of-sample positive class posterior probabilities. Display the results for the first 10 out-of-sample observations.

[~,OOSPostProbs] = kfoldPredict(ScoreCVSVMModel);
indx = ~isnan(OOSPostProbs(:,2));
hoObs = find(indx); % Holdout observation numbers
OOSPostProbs = [hoObs, OOSPostProbs(indx,2)];
table(OOSPostProbs(1:10,1),OOSPostProbs(1:10,2), ...
    VariableNames=["ObservationIndex","PosteriorProbability"])
ans=10×2 table
    ObservationIndex    PosteriorProbability
    ________________    ____________________

            6                   0.17378     
            7                   0.89637     
            8                 0.0076583     
            9                   0.91602     
           16                  0.026715     
           22                 4.609e-06     
           23                    0.9024     
           24                2.4135e-06     
           38                0.00042673     
           41                   0.86427     

Compute the loss and the predictions for a classification model, first partitioned using holdout validation and then partitioned using 3-fold cross-validation. Compare the two sets of losses and predictions.

Create a table from the fisheriris data set, which contains length and width measurements from the sepals and petals of three species of iris flowers. View the first eight observations.

fisheriris = readtable("fisheriris.csv");
head(fisheriris)
    SepalLength    SepalWidth    PetalLength    PetalWidth     Species  
    ___________    __________    ___________    __________    __________

        5.1           3.5            1.4           0.2        {'setosa'}
        4.9             3            1.4           0.2        {'setosa'}
        4.7           3.2            1.3           0.2        {'setosa'}
        4.6           3.1            1.5           0.2        {'setosa'}
          5           3.6            1.4           0.2        {'setosa'}
        5.4           3.9            1.7           0.4        {'setosa'}
        4.6           3.4            1.4           0.3        {'setosa'}
          5           3.4            1.5           0.2        {'setosa'}

Partition the data using cvpartition. First, create a partition for holdout validation, using approximately 70% of the observations for the training data and 30% for the validation data. Then, create a partition for 3-fold cross-validation.

rng(0,"twister") % For reproducibility
holdoutPartition = cvpartition(fisheriris.Species,Holdout=0.30);
kfoldPartition = cvpartition(fisheriris.Species,KFold=3);

holdoutPartition and kfoldPartition are both stratified random partitions. You can use the training and test functions to find the indices for the observations in the training and validation sets, respectively.

Train a classification tree model using the fisheriris data. Specify Species as the response variable.

Mdl = fitctree(fisheriris,"Species");

Create the partitioned classification models using crossval.

holdoutMdl = crossval(Mdl,CVPartition=holdoutPartition)
holdoutMdl = 
  ClassificationPartitionedModel
    CrossValidatedModel: 'Tree'
         PredictorNames: {'SepalLength'  'SepalWidth'  'PetalLength'  'PetalWidth'}
           ResponseName: 'Species'
        NumObservations: 150
                  KFold: 1
              Partition: [1×1 cvpartition]
             ClassNames: {'setosa'  'versicolor'  'virginica'}
         ScoreTransform: 'none'



kfoldMdl = crossval(Mdl,CVPartition=kfoldPartition)
kfoldMdl = 
  ClassificationPartitionedModel
    CrossValidatedModel: 'Tree'
         PredictorNames: {'SepalLength'  'SepalWidth'  'PetalLength'  'PetalWidth'}
           ResponseName: 'Species'
        NumObservations: 150
                  KFold: 3
              Partition: [1×1 cvpartition]
             ClassNames: {'setosa'  'versicolor'  'virginica'}
         ScoreTransform: 'none'



holdoutMdl and kfoldMdl are ClassificationPartitionedModel objects.

Compute the minimal expected misclassification cost for holdoutMdl and kfoldMdl using kfoldLoss. Because both models use the default cost matrix, this cost is the same as the classification error.

holdoutL = kfoldLoss(holdoutMdl)
holdoutL = 
0.0889
kfoldL = kfoldLoss(kfoldMdl)
kfoldL = 
0.0600

holdoutL is the error computed using the predictions for one validation set, while kfoldL is an average error computed using the predictions for three folds of validation data. Cross-validation metrics tend to be better indicators of a model's performance on unseen data.

Compute the validation data predictions for the two models using kfoldPredict.

[holdoutLabels,holdoutScores] = kfoldPredict(holdoutMdl);
[kfoldLabels,kfoldScores] = kfoldPredict(kfoldMdl);

holdoutClassNames = holdoutMdl.ClassNames;
holdoutScores = array2table(holdoutScores,VariableNames=holdoutClassNames);
kfoldClassNames = kfoldMdl.ClassNames;
kfoldScores = array2table(kfoldScores,VariableNames=kfoldClassNames);

predictions = table(holdoutLabels,kfoldLabels, ...
    holdoutScores,kfoldScores, ...
    VariableNames=["holdoutMdl Labels","kfoldMdl Labels", ...
    "holdoutMdl Scores","kfoldMdl Scores"])
predictions=150×4 table
    holdoutMdl Labels    kfoldMdl Labels            holdoutMdl Scores                     kfoldMdl Scores         
    _________________    _______________    _________________________________    _________________________________

                                            setosa    versicolor    virginica    setosa    versicolor    virginica
                                            ______    __________    _________    ______    __________    _________
                                                                                                                  
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}          1           0             0         1           0             0    
       {'setosa'}          {'setosa'}        NaN         NaN           NaN         1           0             0    
      ⋮

kfoldPredict returns NaN scores for the observations used to train holdoutMdl.Trained. For these observations, the function selects the class label with the highest frequency as the predicted label. In this case, because all classes have the same frequency, the function selects the first class (setosa) as the predicted label. The function uses the trained model to return predictions for the validation set observations. kfoldPredict returns each kfoldMdl prediction using the model in kfoldMdl.Trained that was trained without that observation.

To predict responses for unseen data, use the model trained on the entire data set (Mdl) and its predict function rather than a partitioned model such as holdoutMdl or kfoldMdl.

Tips

To estimate posterior probabilities of trained, cross-validated SVM classifiers, use fitSVMPosterior.

Algorithms


Extended Capabilities


Version History

Introduced in R2011a
