predict

Class: ClassificationKNN

Predict labels using k-nearest neighbor classification model

Syntax

  • label = predict(Mdl,X)
  • [label,score,cost] = predict(Mdl,X)

Description

label = predict(Mdl,X) returns a vector of predicted class labels for the predictor data in the table or matrix X, based on the trained k-nearest neighbor classification model Mdl.

[label,score,cost] = predict(Mdl,X) also returns:

  • A matrix of classification scores (score) indicating the likelihood that a label comes from a particular class. For k-nearest neighbor, scores are posterior probabilities.

  • A matrix of expected classification costs (cost). For each observation in X, the predicted class label corresponds to the minimum expected classification cost among all classes.
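
For example, a minimal sketch using the fisheriris sample data set (the same data as in the Examples section) that requests all three outputs:

load fisheriris
Mdl = fitcknn(meas,species,'NumNeighbors',5);   % 5-nearest-neighbor model
[label,score,cost] = predict(Mdl,meas(1:3,:));  % classify the first 3 rows
% label: 3-by-1 cell array of class names
% score: 3-by-3 posterior probabilities, columns ordered as Mdl.ClassNames
% cost:  3-by-3 expected misclassification costs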

Input Arguments

Mdl

k-nearest neighbor classification model, specified as a ClassificationKNN model object returned by fitcknn.

X

Predictor data to be classified, specified as a numeric matrix or table.

Each row of X corresponds to one observation, and each column corresponds to one variable.

  • For a numeric matrix:

    • The variables making up the columns of X must have the same order as the predictor variables that trained Mdl.

    • If you trained Mdl using a table (for example, Tbl), then X can be a numeric matrix if Tbl contains all numeric predictor variables. To treat all numeric predictors in Tbl as categorical during training (k-nearest neighbor classification requires homogeneous predictors), set 'CategoricalPredictors','all' when you train using fitcknn. If Tbl contains heterogeneous predictors (for example, numeric and categorical data types) and X is a numeric matrix, then predict throws an error.

  • For a table:

    • predict does not support multi-column variables and cell arrays other than cell arrays of character vectors.

    • If you trained Mdl using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those that trained Mdl (stored in Mdl.PredictorNames). However, the column order of X does not need to correspond to the column order of Tbl. Tbl and X can contain additional variables (response variables, observation weights, etc.), but predict ignores them.

    • If you trained Mdl using a numeric matrix, then the predictor names in Mdl.PredictorNames and corresponding predictor variable names in X must be the same. To specify predictor names during training, see the PredictorNames name-value pair argument of fitcknn. All predictor variables in X must be numeric vectors. X can contain additional variables (response variables, observation weights, etc.), but predict ignores them.

If you set 'Standardize',true in fitcknn to train Mdl, then the software standardizes the columns of X using the corresponding means in Mdl.Mu and standard deviations in Mdl.Sigma.

Data Types: table | double | single
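
A sketch of the table workflow (the variable names SL, SW, PL, and PW are invented for illustration): predict matches table variables by name, so the column order of new data does not need to match the training order.

load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
Tbl.Species = species;
Mdl = fitcknn(Tbl,'Species','NumNeighbors',5,'Standardize',true);
Xnew = Tbl(1:5,{'PW','PL','SW','SL'});  % predictor columns deliberately reordered
label = predict(Mdl,Xnew);              % variables matched via Mdl.PredictorNames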

Output Arguments

label

Predicted class labels for the observations (rows) in X, returned as a vector with length equal to the number of rows of X. The label is the class with minimal expected cost. See Predicted Class Label.

score

Predicted class scores or posterior probabilities, returned as a numeric matrix of size N-by-K. N is the number of observations (rows) in X, and K is the number of classes (in Mdl.ClassNames). score(i,j) is the posterior probability that observation i in X is of class j in Mdl.ClassNames. See Posterior Probability.

cost

Expected costs, returned as a matrix of size N-by-K. N is the number of observations (rows) in X, and K is the number of classes (in Mdl.ClassNames). cost(i,j) is the cost of classifying row i of X as class j in Mdl.ClassNames. See Expected Cost.

Definitions

Predicted Class Label

predict classifies so as to minimize the expected classification cost:

$$\hat{y} = \underset{y=1,\ldots,K}{\arg\min}\, \sum_{k=1}^{K} \hat{P}(k \mid x)\, C(y \mid k),$$

where

  • $\hat{y}$ is the predicted classification.

  • $K$ is the number of classes.

  • $\hat{P}(k \mid x)$ is the posterior probability of class $k$ for observation $x$.

  • $C(y \mid k)$ is the cost of classifying an observation as $y$ when its true class is $k$.
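
As a quick check of this rule (a sketch assuming the default 0-1 cost matrix and the fisheriris data), the label that predict returns equals the class minimizing the expected cost computed from the scores, up to tie-breaking:

load fisheriris
Mdl = fitcknn(meas,species,'NumNeighbors',5);
[label,score] = predict(Mdl,meas);
expCost = score*Mdl.Cost;            % expCost(n,y) = sum over k of P(k|x_n)*C(y|k)
[~,idx] = min(expCost,[],2);         % class index minimizing expected cost
isequal(label,Mdl.ClassNames(idx))   % expected: logical 1, barring exact ties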

Posterior Probability

For a vector (single query point) Xnew and model mdl, let:

  • K be the number of nearest neighbors used in prediction, mdl.NumNeighbors

  • nbd(mdl,Xnew) be the K nearest neighbors to Xnew in mdl.X

  • Y(nbd) be the classifications of the points in nbd(mdl,Xnew), namely mdl.Y(nbd)

  • W(nbd) be the weights of the points in nbd(mdl,Xnew)

  • prior be the priors of the classes in mdl.Y

If there is a vector of prior probabilities, then the observation weights W are normalized by class to sum to the priors. This might involve a calculation for the point Xnew, because weights can depend on the distance from Xnew to the points in mdl.X.

The posterior probability $p(j \mid X_{\mathrm{new}})$ is

$$p(j \mid X_{\mathrm{new}}) = \frac{\sum_{i \in \mathrm{nbd}} W(i)\, \mathbf{1}_{Y(X(i))=j}}{\sum_{i \in \mathrm{nbd}} W(i)}.$$

Here, $\mathbf{1}_{Y(X(i))=j}$ equals 1 when mdl.Y(i) = j, and 0 otherwise.
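
A sketch of this computation under the default settings (empirical priors and equal observation weights, so every normalized weight is 1/N and the posterior reduces to the fraction of the K neighbors in each class):

load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
Xnew = mean(meas);
[~,score] = predict(mdl,Xnew);
nbd = knnsearch(mdl.X,Xnew,'K',mdl.NumNeighbors);  % indices of the K neighbors
Ynbd = mdl.Y(nbd);                                 % their class labels
p = zeros(1,numel(mdl.ClassNames));
for j = 1:numel(mdl.ClassNames)
    p(j) = mean(strcmp(Ynbd,mdl.ClassNames{j}));   % fraction of neighbors in class j
end
p        % should match score for these default settings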

True Misclassification Cost

There are two costs associated with KNN classification: the true misclassification cost per class, and the expected misclassification cost per observation.

You can set the true misclassification cost per class in the Cost name-value pair when you run fitcknn. Cost(i,j) is the cost of classifying an observation into class j if its true class is i. By default, Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j. In other words, the cost is 0 for correct classification, and 1 for incorrect classification.
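
For instance, a sketch that makes one kind of error ten times as costly as the default (passing 'ClassNames' pins the row and column order of the cost matrix):

load fisheriris
C = ones(3) - eye(3);   % default structure: 0 on the diagonal, 1 elsewhere
C(3,2) = 10;            % classifying true virginica as versicolor now costs 10
Mdl = fitcknn(meas,species,'NumNeighbors',5,'Cost',C, ...
    'ClassNames',{'setosa','versicolor','virginica'});
Mdl.Cost                % the stored true misclassification cost matrix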

Expected Cost

The third output of predict is the expected misclassification cost per observation.

Suppose you have Nobs observations that you want to classify with a trained classifier mdl. Suppose you have K classes. You place the observations into a matrix Xnew with one observation per row. The command

[label,score,cost] = predict(mdl,Xnew)

returns, among other outputs, a cost matrix of size Nobs-by-K. Each row of the cost matrix contains the expected (average) cost of classifying the observation into each of the K classes. cost(n,k) is

$$\sum_{i=1}^{K} \hat{P}(i \mid X_{\mathrm{new}}(n))\, C(k \mid i),$$

where

  • $K$ is the number of classes.

  • $\hat{P}(i \mid X_{\mathrm{new}}(n))$ is the posterior probability of class $i$ for observation $X_{\mathrm{new}}(n)$.

  • $C(k \mid i)$ is the true misclassification cost of classifying an observation as $k$ when its true class is $i$.
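
Equivalently, the cost output is the matrix product of the posterior scores and the true misclassification cost matrix, as this sketch on the fisheriris data confirms:

load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
Xnew = [min(meas); mean(meas); max(meas)];
[~,score,cost] = predict(mdl,Xnew);
max(max(abs(cost - score*mdl.Cost)))   % expected: 0, since cost = score*mdl.Cost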

Examples

Construct a k-nearest neighbor classifier for Fisher's iris data, where k = 5. Evaluate some model predictions on new data.

Load the data.

load fisheriris
X = meas;
Y = species;

Construct a classifier for 5-nearest neighbors. It is good practice to standardize non-categorical predictor data.

mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',true);

Predict the classifications for flowers with minimum, mean, and maximum characteristics.

Xnew = [min(X);mean(X);max(X)];
[label,score,cost] = predict(mdl,Xnew)
label =

  3×1 cell array

    'versicolor'
    'versicolor'
    'virginica'


score =

    0.4000    0.6000         0
         0    1.0000         0
         0         0    1.0000


cost =

    0.6000    0.4000    1.0000
    1.0000         0    1.0000
    1.0000    1.0000         0

The second and third rows of the score and cost matrices contain only binary values, meaning all five nearest neighbors of the mean and maximum points have identical classifications. For the minimum point, three of the five nearest neighbors are versicolor and two are setosa, which yields posterior probabilities of 0.6 and 0.4, respectively.
