
# predict

Class: ClassificationKNN

Predict labels using k-nearest neighbor classification model

## Syntax

• `label = predict(Mdl,X)`
• `[label,score,cost] = predict(Mdl,X)`

## Description

`label = predict(Mdl,X)` returns a vector of predicted class labels for the predictor data in the table or matrix `X`, based on the trained k-nearest neighbor classification model `Mdl`.

`[label,score,cost] = predict(Mdl,X)` also returns:

• A matrix of classification scores (`score`) indicating the likelihood that a label comes from a particular class. For k-nearest neighbor, scores are posterior probabilities.

• A matrix of expected classification costs (`cost`). For each observation in `X`, the predicted class label corresponds to the minimum expected classification cost among all classes.

## Input Arguments


`Mdl` — k-nearest neighbor classification model, specified as a `ClassificationKNN` model object returned by `fitcknn`.

`X` — Predictor data to be classified, specified as a numeric matrix or table.

Each row of `X` corresponds to one observation, and each column corresponds to one variable.

• For a numeric matrix:

• The variables making up the columns of `X` must have the same order as the predictor variables that trained `Mdl`.

• If you trained `Mdl` using a table (for example, `Tbl`), then `X` can be a numeric matrix if `Tbl` contains all numeric predictor variables. To treat all numeric predictors in `Tbl` as categorical during training, set `'CategoricalPredictors','all'` when you train using `fitcknn`. If `Tbl` contains heterogeneous predictors (for example, numeric and categorical data types) and `X` is a numeric matrix, then `predict` throws an error.

• For a table:

• `predict` does not support multicolumn variables or cell arrays other than cell arrays of character vectors.

• If you trained `Mdl` using a table (for example, `Tbl`), then all predictor variables in `X` must have the same variable names and data types as the variables used to train `Mdl` (stored in `Mdl.PredictorNames`). However, the column order of `X` does not need to correspond to the column order of `Tbl`. `Tbl` and `X` can contain additional variables (response variables, observation weights, and so on), but `predict` ignores them. See the sketch after this list.

• If you trained `Mdl` using a numeric matrix, then the predictor names in `Mdl.PredictorNames` and corresponding predictor variable names in `X` must be the same. To specify predictor names during training, see the `PredictorNames` name-value pair argument of `fitcknn`. All predictor variables in `X` must be numeric vectors. `X` can contain additional variables (response variables, observation weights, etc.), but `predict` ignores them.
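For illustration, here is a minimal sketch of the table workflow using Fisher's iris data (the variable names are chosen for this example):

```
% Sketch: train on a table, then predict on a table containing extra variables.
load fisheriris
Tbl = array2table(meas, ...
    'VariableNames',{'SepalLength','SepalWidth','PetalLength','PetalWidth'});
Tbl.Species = species;                        % response variable
Mdl = fitcknn(Tbl,'Species','NumNeighbors',5);
label = predict(Mdl,Tbl);                     % predict ignores Tbl.Species
```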

If you set `'Standardize',true` in `fitcknn` to train `Mdl`, then the software standardizes the columns of `X` using the corresponding means in `Mdl.Mu` and standard deviations in `Mdl.Sigma`.
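A minimal sketch of this standardization behavior, again using Fisher's iris data:

```
% Sketch: with 'Standardize',true, predict centers and scales new data
% using the training means and standard deviations stored in the model.
load fisheriris
Mdl = fitcknn(meas,species,'NumNeighbors',5,'Standardize',true);
Mdl.Mu        % 1-by-4 vector of training column means
Mdl.Sigma     % 1-by-4 vector of training column standard deviations
```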

Data Types: `table` | `double` | `single`

## Output Arguments


`label` — Predicted class labels for the observations (rows) in `X`, returned as a vector with length equal to the number of rows of `X`. The label is the class with minimal expected cost. See Predicted Class Label.

`score` — Predicted class scores or posterior probabilities, returned as a numeric matrix of size `N`-by-`K`. `N` is the number of observations (rows) in `X`, and `K` is the number of classes (in `Mdl.ClassNames`). `score(i,j)` is the posterior probability that observation `i` in `X` is of class `j` in `Mdl.ClassNames`. See Posterior Probability.

`cost` — Expected classification costs, returned as a matrix of size `N`-by-`K`. `N` is the number of observations (rows) in `X`, and `K` is the number of classes (in `Mdl.ClassNames`). `cost(i,j)` is the cost of classifying row `i` of `X` as class `j` in `Mdl.ClassNames`. See Expected Cost.

## Definitions

### Predicted Class Label

`predict` classifies so as to minimize the expected classification cost:

$$\hat{y} = \underset{y=1,\dots,K}{\operatorname{arg\,min}} \sum_{k=1}^{K} \hat{P}(k \mid x)\, C(y \mid k),$$

where

• $\hat{y}$ is the predicted classification.

• $K$ is the number of classes.

• $\hat{P}(k \mid x)$ is the posterior probability of class $k$ for observation $x$.

• $C(y \mid k)$ is the cost of classifying an observation as $y$ when its true class is $k$.
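As a sanity check, the label can be recovered from the expected cost matrix that `predict` returns. A sketch, assuming Fisher's iris data and no ties in the expected cost:

```
% Sketch: labels are the classes that minimize the expected classification cost.
load fisheriris
Mdl = fitcknn(meas,species,'NumNeighbors',5);
[label,~,expCost] = predict(Mdl,meas(1:5,:));
[~,idx] = min(expCost,[],2);                  % column index of the minimum cost
isequal(label,Mdl.ClassNames(idx))            % true (when there are no ties)
```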

### Posterior Probability

For a vector (single query point) `Xnew` and model `mdl`, let:

• `K` be the number of nearest neighbors used in prediction, `mdl.NumNeighbors`

• `nbd(mdl,Xnew)` be the `K` nearest neighbors to `Xnew` in `mdl.X`

• `Y(nbd)` be the classifications of the points in `nbd(mdl,Xnew)`, namely `mdl.Y(nbd)`

• `W(nbd)` be the weights of the points in `nbd(mdl,Xnew)`

• `prior` be the priors of the classes in `mdl.Y`

If there is a vector of prior probabilities, then the observation weights `W` are normalized by class to sum to the priors. This might involve a calculation for the point `Xnew`, because weights can depend on the distance from `Xnew` to the points in `mdl.X`.

The posterior probability p(j|`Xnew`) is

$$p(j \mid \text{Xnew}) = \frac{\sum_{i \in \text{nbd}} W(i)\, \mathbf{1}_{\left(Y(X(i)) = j\right)}}{\sum_{i \in \text{nbd}} W(i)}.$$

Here, $\mathbf{1}_{\left(Y(X(i)) = j\right)}$ equals `1` when `mdl.Y(i) = j`, and `0` otherwise.
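The following sketch reproduces this posterior for one query point, assuming Euclidean distance, no standardization, and equal observation weights:

```
% Sketch: compute the unweighted k-NN posterior by hand and compare to predict.
load fisheriris
Mdl = fitcknn(meas,species,'NumNeighbors',5);
xnew = mean(meas);
nbd = knnsearch(Mdl.X,xnew,'K',Mdl.NumNeighbors);  % indices of the k neighbors
p = zeros(1,numel(Mdl.ClassNames));
for j = 1:numel(Mdl.ClassNames)
    p(j) = mean(strcmp(Mdl.Y(nbd),Mdl.ClassNames{j}));
end
p                                 % matches the second output of predict(Mdl,xnew)
```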

### True Misclassification Cost

There are two costs associated with KNN classification: the true misclassification cost per class, and the expected misclassification cost per observation.

You can set the true misclassification cost per class in the `Cost` name-value pair when you run `fitcknn`. `Cost(i,j)` is the cost of classifying an observation into class `j` if its true class is `i`. By default, `Cost(i,j)=1` if `i~=j`, and `Cost(i,j)=0` if `i=j`. In other words, the cost is `0` for correct classification, and `1` for incorrect classification.
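For example, a sketch that raises the penalty for one type of error (the cost values here are hypothetical):

```
% Sketch: make misclassifying true class 3 (virginica) more expensive.
load fisheriris
C = [0 1 1; 1 0 1; 4 4 0];    % Cost(i,j): classify as class j when truth is i
Mdl = fitcknn(meas,species,'NumNeighbors',5,'Cost',C);
```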

### Expected Cost

There are two costs associated with KNN classification: the true misclassification cost per class, and the expected misclassification cost per observation. The third output of `predict` is the expected misclassification cost per observation.

Suppose you have `Nobs` observations that you want to classify with a trained classifier `mdl`, and that there are `K` classes. Place the observations in a matrix `Xnew` with one observation per row. The command

`[label,score,cost] = predict(mdl,Xnew)`

returns, among other outputs, a `cost` matrix of size `Nobs`-by-`K`. Each row of the `cost` matrix contains the expected (average) cost of classifying the observation into each of the `K` classes. `cost(n,k)` is

$$\sum_{i=1}^{K} \hat{P}\left(i \mid \text{Xnew}(n)\right)\, C(k \mid i),$$

where

• $K$ is the number of classes.

• $\hat{P}\left(i \mid \text{Xnew}(n)\right)$ is the posterior probability of class $i$ for observation $\text{Xnew}(n)$.

• $C(k \mid i)$ is the true misclassification cost of classifying an observation as $k$ when its true class is $i$.
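In matrix form, this says the `cost` output is the `score` matrix times the cost matrix, which the following sketch verifies numerically (using Fisher's iris data):

```
% Sketch: check that cost equals score*mdl.Cost up to floating-point error.
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
[~,score,cost] = predict(mdl,meas(1:10,:));
max(max(abs(cost - score*mdl.Cost)))          % essentially zero
```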

## Examples


Construct a k-nearest neighbor classifier for Fisher's iris data, where k = 5. Evaluate some model predictions on new data.

```
load fisheriris
X = meas;
Y = species;
```

Construct a classifier for 5-nearest neighbors. It is good practice to standardize non-categorical predictor data.

```
mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1);
```

Predict the classifications for flowers with minimum, mean, and maximum characteristics.

```
Xnew = [min(X);mean(X);max(X)];
[label,score,cost] = predict(mdl,Xnew)
```
```
label = 3×1 cell array
    'versicolor'
    'versicolor'
    'virginica'

score =

    0.4000    0.6000         0
         0    1.0000         0
         0         0    1.0000

cost =

    0.6000    0.4000    1.0000
    1.0000         0    1.0000
    1.0000    1.0000         0
```

The second and third rows of the `score` and `cost` matrices contain only binary values, meaning all five nearest neighbors of the mean and maximum points have identical classifications. For the minimum point, two of the five nearest neighbors are setosa and three are versicolor, so the predicted label is versicolor.