Fit k-nearest neighbor classifier
mdl = fitcknn(X,y) returns a classification model based on the input variables (also known as predictors, features, or attributes) X and output (response) y.
mdl = fitcknn(X,y,Name,Value) fits a model with additional options specified by one or more name-value pair arguments. For example, you can specify the tie-breaking algorithm, distance metric, or observation weights.
Construct a k-nearest neighbor classifier for Fisher's iris data, where k, the number of nearest neighbors in the predictors, is 5.
Load Fisher's iris data.
load fisheriris
X = meas;
Y = species;
X is a numeric matrix that contains four petal measurements for 150 irises. Y is a cell array of strings that contains the corresponding iris species.
Train a 5-nearest neighbors classifier. It is good practice to standardize noncategorical predictor data.
Mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1)
Mdl = 

  ClassificationKNN
        PredictorNames: {'x1'  'x2'  'x3'  'x4'}
          ResponseName: 'Y'
            ClassNames: {'setosa'  'versicolor'  'virginica'}
        ScoreTransform: 'none'
       NumObservations: 150
              Distance: 'euclidean'
          NumNeighbors: 5
Mdl is a trained ClassificationKNN classifier, and some of its properties display in the Command Window.
To access the properties of Mdl, use dot notation.
Mdl.ClassNames

ans = 
    'setosa'
    'versicolor'
    'virginica'

Mdl.Prior

ans =
    0.3333    0.3333    0.3333
Mdl.Prior contains the class prior probabilities, which are settable using the name-value pair argument 'Prior' in fitcknn. The order of the class prior probabilities corresponds to the order of the classes in Mdl.ClassNames. By default, the prior probabilities are the respective relative frequencies of the classes in the data.
You can also reset the prior probabilities after training. For example, set the prior probabilities to 0.5, 0.2, and 0.3 respectively.
Mdl.Prior = [0.5 0.2 0.3];
You can pass Mdl to, for example, ClassificationKNN.predict to label new measurements, or ClassificationKNN.crossval to cross-validate the classifier.
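For example, a minimal sketch of both uses, assuming Mdl trained above (the new measurements are illustrative values, not from the data set):

```matlab
% Label a hypothetical new iris (values are illustrative)
Xnew = [5.0 3.4 1.5 0.2];
label = predict(Mdl,Xnew)

% Estimate the generalization error with 10-fold cross-validation
CVMdl = crossval(Mdl);
L = kfoldLoss(CVMdl)
```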
Load Fisher's iris data set.
load fisheriris
X = meas;
Y = species;
X is a numeric matrix that contains four petal measurements for 150 irises. Y is a cell array of strings that contains the corresponding iris species.
Train a 3-nearest neighbors classifier using the Minkowski metric. To use the Minkowski metric, you must use an exhaustive searcher. It is good practice to standardize noncategorical predictor data.
Mdl = fitcknn(X,Y,'NumNeighbors',3,...
    'NSMethod','exhaustive','Distance','minkowski',...
    'Standardize',1);
Mdl is a ClassificationKNN classifier.
You can examine the properties of Mdl by double-clicking Mdl in the Workspace window. This opens the Variable Editor.
Train a k-nearest neighbor classifier using the chi-square distance.
Load Fisher's iris data set.
load fisheriris
X = meas;    % Predictors
Y = species; % Response
The chi-square distance between J-dimensional points x and z is

$$\chi (x,z)=\sqrt{{\displaystyle \sum _{j=1}^{J}}{w}_{j}{\left({x}_{j}-{z}_{j}\right)}^{2}},$$

where w_{j} is a weight associated with dimension j.
Specify the chi-square distance function. The distance function must:
Take one row of X, e.g., x, and the matrix Z.
Compare x to each row of Z.
Return a vector D of length m, where m is the number of rows of Z. Each element of D is the distance between the observation corresponding to x and the observations corresponding to each row of Z.
chiSqrDist = @(x,Z,wt)sqrt((bsxfun(@minus,x,Z).^2)*wt);
This example uses arbitrary weights for illustration.
Train a 3-nearest neighbor classifier. It is good practice to standardize noncategorical predictor data.
k = 3;
w = [0.3; 0.3; 0.2; 0.2];
KNNMdl = fitcknn(X,Y,'Distance',@(x,Z)chiSqrDist(x,Z,w),...
    'NumNeighbors',k,'Standardize',1);
KNNMdl is a ClassificationKNN classifier.
Cross-validate the KNN classifier using the default 10-fold cross-validation. Examine the classification error.
rng(1); % For reproducibility
CVKNNMdl = crossval(KNNMdl);
classError = kfoldLoss(CVKNNMdl)
classError = 0.0600
CVKNNMdl is a ClassificationPartitionedModel classifier. The 10-fold classification error is 6%.
Compare the classifier with one that uses a different weighting scheme.
w2 = [0.2; 0.2; 0.3; 0.3];
CVKNNMdl2 = fitcknn(X,Y,'Distance',@(x,Z)chiSqrDist(x,Z,w2),...
    'NumNeighbors',k,'KFold',10,'Standardize',1);
classError2 = kfoldLoss(CVKNNMdl2)
classError2 = 0.0400
The second weighting scheme yields a classifier that has better out-of-sample performance.
Predictor values, specified as a numeric matrix. Each column of X represents one variable, and each row represents one observation.
Data Types: single | double
Classification values, specified as a numeric vector, categorical vector, logical vector, character array, or cell array of strings, with the same number of rows as X. Each row of y represents the classification of the corresponding row of X.
Data Types: single | double | cell | logical | char
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'NumNeighbors',3,'NSMethod','exhaustive','Distance','minkowski' specifies a classifier for three nearest neighbors using the exhaustive nearest neighbor search method and the Minkowski metric.

Tie-breaking algorithm used by the predict method if multiple classes have the same smallest cost, specified as the comma-separated pair consisting of 'BreakTies' and one of the following:
'smallest' — Use the smallest index among tied groups.
'nearest' — Use the class with the nearest neighbor among tied groups.
'random' — Use a random tiebreaker among tied groups.
By default, ties occur when multiple classes have the same number of nearest points among the K nearest neighbors.
Example: 'BreakTies','nearest'
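As a sketch, assuming the iris predictors X and labels Y loaded earlier: an even neighbor count makes ties more likely, and 'nearest' resolves them toward the closest neighbor's class.

```matlab
% With an even NumNeighbors, two classes can tie in the vote;
% 'nearest' then picks the class of the single closest neighbor.
Mdl = fitcknn(X,Y,'NumNeighbors',4,'BreakTies','nearest');
```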
Maximum number of data points in the leaf node of the kd-tree, specified as the comma-separated pair consisting of 'BucketSize' and a positive integer value. This argument is meaningful only when NSMethod is 'kdtree'.
Example: 'BucketSize',40
Data Types: single  double
Categorical predictor flag, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the following:
'all' — All predictors are categorical.
[] — No predictors are categorical.
When you set CategoricalPredictors to 'all', the default Distance is 'hamming'.
Example: 'CategoricalPredictors','all'
Class names, specified as the comma-separated pair consisting of 'ClassNames' and an array representing the class names. Use the same data type as the values that exist in y.
Use ClassNames to order the classes or to select a subset of classes for training. The default is the class names in y.
Data Types: single  double  char  logical  cell
Cost of misclassification of a point, specified as the comma-separated pair consisting of 'Cost' and one of the following:
Square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i.
Structure S having two fields: S.ClassNames containing the group names as a variable of the same type as y, and S.ClassificationCosts containing the cost matrix.
The default is Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j.
Data Types: single  double  struct
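For illustration, a square cost matrix that penalizes one particular mistake more heavily (the weighting is arbitrary), assuming X and Y from the iris example:

```matlab
% Rows are true classes, columns are predicted classes, in the
% order given by 'ClassNames'. Misclassifying a true 'virginica'
% as 'versicolor' costs 5 instead of the default 1.
C = [0 1 1;
     1 0 1;
     1 5 0];
Mdl = fitcknn(X,Y,'Cost',C,...
    'ClassNames',{'setosa','versicolor','virginica'});
```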
Covariance matrix, specified as the comma-separated pair consisting of 'Cov' and a positive definite matrix of scalar values representing the covariance matrix when computing the Mahalanobis distance. This argument is only valid when 'Distance' is 'mahalanobis'.
You cannot simultaneously specify 'Standardize' and either of 'Scale' or 'Cov'.
Data Types: single  double
Cross-validation flag, specified as the comma-separated pair consisting of 'CrossVal' and either 'on' or 'off'. If 'on', fitcknn creates a cross-validated model with 10 folds. Use the 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' parameters to override this cross-validation setting. You can only use one parameter at a time to create a cross-validated model.
Alternatively, cross-validate mdl later using the crossval method.
Example: 'Crossval','on'
Cross-validated model partition, specified as the comma-separated pair consisting of 'CVPartition' and an object created using cvpartition. You can only use one of these four options at a time to create a cross-validated model: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.
Distance metric, specified as the comma-separated pair consisting of 'Distance' and a valid distance metric string or function handle. The allowable strings depend on the NSMethod parameter, which you set in fitcknn, and which exists as a field in ModelParameters. If you specify CategoricalPredictors as 'all', then the default distance metric is 'hamming'. Otherwise, the default distance metric is 'euclidean'.
NSMethod       | Distance Metric Names
---------------|------------------------------------------------------
'exhaustive'   | Any distance metric of ExhaustiveSearcher
'kdtree'       | 'cityblock', 'chebychev', 'euclidean', or 'minkowski'
For definitions, see Distance Metrics.
This table includes valid distance metrics of ExhaustiveSearcher.
Value           | Description
----------------|-------------------------------------------------------------
'cityblock'     | City block distance.
'chebychev'     | Chebychev distance (maximum coordinate difference).
'correlation'   | One minus the sample linear correlation between observations (treated as sequences of values).
'cosine'        | One minus the cosine of the included angle between observations (treated as vectors).
'euclidean'     | Euclidean distance.
'hamming'       | Hamming distance, percentage of coordinates that differ.
'jaccard'       | One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ.
'mahalanobis'   | Mahalanobis distance, computed using a positive definite covariance matrix C. The default value of C is the sample covariance matrix of X, as computed by nancov(X). To specify a different value for C, use the 'Cov' name-value pair argument.
'minkowski'     | Minkowski distance. The default exponent is 2. To specify a different exponent, use the 'Exponent' name-value pair argument.
'seuclidean'    | Standardized Euclidean distance. Each coordinate difference between X and a query point is scaled, meaning divided by a scale value S. The default value of S is the standard deviation computed from X, S = nanstd(X). To specify another value for S, use the 'Scale' name-value pair argument.
'spearman'      | One minus the sample Spearman's rank correlation between observations (treated as sequences of values).
@distfun        | Distance function handle.

distfun has the form

function D2 = distfun(ZI,ZJ)
% calculation of distance
...

where ZI is a 1-by-N vector containing a single row of X, ZJ is an M2-by-N matrix containing multiple rows of X, and D2 is an M2-by-1 vector of distances whose Jth element is the distance between the observations ZI and ZJ(J,:).
Example: 'Distance','minkowski'
Data Types: function_handle
Distance weighting function, specified as the comma-separated pair consisting of 'DistanceWeight' and either a function handle or one of the following strings specifying the distance weighting function.
DistanceWeight     | Meaning
-------------------|---------------------------------------------------------
'equal'            | No weighting
'inverse'          | Weight is 1/distance
'squaredinverse'   | Weight is 1/distance^{2}
@fcn               | fcn is a function that accepts a matrix of nonnegative distances, and returns a matrix of the same size containing nonnegative distance weights. For example, 'squaredinverse' is equivalent to @(d)d.^(-2).
Example: 'DistanceWeight','inverse'
Data Types: function_handle
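A function handle lets you supply any decay you like. A sketch with an exponentially decaying weight (the choice of decay is arbitrary), assuming X and Y from the iris example:

```matlab
% Weight each neighbor by exp(-distance); nearer neighbors
% contribute more to the vote than farther ones.
expWeight = @(d) exp(-d);
Mdl = fitcknn(X,Y,'NumNeighbors',5,'DistanceWeight',expWeight);
```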
Minkowski distance exponent, specified as the comma-separated pair consisting of 'Exponent' and a positive scalar value. This argument is only valid when 'Distance' is 'minkowski'.
Example: 'Exponent',3
Data Types: single  double
Fraction of data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range [0,1]. Holdout validation tests the specified fraction of the data, and uses the remaining data for training.
If you use Holdout, you cannot use any of the 'CVPartition', 'KFold', or 'Leaveout' name-value pair arguments.
Example: 'Holdout',0.1
Data Types: single  double
Tie inclusion flag, specified as the comma-separated pair consisting of 'IncludeTies' and a logical value indicating whether predict includes all the neighbors whose distance values are equal to the Kth smallest distance. If IncludeTies is true, predict includes all these neighbors. Otherwise, predict uses exactly K neighbors.
Example: 'IncludeTies',true
Data Types: logical
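A sketch assuming X and Y from the iris example; with ties included, some query points may be voted on by more than NumNeighbors neighbors:

```matlab
% Include every neighbor tied at the 3rd smallest distance,
% rather than keeping exactly 3 neighbors.
Mdl = fitcknn(X,Y,'NumNeighbors',3,'IncludeTies',true);
```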
Number of folds to use in a cross-validated model, specified as the comma-separated pair consisting of 'KFold' and a positive integer value.
If you use 'KFold', you cannot use any of the 'CVPartition', 'Holdout', or 'Leaveout' name-value pair arguments.
Example: 'KFold',8
Data Types: single  double
Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. Specify 'on' to use leave-one-out cross-validation.
If you use 'Leaveout', you cannot use any of the 'CVPartition', 'Holdout', or 'KFold' name-value pair arguments.
Example: 'Leaveout','on'
Nearest neighbor search method, specified as the comma-separated pair consisting of 'NSMethod' and 'kdtree' or 'exhaustive'.
'kdtree' — Create and use a kd-tree to find nearest neighbors. 'kdtree' is valid when the distance metric is one of the following:
'euclidean'
'cityblock'
'minkowski'
'chebychev'
'exhaustive' — Use the exhaustive search algorithm. The distance values from all points in X to each query point are computed to find nearest neighbors.
The default is 'kdtree' when X has 10 or fewer columns, X is not sparse, and the distance metric is a 'kdtree' type; otherwise, 'exhaustive'.
Example: 'NSMethod','exhaustive'
Number of nearest neighbors in X to find for classifying each point when predicting, specified as the comma-separated pair consisting of 'NumNeighbors' and a positive integer value.
Example: 'NumNeighbors',3
Data Types: single  double
Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a cell array of strings containing the names for the predictor variables, in the order in which they appear in X.
Data Types: cell
Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and one of the following.
A string:
'empirical' determines class probabilities from class frequencies in y. If you pass observation weights, they are used to compute the class probabilities.
'uniform' sets all class probabilities equal.
A vector (one scalar value for each class).
A structure S with two fields:
S.ClassNames containing the class names as a variable of the same type as y
S.ClassProbs containing a vector of corresponding probabilities
If you set values for both Weights and Prior, the weights are renormalized to add up to the value of the prior probability in the respective class.
Example: 'Prior','uniform'
Data Types: single  double  struct
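For illustration, the structure form looks like this (the probabilities are arbitrary), assuming X and Y from the iris example:

```matlab
% Field names ClassNames and ClassProbs are required by fitcknn.
S.ClassNames = {'setosa','versicolor','virginica'};
S.ClassProbs = [0.5 0.2 0.3];
Mdl = fitcknn(X,Y,'Prior',S);
```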
Response variable name, specified as the comma-separated pair consisting of 'ResponseName' and a string containing the name of the response variable y.
Example: 'ResponseName','Response'
Data Types: char
Distance scale, specified as the comma-separated pair consisting of 'Scale' and a vector containing nonnegative scalar values with length equal to the number of columns in X. Each coordinate difference between X and a query point is scaled by the corresponding element of Scale. This argument is only valid when 'Distance' is 'seuclidean'.
You cannot simultaneously specify 'Standardize' and either of 'Scale' or 'Cov'.
Data Types: single  double
Score transform function, specified as the comma-separated pair consisting of 'ScoreTransform' and a string or function handle.
If the value is a string, then it must correspond to a built-in function. This table summarizes the available built-in functions.
String             | Formula
-------------------|------------------------------------------------------------
'doublelogit'      | 1/(1 + e^{–2x})
'invlogit'         | log(x / (1 – x))
'ismax'            | Set the score for the class with the largest score to 1, and scores for all other classes to 0.
'logit'            | 1/(1 + e^{–x})
'none'             | x (no transformation)
'sign'             | –1 for x < 0; 0 for x = 0; 1 for x > 0
'symmetric'        | 2x – 1
'symmetriclogit'   | 2/(1 + e^{–x}) – 1
'symmetricismax'   | Set the score for the class with the largest score to 1, and scores for all other classes to –1.
For a MATLAB^{®} function, or a function that you define, enter its function handle.
Mdl.ScoreTransform = @function;
function should accept a matrix (the original scores) and return a matrix of the same size (the transformed scores).
Example: 'ScoreTransform','sign'
Data Types: char  function_handle
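A sketch of a user-defined transform (a row-wise softmax, chosen only for illustration; the element-wise division relies on implicit expansion, available in MATLAB R2016b and later), assuming a trained Mdl:

```matlab
% Map each row of raw scores to nonnegative values that sum to 1.
mySoftmax = @(s) exp(s)./sum(exp(s),2);
Mdl.ScoreTransform = mySoftmax;
```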
Flag to standardize the predictors, specified as the comma-separated pair consisting of 'Standardize' and true (1) or false (0).
If you set 'Standardize',true, then the software centers and scales each column of the predictor data (X) by the column mean and standard deviation, respectively.
The software does not standardize categorical predictors, and throws an error if all predictors are categorical.
You cannot simultaneously specify 'Standardize',1 and either of 'Scale' or 'Cov'.
It is good practice to standardize the predictor data.
Example: 'Standardize',true
Data Types: logical
Observation weights, specified as the comma-separated pair consisting of 'Weights' and a vector of scalar values. The length of Weights is the number of rows in X.
The software normalizes the weights in each class to add up to the value of the prior probability of the class.
Data Types: single  double
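A sketch with illustrative weights, assuming X and Y from the iris example:

```matlab
% Give later observations up to three times the influence of earlier
% ones; fitcknn renormalizes the weights within each class.
w = linspace(1,3,size(X,1))';
Mdl = fitcknn(X,Y,'Weights',w);
```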
k-nearest neighbor classifier model, returned as a classifier model object.
Note that using the 'CrossVal', 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' options results in a model of class ClassificationPartitionedModel. You cannot use a partitioned model for prediction, so this kind of model does not have a predict method.
Otherwise, mdl is of class ClassificationKNN, and you can use the predict method to make predictions.
Although fitcknn can train a multiclass KNN classifier, you can reduce a multiclass learning problem to a series of KNN binary learners using fitcecoc.
ClassificationKNN predicts the classification of a point Xnew using a procedure equivalent to this:
Find the NumNeighbors points in the training set X that are nearest to Xnew.
Find the NumNeighbors response values in Y corresponding to those nearest points.
Assign the classification label Ynew that has the largest posterior probability among the values in Y.
For details, see Posterior Probability in the predict documentation.
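The three steps above can be sketched directly, assuming the training data X and Y and a query point Xnew; this illustrates only an unweighted majority vote, whereas the actual predict method also folds in distance weights, priors, and costs:

```matlab
k = 5;
idx = knnsearch(X,Xnew,'K',k);            % step 1: nearest-neighbor indices
neighborLabels = Y(idx);                  % step 2: their response values
Ynew = mode(categorical(neighborLabels))  % step 3: the majority class
```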
NaNs or <undefined>s indicate missing observations. The following describes the behavior of fitcknn when the data set or weights contain missing observations.
If any value of y or any weight is missing, then fitcknn removes those values from y, the weights, and the corresponding rows of X from the data. The software renormalizes the weights to sum to 1.
If you specify to standardize predictors ('Standardize',1) or the standardized Euclidean distance ('Distance','seuclidean') without a scale, then fitcknn removes missing observations from individual predictors before computing the mean and standard deviation. In other words, the software implements nanmean and nanstd on each predictor.
If you specify the Mahalanobis distance ('Distance','mahalanobis') without its covariance matrix, then fitcknn removes rows of X that contain at least one missing value. In other words, the software implements nancov on the predictor matrix X.
Suppose that you set 'Standardize',1.
If you also specify Prior or Weights, then the software takes the observation weights into account. Specifically, the weighted mean of predictor j is
$${\overline{x}}_{j}={\displaystyle \sum}_{{B}_{j}}^{}{w}_{k}{x}_{jk}$$
and the weighted standard deviation is
$${s}_{j}=\sqrt{{\displaystyle \sum _{{B}_{j}}}{w}_{k}{\left({x}_{jk}-{\overline{x}}_{j}\right)}^{2}},$$
where B_{j} is the set of indices k for which x_{jk} and w_{k} are not missing.
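The two formulas can be sketched for a single predictor column, assuming a column vector xj of predictor values and a weight vector w normalized to sum to 1 (both variable names are illustrative):

```matlab
% Keep only indices where neither the value nor the weight is missing.
B = ~isnan(xj) & ~isnan(w);
wB = w(B)/sum(w(B));                  % renormalize surviving weights
xbar = sum(wB.*xj(B));                % weighted mean
s = sqrt(sum(wB.*(xj(B)-xbar).^2));   % weighted standard deviation
```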
If you also set 'Distance','mahalanobis' or 'Distance','seuclidean', then you cannot specify Scale or Cov. Instead, the software:
Computes the means and standard deviations of each predictor
Standardizes the data using the results of step 1
Computes the distance parameter values using their respective defaults.
If you specify Scale and either of Prior or Weights, then the software scales observed distances by the weighted standard deviations.
If you specify Cov and either of Prior or Weights, then the software applies the weighted covariance matrix to the distances. In other words,
$$Cov=\frac{{\displaystyle \sum _{B}{w}_{j}}}{{\left({\displaystyle \sum _{B}{w}_{j}}\right)}^{2}-{\displaystyle \sum _{B}{w}_{j}^{2}}}{\displaystyle \sum _{B}}{w}_{j}{\left({x}_{j}-\overline{x}\right)}^{\prime}\left({x}_{j}-\overline{x}\right),$$
where B is the set of indices j for which the observation x_{j} does not have any missing values and w_{j} is not missing.
ClassificationKNN | ClassificationPartitionedModel | fitcecoc | fitensemble | predict | templateKNN