
fitcknn

Fit k-nearest neighbor classifier

Description


mdl = fitcknn(X,y) returns a classification model based on the input variables (also known as predictors, features, or attributes) X and output (response) y.


mdl = fitcknn(X,y,Name,Value) fits a model with additional options specified by one or more name-value pair arguments. For example, you can specify the tie-breaking algorithm, distance metric, or observation weights.

Examples


Train a k-Nearest Neighbor Classifier

Construct a k-nearest neighbor classifier for Fisher's iris data, where k, the number of nearest neighbors in the predictors, is 5.

load fisheriris
X = meas;
Y = species;


X is a numeric matrix that contains four measurements (sepal length, sepal width, petal length, and petal width) for 150 irises. Y is a cell array of strings that contains the corresponding iris species.

Train a 5-nearest neighbors classifier. It is good practice to standardize noncategorical predictor data.

Mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1)

Mdl =

ClassificationKNN
PredictorNames: {'x1'  'x2'  'x3'  'x4'}
ResponseName: 'Y'
ClassNames: {'setosa'  'versicolor'  'virginica'}
ScoreTransform: 'none'
NumObservations: 150
Distance: 'euclidean'
NumNeighbors: 5



Mdl is a trained ClassificationKNN classifier, and some of its properties display in the Command Window.

To access the properties of Mdl, use dot notation.

Mdl.ClassNames
Mdl.Prior

ans =

'setosa'
'versicolor'
'virginica'

ans =

0.3333    0.3333    0.3333



Mdl.Prior contains the class prior probabilities, which are settable using the name-value pair argument 'Prior' in fitcknn. The order of the class prior probabilities corresponds to the order of the classes in Mdl.ClassNames. By default, the prior probabilities are the respective relative frequencies of the classes in the data.

You can also reset the prior probabilities after training. For example, set the prior probabilities to 0.5, 0.2, and 0.3 respectively.

Mdl.Prior = [0.5 0.2 0.3];


You can pass Mdl to, for example, ClassificationKNN.predict to label new measurements, or ClassificationKNN.crossval to cross validate the classifier.
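For instance, a minimal follow-up sketch (the query measurements below are made up for illustration, not drawn from the data set):

```matlab
% Label two hypothetical irises with the trained model.
Xnew = [5.0 3.4 1.5 0.2;    % measurements resembling a setosa
        6.7 3.1 4.7 1.5];   % measurements resembling a versicolor
labels = predict(Mdl,Xnew);

% Estimate the generalization error by 10-fold cross validation.
CVMdl = crossval(Mdl);
cvError = kfoldLoss(CVMdl);
```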

Train a k-Nearest Neighbor Classifier Using the Minkowski Metric

load fisheriris
X = meas;
Y = species;


X is a numeric matrix that contains four measurements (sepal length, sepal width, petal length, and petal width) for 150 irises. Y is a cell array of strings that contains the corresponding iris species.

Train a 3-nearest neighbors classifier using the Minkowski metric. To use the Minkowski metric, you must use an exhaustive searcher. It is good practice to standardize noncategorical predictor data.

Mdl = fitcknn(X,Y,'NumNeighbors',3,...
'NSMethod','exhaustive','Distance','minkowski',...
'Standardize',1);


Mdl is a ClassificationKNN classifier.

You can examine the properties of Mdl by double-clicking Mdl in the Workspace window. This opens the Variable Editor.

Train a k-Nearest Neighbor Classifier Using a Custom Distance Metric

Train a k-nearest neighbor classifier using the chi-square distance.

load fisheriris
X = meas;    % Predictors
Y = species; % Response


The chi-square distance between the J-dimensional points x and z is

$\chi \left(x,z\right)=\sqrt{\sum _{j=1}^{J}{w}_{j}{\left({x}_{j}-{z}_{j}\right)}^{2}},$

where ${w}_{j}$ is a weight associated with dimension j.

Specify the chi-square distance function. The distance function must:

• Take one row of X, e.g., x, and the matrix Z.

• Compare x to each row of Z.

• Return a vector D of length nz, where nz is the number of rows of Z. Each element of D is the distance between the observation corresponding to x and the observations corresponding to each row of Z.

chiSqrDist = @(x,Z,wt)sqrt((bsxfun(@minus,x,Z).^2)*wt);


This example uses arbitrary weights for illustration.

Train a 3-nearest neighbor classifier. It is good practice to standardize noncategorical predictor data.

k = 3;
w = [0.3; 0.3; 0.2; 0.2];
KNNMdl = fitcknn(X,Y,'Distance',@(x,Z)chiSqrDist(x,Z,w),...
'NumNeighbors',k,'Standardize',1);


KNNMdl is a ClassificationKNN classifier.

Cross validate the KNN classifier using the default 10-fold cross validation. Examine the classification error.

rng(1); % For reproducibility
CVKNNMdl = crossval(KNNMdl);
classError = kfoldLoss(CVKNNMdl)

classError =

0.0600



CVKNNMdl is a ClassificationPartitionedModel classifier. The 10-fold classification error is 6%.

Compare the classifier with one that uses a different weighting scheme.

w2 = [0.2; 0.2; 0.3; 0.3];
CVKNNMdl2 = fitcknn(X,Y,'Distance',@(x,Z)chiSqrDist(x,Z,w2),...
'NumNeighbors',k,'KFold',10,'Standardize',1);
classError2 = kfoldLoss(CVKNNMdl2)

classError2 =

0.0400



The second weighting scheme yields a classifier that has better out-of-sample performance.

Input Arguments


X — Predictor values
numeric matrix

Predictor values, specified as a numeric matrix. Each column of X represents one variable, and each row represents one observation.

Data Types: single | double

y — Classification values
numeric vector | categorical vector | logical vector | character array | cell array of strings

Classification values, specified as a numeric vector, categorical vector, logical vector, character array, or cell array of strings, with the same number of rows as X. Each row of y represents the classification of the corresponding row of X.

Data Types: single | double | cell | logical | char

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'NumNeighbors',3,'NSMethod','exhaustive','Distance','minkowski' specifies a classifier for three-nearest neighbors using the exhaustive nearest neighbor search method and the Minkowski metric.

'BreakTies' — Tie-breaking algorithm
'smallest' (default) | 'nearest' | 'random'

Tie-breaking algorithm used by the predict method if multiple classes have the same smallest cost, specified as the comma-separated pair consisting of 'BreakTies' and one of the following:

• 'smallest' — Use the smallest index among tied groups.

• 'nearest' — Use the class with the nearest neighbor among tied groups.

• 'random' — Use a random tiebreaker among tied groups.

By default, ties occur when multiple classes have the same number of nearest points among the K nearest neighbors.

Example: 'BreakTies','nearest'

'BucketSize' — Maximum data points in node
50 (default) | positive integer value

Maximum number of data points in the leaf node of the kd-tree, specified as the comma-separated pair consisting of 'BucketSize' and a positive integer value. This argument is meaningful only when NSMethod is 'kdtree'.

Example: 'BucketSize',40

Data Types: single | double

'CategoricalPredictors' — Categorical predictor flag
[] (default) | 'all'

Categorical predictor flag, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the following:

• 'all' — All predictors are categorical.

• [] — No predictors are categorical.

When you set CategoricalPredictors to 'all', the default Distance is 'hamming'.

Example: 'CategoricalPredictors','all'

'ClassNames' — Class names
numeric vector | categorical vector | logical vector | character array | cell array of strings

Class names, specified as the comma-separated pair consisting of 'ClassNames' and an array representing the class names. Use the same data type as the values that exist in y.

Use ClassNames to order the classes or to select a subset of classes for training. The default is the class names in y.

Data Types: single | double | char | logical | cell

'Cost' — Cost of misclassification
square matrix | structure

Cost of misclassification of a point, specified as the comma-separated pair consisting of 'Cost' and one of the following:

• Square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i.

• Structure S having two fields: S.ClassNames containing the group names as a variable of the same type as y, and S.ClassificationCosts containing the cost matrix.

The default is Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j.

Data Types: single | double | struct
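For instance, a sketch using the Fisher iris variables from the examples above; the doubled penalty on virginica errors is an arbitrary choice for illustration:

```matlab
% Rows are true classes, columns are predicted classes, ordered as in
% Mdl.ClassNames ({'setosa','versicolor','virginica'}).
cost = [0 1 1;
        1 0 1;
        2 2 0];  % misclassifying a virginica costs twice as much
Mdl = fitcknn(X,Y,'NumNeighbors',5,'Cost',cost);
```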

'Cov' — Covariance matrix
nancov(X) (default) | positive definite matrix of scalar values

Covariance matrix, specified as the comma-separated pair consisting of 'Cov' and a positive definite matrix of scalar values representing the covariance matrix when computing the Mahalanobis distance. This argument is only valid when 'Distance' is 'mahalanobis'.

You cannot simultaneously specify 'Standardize' and either of 'Scale' or 'Cov'.

Data Types: single | double

'CrossVal' — Cross-validation flag
'off' (default) | 'on'

Cross-validation flag, specified as the comma-separated pair consisting of 'CrossVal' and either 'on' or 'off'. If 'on', fitcknn creates a cross-validated model with 10 folds. Use the 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' parameters to override this cross-validation setting. You can only use one parameter at a time to create a cross-validated model.

Alternatively, cross validate mdl later using the crossval method.

Example: 'Crossval','on'

'CVPartition' — Cross-validated model partition
cvpartition object

Cross-validated model partition, specified as the comma-separated pair consisting of 'CVPartition' and an object created using cvpartition. You can only use one of these four options at a time to create a cross-validated model: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.

'Distance' — Distance metric
valid distance metric string | function handle

Distance metric, specified as the comma-separated pair consisting of 'Distance' and a valid distance metric string or function handle. The allowable strings depend on the NSMethod parameter, which you set in fitcknn, and which exists as a field in ModelParameters. If you specify CategoricalPredictors as 'all', then the default distance metric is 'hamming'. Otherwise, the default distance metric is 'euclidean'.

NSMethod          Distance Metric Names
'exhaustive'      Any distance metric of ExhaustiveSearcher
'kdtree'          'cityblock', 'chebychev', 'euclidean', or 'minkowski'

For definitions, see Distance Metrics.

This table includes valid distance metrics of ExhaustiveSearcher.

Value — Description
'cityblock' — City block distance.
'chebychev' — Chebychev distance (maximum coordinate difference).
'correlation' — One minus the sample linear correlation between observations (treated as sequences of values).
'cosine' — One minus the cosine of the included angle between observations (treated as vectors).
'euclidean' — Euclidean distance.
'hamming' — Hamming distance, the percentage of coordinates that differ.
'jaccard' — One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ.
'mahalanobis' — Mahalanobis distance, computed using a positive definite covariance matrix C. The default value of C is the sample covariance matrix of X, as computed by nancov(X). To specify a different value for C, use the 'Cov' name-value pair argument.
'minkowski' — Minkowski distance. The default exponent is 2. To specify a different exponent, use the 'Exponent' name-value pair argument.
'seuclidean' — Standardized Euclidean distance. Each coordinate difference between X and a query point is scaled, meaning divided by a scale value S. The default value of S is the standard deviation computed from X, S = nanstd(X). To specify another value for S, use the 'Scale' name-value pair argument.
'spearman' — One minus the sample Spearman's rank correlation between observations (treated as sequences of values).
@distfun — Distance function handle. distfun has the form

function D2 = DISTFUN(ZI,ZJ)
% calculation of distance
...

where

• ZI is a 1-by-N vector containing one row of X.

• ZJ is an M2-by-N matrix containing multiple rows of X.

• D2 is an M2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).

Example: 'Distance','minkowski'

Data Types: function_handle

'DistanceWeight' — Distance weighting function
'equal' (default) | 'inverse' | 'squaredinverse' | function handle

Distance weighting function, specified as the comma-separated pair consisting of 'DistanceWeight' and either a function handle or one of the following strings specifying the distance weighting function.

DistanceWeight — Meaning
'equal' — No weighting
'inverse' — Weight is 1/distance
'squaredinverse' — Weight is 1/distance^2
@fcn — fcn is a function that accepts a matrix of nonnegative distances, and returns a matrix of the same size containing nonnegative distance weights. For example, 'squaredinverse' is equivalent to @(d)d.^(-2).

Example: 'DistanceWeight','inverse'

Data Types: function_handle
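As a sketch of the function-handle form (the exponential decay here is an arbitrary choice for illustration, and X and Y are assumed to hold predictor and response data, as in the examples above):

```matlab
% A custom weighting function must accept a matrix of nonnegative
% distances and return a same-size matrix of nonnegative weights.
expWeight = @(d) exp(-d);
Mdl = fitcknn(X,Y,'NumNeighbors',5,'DistanceWeight',expWeight);
```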

'Exponent' — Minkowski distance exponent
2 (default) | positive scalar value

Minkowski distance exponent, specified as the comma-separated pair consisting of 'Exponent' and a positive scalar value. This argument is only valid when 'Distance' is 'minkowski'.

Example: 'Exponent',3

Data Types: single | double

'Holdout' — Fraction of data for holdout validation
0 (default) | scalar value in the range [0,1]

Fraction of data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range [0,1]. Holdout validation tests the specified fraction of the data, and uses the remaining data for training.

If you use Holdout, you cannot use any of the 'CVPartition', 'KFold', or 'Leaveout' name-value pair arguments.

Example: 'Holdout',0.1

Data Types: single | double

'IncludeTies' — Tie inclusion flag
false (default) | true

Tie inclusion flag, specified as the comma-separated pair consisting of 'IncludeTies' and a logical value indicating whether predict includes all the neighbors whose distance values are equal to the Kth smallest distance. If IncludeTies is true, predict includes all these neighbors. Otherwise, predict uses exactly K neighbors.

Example: 'IncludeTies',true

Data Types: logical
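A brief sketch, assuming X and Y hold training data as in the examples above:

```matlab
% With ties included, predict can base its vote on more than
% NumNeighbors points when several training points tie at the Kth
% smallest distance from the query.
Mdl = fitcknn(X,Y,'NumNeighbors',4,'IncludeTies',true);
[label,score] = predict(Mdl,X(1,:));
```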

'KFold' — Number of folds
10 (default) | positive integer value

Number of folds to use in a cross-validated model, specified as the comma-separated pair consisting of 'KFold' and a positive integer value.

If you use 'KFold', you cannot use any of the 'CVPartition', 'Holdout', or 'Leaveout' name-value pair arguments.

Example: 'KFold',8

Data Types: single | double

'Leaveout' — Leave-one-out cross-validation flag
'off' (default) | 'on'

Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. Specify 'on' to use leave-one-out cross validation.

If you use 'Leaveout', you cannot use any of the 'CVPartition', 'Holdout', or 'KFold' name-value pair arguments.

Example: 'Leaveout','on'

'NSMethod' — Nearest neighbor search method
'kdtree' | 'exhaustive'

Nearest neighbor search method, specified as the comma-separated pair consisting of 'NSMethod' and 'kdtree' or 'exhaustive'.

• 'kdtree' — Create and use a kd-tree to find nearest neighbors. 'kdtree' is valid when the distance metric is one of the following:

• 'euclidean'

• 'cityblock'

• 'minkowski'

• 'chebychev'

• 'exhaustive' — Use the exhaustive search algorithm. The distance values from all points in X to each query point are computed to find nearest neighbors.

The default is 'kdtree' when X has 10 or fewer columns, X is not sparse, and the distance metric is a 'kdtree' type; otherwise, 'exhaustive'.

Example: 'NSMethod','exhaustive'

'NumNeighbors' — Number of nearest neighbors to find
1 (default) | positive integer value

Number of nearest neighbors in X to find for classifying each point when predicting, specified as the comma-separated pair consisting of 'NumNeighbors' and a positive integer value.

Example: 'NumNeighbors',3

Data Types: single | double

'PredictorNames' — Predictor variable names
{'x1','x2',...} (default) | cell array of strings

Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a cell array of strings containing the names for the predictor variables, in the order in which they appear in X.

Data Types: cell

'Prior' — Prior probabilities
'empirical' (default) | 'uniform' | vector of scalar values | structure

Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and one of the following.

• A string:

• 'empirical' determines class probabilities from class frequencies in y. If you pass observation weights, they are used to compute the class probabilities.

• 'uniform' sets all class probabilities equal.

• A vector (one scalar value for each class).

• A structure S with two fields:

• S.ClassNames containing the class names as a variable of the same type as y

• S.ClassProbs containing a vector of corresponding probabilities

If you set values for both Weights and Prior, the weights are renormalized to add up to the value of the prior probability in the respective class.

Example: 'Prior','uniform'

Data Types: single | double | struct
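For example, the structure form might look like this for the iris classes from the examples above (the probability values are arbitrary):

```matlab
% Field names ClassNames and ClassProbs are required by fitcknn.
S.ClassNames = {'setosa';'versicolor';'virginica'};
S.ClassProbs = [0.5; 0.2; 0.3];
Mdl = fitcknn(X,Y,'NumNeighbors',5,'Prior',S);
```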

'ResponseName' — Response variable name
'Y' (default) | string

Response variable name, specified as the comma-separated pair consisting of 'ResponseName' and a string containing the name of the response variable y.

Example: 'ResponseName','Response'

Data Types: char

'Scale' — Distance scale
nanstd(X) (default) | vector of nonnegative scalar values

Distance scale, specified as the comma-separated pair consisting of 'Scale' and a vector containing nonnegative scalar values with length equal to the number of columns in X. Each coordinate difference between X and a query point is scaled by the corresponding element of Scale. This argument is only valid when 'Distance' is 'seuclidean'.

You cannot simultaneously specify 'Standardize' and either of 'Scale' or 'Cov'.

Data Types: single | double

'ScoreTransform' — Score transform function
'none' (default) | 'doublelogit' | 'invlogit' | 'ismax' | 'logit' | 'sign' | 'symmetric' | 'symmetriclogit' | 'symmetricismax' | function handle

Score transform function, specified as the comma-separated pair consisting of 'ScoreTransform' and a string or function handle.

• If the value is a string, then it must correspond to a built-in function. This table summarizes the available, built-in functions.

String — Formula
'doublelogit' — 1/(1 + e^(–2x))
'invlogit' — log(x / (1 – x))
'ismax' — Set the score for the class with the largest score to 1, and scores for all other classes to 0.
'logit' — 1/(1 + e^(–x))
'none' — x (no transformation)
'sign' — –1 for x < 0; 0 for x = 0; 1 for x > 0
'symmetric' — 2x – 1
'symmetriclogit' — 2/(1 + e^(–x)) – 1
'symmetricismax' — Set the score for the class with the largest score to 1, and scores for all other classes to –1.

• For a MATLAB® function, or a function that you define, enter its function handle.

Mdl.ScoreTransform = @function;

function should accept a matrix (the original scores) and return a matrix of the same size (the transformed scores).

Example: 'ScoreTransform','sign'

Data Types: char | function_handle
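A sketch of the function-handle form; this particular transform (rescaling each row of scores to sum to 1) is an illustrative choice, not a built-in:

```matlab
% The handle must accept a matrix of scores and return a matrix of
% the same size.
rowNormalize = @(s) bsxfun(@rdivide, s, sum(s,2));
Mdl = fitcknn(X,Y,'NumNeighbors',5,'ScoreTransform',rowNormalize);
```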

'Standardize' — Flag to standardize predictors
false (default) | true

Flag to standardize the predictors, specified as the comma-separated pair consisting of 'Standardize' and true (1) or false (0).

If you set 'Standardize',true, then the software centers and scales each column of the predictor data (X) by the column mean and standard deviation, respectively.

The software does not standardize categorical predictors, and throws an error if all predictors are categorical.

You cannot simultaneously specify 'Standardize',1 and either of 'Scale' or 'Cov'.

It is good practice to standardize the predictor data.

Example: 'Standardize',true

Data Types: logical

'Weights' — Observation weights
ones(size(X,1),1) (default) | vector of scalar values

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a vector of scalar values. The length of Weights is the number of rows in X.

The software normalizes the weights in each class to add up to the value of the prior probability of the class.

Data Types: single | double

Output Arguments


mdl — Classifier model
classifier model object

k-nearest neighbor classifier model, returned as a classifier model object.

Note that using the 'CrossVal', 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' options results in a model of class ClassificationPartitionedModel. You cannot use a partitioned model for prediction directly, because this kind of model does not have a predict method; use its kfoldPredict method to obtain predictions on the held-out observations instead.

Otherwise, mdl is of class ClassificationKNN, and you can use the predict method to make predictions.

Alternatives

Although fitcknn can train a multiclass KNN classifier, you can reduce a multiclass learning problem to a series of KNN binary learners using fitcecoc.


Prediction

ClassificationKNN predicts the classification of a point Xnew using a procedure equivalent to this:

1. Find the NumNeighbors points in the training set X that are nearest to Xnew.

2. Find the NumNeighbors response values Y corresponding to those nearest points.

3. Assign the classification label Ynew that has the largest posterior probability among the values in Y.

For details, see Posterior Probability in the predict documentation.
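The three steps above can be sketched by hand for a single query point, assuming a Euclidean metric and equal distance weights (ClassificationKNN itself additionally applies priors, costs, and tie-breaking); X, Y, and the query values here follow the iris examples and are illustrative:

```matlab
k = 5;
xnew = [5.0 3.4 1.5 0.2];                    % hypothetical query point
d = sqrt(sum(bsxfun(@minus,X,xnew).^2,2));   % distances to training set
[~,idx] = sort(d);
nearest = idx(1:k);                          % step 1: k nearest rows
nearestY = Y(nearest);                       % step 2: their responses
ynew = mode(categorical(nearestY));          % step 3: most common class
```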

Algorithms

• NaNs or <undefined>s indicate missing observations. The following describes the behavior of fitcknn when the data set or weights contain missing observations.

• If any value of y or any weight is missing, then fitcknn removes those values from y, the weights, and the corresponding rows of X from the data. The software renormalizes the weights to sum to 1.

• If you specify to standardize predictors ('Standardize',1) or the standardized Euclidean distance ('Distance','seuclidean') without a scale, then fitcknn removes missing observations from individual predictors before computing the mean and standard deviation. In other words, the software implements nanmean and nanstd on each predictor.

• If you specify the Mahalanobis distance ('Distance','mahalanobis') without its covariance matrix, then fitcknn removes rows of X that contain at least one missing value. In other words, the software implements nancov on the predictor matrix X.

• Suppose that you set 'Standardize',1.

• If you also specify Prior or Weights, then the software takes the observation weights into account. Specifically, the weighted mean of predictor j is

${\overline{x}}_{j}=\sum _{{B}_{j}}^{}{w}_{k}{x}_{jk}$

and the weighted standard deviation is

${s}_{j}=\sqrt{\sum _{{B}_{j}}^{}{w}_{k}{\left({x}_{jk}-{\overline{x}}_{j}\right)}^{2}},$

where Bj is the set of indices k for which xjk and wk are not missing.

• If you also set 'Distance','mahalanobis' or 'Distance','seuclidean', then you cannot specify Scale or Cov. Instead, the software:

1. Computes the means and standard deviations of each predictor

2. Standardizes the data using the results of step 1

3. Computes the distance parameter values using their respective defaults.

• If you specify Scale and either of Prior or Weights, then the software scales observed distances by the weighted standard deviations.

• If you specify Cov and either of Prior or Weights, then the software applies the weighted covariance matrix to the distances. In other words,

$Cov=\frac{\sum _{B}{w}_{j}}{{\left(\sum _{B}{w}_{j}\right)}^{2}-\sum _{B}{w}_{j}^{2}}\sum _{B}^{}{w}_{j}{\left({x}_{j}-\overline{x}\right)}^{\prime }\left({x}_{j}-\overline{x}\right),$

where B is the set of indices j for which the observation xj does not have any missing values and wj is not missing.