ClassificationKNN.fit

Class: ClassificationKNN

Fit k-nearest neighbor classifier (to be removed)

ClassificationKNN.fit will be removed in a future release. Use fitcknn instead.

Syntax

mdl = ClassificationKNN.fit(X,y)
mdl = ClassificationKNN.fit(X,y,Name,Value)

Description

mdl = ClassificationKNN.fit(X,y) returns a classification model based on the input variables (also known as predictors, features, or attributes) X and output (response) y.

mdl = ClassificationKNN.fit(X,y,Name,Value) fits a model with additional options specified by one or more Name,Value pair arguments.

If you use one of these options, mdl is of class ClassificationPartitionedModel: 'CrossVal', 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'. Otherwise, mdl is of class ClassificationKNN.
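For example (a sketch using Fisher's iris data; the comments note the class each call returns):

```matlab
% The class of the returned model depends on the options:
load fisheriris
mdl1 = ClassificationKNN.fit(meas,species);           % class ClassificationKNN
mdl2 = ClassificationKNN.fit(meas,species,'KFold',5); % class ClassificationPartitionedModel
```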

Input Arguments


X — Predictor values
numeric matrix

Predictor values, specified as a numeric matrix. Each column of X represents one variable, and each row represents one observation.

Data Types: single | double

y — Classification values
numeric vector | categorical vector | logical vector | character array | cell array of strings

Classification values, specified as a numeric vector, categorical vector, logical vector, character array, or cell array of strings, with the same number of rows as X. Each row of y represents the classification of the corresponding row of X.

Data Types: single | double | cell | logical | char

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'BreakTies' — Tie-breaking algorithm
'smallest' (default) | 'nearest' | 'random'

Tie-breaking algorithm used by the predict method if multiple classes have the same smallest cost, specified as the comma-separated pair consisting of 'BreakTies' and one of the following:

  • 'smallest' — Use the smallest index among tied groups.

  • 'nearest' — Use the class with the nearest neighbor among tied groups.

  • 'random' — Use a random tiebreaker among tied groups.

Ties occur when multiple classes have the same number of nearest points among the K nearest neighbors.

Example: 'BreakTies','nearest'

'BucketSize' — Maximum data points in node
50 (default) | positive integer value

Maximum number of data points in the leaf node of the kd-tree, specified as the comma-separated pair consisting of 'BucketSize' and a positive integer value. This argument is meaningful only when NSMethod is 'kdtree'.

Example: 'BucketSize',40

Data Types: single | double

'CategoricalPredictors' — Categorical predictor flag
[] (default) | 'all'

Categorical predictor flag, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the following:

  • 'all' — All predictors are categorical.

  • [] — No predictors are categorical.

When you set CategoricalPredictors to 'all', the default Distance is 'hamming'.

Example: 'CategoricalPredictors','all'

'ClassNames' — Class names
numeric vector | categorical vector | logical vector | character array | cell array of strings

Class names, specified as the comma-separated pair consisting of 'ClassNames' and an array representing the class names. Use the same data type as the values that exist in y.

Use ClassNames to order the classes or to select a subset of classes for training. The default is the class names in y.

Data Types: single | double | char | logical | cell

'Cost' — Cost of misclassification
square matrix | structure

Cost of misclassification of a point, specified as the comma-separated pair consisting of 'Cost' and one of the following:

  • Square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i.

  • Structure S having two fields: S.ClassNames containing the group names as a variable of the same type as y, and S.ClassificationCosts containing the cost matrix.

The default is Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j.

Data Types: single | double | struct
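For illustration, here is a cost specification for a hypothetical two-class problem with classes 'low' and 'high', where misclassifying 'high' is five times as costly (the class names are invented, and X and y are assumed to hold the training data):

```matlab
% Matrix form: Cost(i,j) is the cost of predicting class j when the true class is i.
costMat = [0 1; 5 0];
mdl = fitcknn(X,y,'Cost',costMat);

% Equivalent structure form, tying the costs to explicit class names:
S.ClassNames = {'low';'high'};        % same type as y
S.ClassificationCosts = [0 1; 5 0];
mdl = fitcknn(X,y,'Cost',S);
```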

'Cov' — Covariance matrix
nancov(X) (default) | positive definite matrix of scalar values

Covariance matrix, specified as the comma-separated pair consisting of 'Cov' and a positive definite matrix of scalar values representing the covariance matrix when computing the Mahalanobis distance. This argument is only valid when 'Distance' is 'mahalanobis'.

You cannot simultaneously specify 'Standardize' and either of 'Scale' or 'Cov'.

Data Types: single | double

'CrossVal' — Cross-validation flag
'off' (default) | 'on'

Cross-validation flag, specified as the comma-separated pair consisting of 'CrossVal' and either 'on' or 'off'. If 'on', fitcknn creates a cross-validated model with 10 folds. Use the 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' name-value pair arguments to override this cross-validation setting. You can use only one of these parameters at a time to create a cross-validated model.

Alternatively, cross-validate mdl later using the crossval method.

Example: 'Crossval','on'

'CVPartition' — Cross-validated model partition
cvpartition object

Cross-validated model partition, specified as the comma-separated pair consisting of 'CVPartition' and an object created using cvpartition. You can only use one of these four options at a time to create a cross-validated model: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.
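For example (a sketch; X and y are assumed to hold the training data):

```matlab
% Hold out 20% of the observations, stratified by class, then
% train a cross-validated k-NN model on that partition.
c = cvpartition(y,'HoldOut',0.2);
cvmdl = fitcknn(X,y,'CVPartition',c);
```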

'Distance' — Distance metric
valid distance metric string | function handle

Distance metric, specified as the comma-separated pair consisting of 'Distance' and a valid distance metric string or function handle. The allowable strings depend on the NSMethod parameter, which you set in fitcknn, and which exists as a field in ModelParameters. If you specify CategoricalPredictors as 'all', then the default distance metric is 'hamming'. Otherwise, the default distance metric is 'euclidean'.

NSMethod — Distance Metric Names

  • 'exhaustive' — Any distance metric of ExhaustiveSearcher.

  • 'kdtree' — 'cityblock', 'chebychev', 'euclidean', or 'minkowski'.

For definitions, see Distance Metrics.

This table includes valid distance metrics of ExhaustiveSearcher.

Value — Description

  • 'cityblock' — City block distance.

  • 'chebychev' — Chebychev distance (maximum coordinate difference).

  • 'correlation' — One minus the sample linear correlation between observations (treated as sequences of values).

  • 'cosine' — One minus the cosine of the included angle between observations (treated as vectors).

  • 'euclidean' — Euclidean distance.

  • 'hamming' — Hamming distance, the percentage of coordinates that differ.

  • 'jaccard' — One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ.

  • 'mahalanobis' — Mahalanobis distance, computed using a positive definite covariance matrix C. The default value of C is the sample covariance matrix of X, as computed by nancov(X). To specify a different value for C, use the 'Cov' name-value pair argument.

  • 'minkowski' — Minkowski distance. The default exponent is 2. To specify a different exponent, use the 'Exponent' name-value pair argument.

  • 'seuclidean' — Standardized Euclidean distance. Each coordinate difference between X and a query point is scaled, meaning divided by a scale value S. The default value of S is the standard deviation computed from X, S = nanstd(X). To specify another value for S, use the 'Scale' name-value pair argument.

  • 'spearman' — One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

  • @distfun — Distance function handle. distfun has the form

    function D2 = DISTFUN(ZI,ZJ)
    % calculation of distance
    ...

    where:

      • ZI is a 1-by-N vector containing one row of X or Y.

      • ZJ is an M2-by-N matrix containing multiple rows of X or Y.

      • D2 is an M2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).

Example: 'Distance','minkowski'

Data Types: function_handle
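As an illustration, here is a hand-written city block metric passed as a function handle (a sketch; custom distance functions require 'NSMethod','exhaustive', and X and y are assumed to hold the training data):

```matlab
% Custom distance: city block (L1) distance between the 1-by-N row ZI
% and each of the M2 rows of ZJ; returns an M2-by-1 vector of distances.
cityblockFun = @(ZI,ZJ) sum(abs(bsxfun(@minus,ZJ,ZI)),2);
mdl = fitcknn(X,y,'NSMethod','exhaustive','Distance',cityblockFun);
```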

'DistanceWeight' — Distance weighting function
'equal' (default) | 'inverse' | 'squaredinverse' | function handle

Distance weighting function, specified as the comma-separated pair consisting of 'DistanceWeight' and either a function handle or one of the following strings specifying the distance weighting function.

DistanceWeight — Meaning

  • 'equal' — No weighting.

  • 'inverse' — Weight is 1/distance.

  • 'squaredinverse' — Weight is 1/distance^2.

  • @fcn — fcn is a function that accepts a matrix of nonnegative distances and returns a matrix of the same size containing nonnegative distance weights. For example, 'squaredinverse' is equivalent to @(d)d.^(-2).

Example: 'DistanceWeight','inverse'

Data Types: function_handle
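For example, an exponentially decaying weighting passed as a function handle (illustrative; any function mapping a matrix of nonnegative distances to nonnegative weights of the same size works):

```matlab
% Weight each neighbor by exp(-d): closer neighbors count more.
expWeight = @(d) exp(-d);
mdl = fitcknn(X,y,'NumNeighbors',5,'DistanceWeight',expWeight);
```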

'Exponent' — Minkowski distance exponent
2 (default) | positive scalar value

Minkowski distance exponent, specified as the comma-separated pair consisting of 'Exponent' and a positive scalar value. This argument is only valid when 'Distance' is 'minkowski'.

Example: 'Exponent',3

Data Types: single | double

'Holdout' — Fraction of data for holdout validation
0 (default) | scalar value in the range [0,1]

Fraction of data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range [0,1]. Holdout validation tests the specified fraction of the data, and uses the remaining data for training.

If you use Holdout, you cannot use any of the 'CVPartition', 'KFold', or 'Leaveout' name-value pair arguments.

Example: 'Holdout',0.1

Data Types: single | double

'IncludeTies' — Tie inclusion flag
false (default) | true

Tie inclusion flag, specified as the comma-separated pair consisting of 'IncludeTies' and a logical value indicating whether predict includes all the neighbors whose distance values are equal to the Kth smallest distance. If IncludeTies is true, predict includes all these neighbors. Otherwise, predict uses exactly K neighbors.

Example: 'IncludeTies',true

Data Types: logical

'KFold' — Number of folds
10 (default) | positive integer value

Number of folds to use in a cross-validated model, specified as the comma-separated pair consisting of 'KFold' and a positive integer value.

If you use 'KFold', you cannot use any of the 'CVPartition', 'Holdout', or 'Leaveout' name-value pair arguments.

Example: 'KFold',8

Data Types: single | double

'Leaveout' — Leave-one-out cross-validation flag
'off' (default) | 'on'

Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. Specify 'on' to use leave-one-out cross validation.

If you use 'Leaveout', you cannot use any of the 'CVPartition', 'Holdout', or 'KFold' name-value pair arguments.

Example: 'Leaveout','on'

'NSMethod' — Nearest neighbor search method
'kdtree' | 'exhaustive'

Nearest neighbor search method, specified as the comma-separated pair consisting of 'NSMethod' and 'kdtree' or 'exhaustive'.

  • 'kdtree' — Create and use a kd-tree to find nearest neighbors. 'kdtree' is valid when the distance metric is one of the following:

    • 'euclidean'

    • 'cityblock'

    • 'minkowski'

    • 'chebychev'

  • 'exhaustive' — Use the exhaustive search algorithm. To find nearest neighbors, the software computes the distance values from all points in X to each query point.

The default is 'kdtree' when X has 10 or fewer columns, X is not sparse, and the distance metric is a 'kdtree' type; otherwise, 'exhaustive'.

Example: 'NSMethod','exhaustive'

'NumNeighbors' — Number of nearest neighbors to find
1 (default) | positive integer value

Number of nearest neighbors in X to find for classifying each point when predicting, specified as the comma-separated pair consisting of 'NumNeighbors' and a positive integer value.

Example: 'NumNeighbors',3

Data Types: single | double

'PredictorNames' — Predictor variable names
{'x1','x2',...} (default) | cell array of strings

Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a cell array of strings containing the names for the predictor variables, in the order in which they appear in X.

Data Types: cell

'Prior' — Prior probabilities
'empirical' (default) | 'uniform' | vector of scalar values | structure

Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and one of the following.

  • A string:

    • 'empirical' determines class probabilities from class frequencies in y. If you pass observation weights, they are used to compute the class probabilities.

    • 'uniform' sets all class probabilities equal.

  • A vector (one scalar value for each class).

  • A structure S with two fields:

    • S.ClassNames containing the class names as a variable of the same type as y

    • S.ClassProbs containing a vector of corresponding probabilities

If you set values for both Weights and Prior, the weights are renormalized to add up to the value of the prior probability in the respective class.

Example: 'Prior','uniform'

Data Types: single | double | struct
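A sketch of the structure form, using the iris class names for illustration (X and y are assumed to hold the training data):

```matlab
% Structure form of 'Prior': class names and probabilities given explicitly.
S.ClassNames = {'setosa';'versicolor';'virginica'};  % same type as y
S.ClassProbs = [0.5 0.2 0.3];                        % one probability per class
mdl = fitcknn(X,y,'Prior',S);
```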

'ResponseName' — Response variable name
'Y' (default) | string

Response variable name, specified as the comma-separated pair consisting of 'ResponseName' and a string containing the name of the response variable y.

Example: 'ResponseName','Response'

Data Types: char

'Scale' — Distance scale
nanstd(X) (default) | vector of nonnegative scalar values

Distance scale, specified as the comma-separated pair consisting of 'Scale' and a vector containing nonnegative scalar values with length equal to the number of columns in X. Each coordinate difference between X and a query point is scaled by the corresponding element of Scale. This argument is only valid when 'Distance' is 'seuclidean'.

You cannot simultaneously specify 'Standardize' and either of 'Scale' or 'Cov'.

Data Types: single | double

'Weights' — Observation weights
ones(size(X,1),1) (default) | vector of scalar values

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a vector of scalar values. The length of Weights is the number of rows in X.

The software normalizes the weights in each class to add up to the value of the prior probability of the class.

Data Types: single | double

Output Arguments


mdl — Classifier model
classifier model object

k-nearest neighbor classifier model, returned as a classifier model object.

Note that using the 'CrossVal', 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' options results in a model of class ClassificationPartitionedModel. You cannot use a partitioned model for prediction on new data, so this kind of model does not have a predict method; use its kfoldPredict method to obtain predictions for the held-out observations instead.

Otherwise, mdl is of class ClassificationKNN, and you can use the predict method to make predictions.

Definitions

Prediction

ClassificationKNN predicts the classification of a point Xnew using a procedure equivalent to this:

  1. Find the NumNeighbors points in the training set X that are nearest to Xnew.

  2. Find the NumNeighbors response values Y to those nearest points.

  3. Assign the classification label Ynew that has the largest posterior probability among the values in Y.

For details, see Posterior Probability in the predict documentation.
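The three steps above can be sketched by hand for a single query point. This is an illustration of the equal-prior, unweighted, euclidean special case, in which the largest posterior probability reduces to a majority vote; it is not the actual predict implementation:

```matlab
% Naive k-NN classification of one query row xnew.
% X is the training matrix, Y the labels (cell array), k = NumNeighbors.
d = sqrt(sum(bsxfun(@minus,X,xnew).^2,2)); % 1. distance from xnew to every row of X
[~,idx] = sort(d);                         %    order training points by distance
nearest = Y(idx(1:k));                     % 2. labels of the k nearest points
ynew = mode(categorical(nearest));         % 3. most frequent label wins
```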

Examples


Train a k-Nearest Neighbor Classifier

Construct a k-nearest neighbor classifier for Fisher's iris data, where k, the number of nearest neighbors in the predictors, is 5.

Load Fisher's iris data.

load fisheriris
X = meas;
Y = species;

X is a numeric matrix that contains four flower measurements (sepal length, sepal width, petal length, and petal width) for 150 irises. Y is a cell array of strings that contains the corresponding iris species.

Train a 5-nearest neighbors classifier. It is good practice to standardize noncategorical predictor data.

Mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1)
Mdl = 

  ClassificationKNN
     PredictorNames: {'x1'  'x2'  'x3'  'x4'}
       ResponseName: 'Y'
         ClassNames: {'setosa'  'versicolor'  'virginica'}
     ScoreTransform: 'none'
    NumObservations: 150
           Distance: 'euclidean'
       NumNeighbors: 5


Mdl is a trained ClassificationKNN classifier, and some of its properties display in the Command Window.

To access the properties of Mdl, use dot notation.

Mdl.ClassNames
Mdl.Prior
ans = 

    'setosa'
    'versicolor'
    'virginica'


ans =

    0.3333    0.3333    0.3333

Mdl.Prior contains the class prior probabilities, which are settable using the name-value pair argument 'Prior' in fitcknn. The order of the class prior probabilities corresponds to the order of the classes in Mdl.ClassNames. By default, the prior probabilities are the respective relative frequencies of the classes in the data.

You can also reset the prior probabilities after training. For example, set the prior probabilities to 0.5, 0.2, and 0.3 respectively.

Mdl.Prior = [0.5 0.2 0.3];

You can pass Mdl to, for example, the predict method to label new measurements, or the crossval method to cross-validate the classifier.

Train a k-Nearest Neighbor Classifier Using the Minkowski Metric

Load Fisher's iris data set.

load fisheriris
X = meas;
Y = species;

X is a numeric matrix that contains four flower measurements (sepal length, sepal width, petal length, and petal width) for 150 irises. Y is a cell array of strings that contains the corresponding iris species.

Train a 3-nearest neighbors classifier using the Minkowski metric. To use the Minkowski metric, you must use an exhaustive searcher. It is good practice to standardize noncategorical predictor data.

Mdl = fitcknn(X,Y,'NumNeighbors',3,...
    'NSMethod','exhaustive','Distance','minkowski',...
    'Standardize',1);

Mdl is a ClassificationKNN classifier.

You can examine the properties of Mdl by double-clicking Mdl in the Workspace window. This opens the Variable Editor.
