knnclassify

Classify data using nearest neighbor method

Syntax

Class = knnclassify(Sample, Training, Group)
Class = knnclassify(Sample, Training, Group, k)
Class = knnclassify(Sample, Training, Group, k, distance)
Class = knnclassify(Sample, Training, Group, k, distance, rule)

Arguments

Sample
    Matrix whose rows will be classified into groups. Sample must have the same number of columns as Training.

Training
    Matrix used to group the rows in the matrix Sample. Training must have the same number of columns as Sample. Each row of Training belongs to the group whose value is the corresponding entry of Group.

Group
    Vector whose distinct values define the grouping of the rows in Training.

k
    The number of nearest neighbors used in the classification. Default is 1.

distance
    String specifying the distance metric. Choices are:

      • 'euclidean' — Euclidean distance (default)

      • 'cityblock' — Sum of absolute differences

      • 'cosine' — One minus the cosine of the included angle between points (treated as vectors)

      • 'correlation' — One minus the sample correlation between points (treated as sequences of values)

      • 'hamming' — Percentage of bits that differ (suitable only for binary data)

rule
    String specifying the rule used to decide how to classify the sample. Choices are:

      • 'nearest' — Majority rule with nearest point tie-break (default)

      • 'random' — Majority rule with random point tie-break

      • 'consensus' — Consensus rule

Description

Class = knnclassify(Sample, Training, Group) classifies the rows of the data matrix Sample into groups, based on the grouping of the rows of Training. Sample and Training must be matrices with the same number of columns, and Training and Group must have the same number of rows. Group is a vector whose distinct values define the grouping of the rows in Training; each row of Training belongs to the group whose value is the corresponding entry of Group. knnclassify assigns each row of Sample to the group of the closest row of Training.

Group can be a numeric vector, a string array, or a cell array of strings. knnclassify treats NaNs or empty strings in Group as missing values and ignores the corresponding rows of Training. The output Class indicates the group to which each row of Sample has been assigned, and is of the same type as Group.
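
For example, a minimal sketch (with made-up data) showing that string labels work the same way as numeric ones, and that the output matches the type of Group:

training = [0 0; .1 .2; 1 1];
group    = {'small'; 'small'; 'large'};  % cell array of strings
sample   = [.05 .05; .9 .8];
class = knnclassify(sample, training, group)
% class is a cell array of strings, one label per row of sample:
%     'small'
%     'large'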

Class = knnclassify(Sample, Training, Group, k) enables you to specify k, the number of nearest neighbors used in the classification. Default is 1.

Class = knnclassify(Sample, Training, Group, k, distance) enables you to specify the distance metric (a usage sketch follows the list). Choices for distance are:

  • 'euclidean' — Euclidean distance (default)

  • 'cityblock' — Sum of absolute differences

  • 'cosine' — One minus the cosine of the included angle between points (treated as vectors)

  • 'correlation' — One minus the sample correlation between points (treated as sequences of values)

  • 'hamming' — Percentage of bits that differ (suitable only for binary data)
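
For example, a minimal sketch (made-up data) that switches the metric to 'cityblock':

training = [0 0; .5 .5; 1 1];
group    = [1; 2; 3];
class = knnclassify([.9 .8], training, group, 1, 'cityblock')
% class = 3; the city block distances to the three training rows are 1.7, 0.7, and 0.3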

Class = knnclassify(Sample, Training, Group, k, distance, rule) enables you to specify the rule used to decide how to classify the sample. Choices for rule are:

  • 'nearest' — Majority rule with nearest point tie-break (default)

  • 'random' — Majority rule with random point tie-break

  • 'consensus' — Consensus rule

The default behavior is to use majority rule: a sample point is assigned to the class that the majority of its k nearest neighbors belong to. Use 'consensus' to require a consensus instead of majority rule. With the 'consensus' option, points where not all of the k nearest neighbors are from the same class are not assigned to any class; instead, the output Class for these points is NaN for numeric groups, '' for string-named groups, or undefined for categorical groups.

When classifying into more than two groups, or when using an even value for k, it might be necessary to break a tie in the number of nearest neighbors. Options are 'random', which selects a random tiebreaker, and 'nearest', which uses the nearest neighbor among the tied groups to break the tie. The default behavior is majority rule with nearest tie-break.
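
The difference is easiest to see on a point whose neighbors disagree. A minimal sketch (made-up data) where 'consensus' declines to classify:

training = [0 0; 0 1; 1 0; 1 1];
group    = [1; 1; 2; 2];
sample   = [.4 .4];
cMajority  = knnclassify(sample, training, group, 3)
% cMajority = 1: two of the three nearest neighbors are in group 1
cConsensus = knnclassify(sample, training, group, 3, 'euclidean', 'consensus')
% cConsensus = NaN: the three nearest neighbors do not all agree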

Examples

Classifying Rows

The following example classifies the rows of the matrix sample:

sample = [.9 .8;.1 .3;.2 .6]

sample =
    0.9000    0.8000
    0.1000    0.3000
    0.2000    0.6000

training = [0 0;.5 .5;1 1]

training =
         0         0
    0.5000    0.5000
    1.0000    1.0000

group = [1;2;3]

group =
     1
     2
     3

class = knnclassify(sample, training, group)

class =
     3
     1
     2

Row 1 of sample is closest to row 3 of training, so class(1) = 3. Row 2 of sample is closest to row 1 of training, so class(2) = 1. Row 3 of sample is closest to row 2 of training, so class(3) = 2.
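
You can confirm these assignments by computing the distances directly; a quick check in plain MATLAB:

for i = 1:size(sample,1)
    % Euclidean distance from sample row i to every training row
    d = sqrt(sum((training - repmat(sample(i,:), size(training,1), 1)).^2, 2));
    [dmin, nearest] = min(d);
    fprintf('sample row %d is closest to training row %d (distance %.3f)\n', ...
            i, nearest, dmin);
end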

Classifying Rows into One of Two Groups

The following example classifies each row of the data in sample into one of the two groups in training. The following commands create the matrix training and the grouping variable group, and plot the rows of training in two groups.

training = [mvnrnd([ 1  1],   eye(2), 100); ...
            mvnrnd([-1 -1], 2*eye(2), 100)];
group = [repmat(1,100,1); repmat(2,100,1)];
gscatter(training(:,1),training(:,2),group,'rb','+x');
legend('Training group 1', 'Training group 2');
hold on;

The following commands create the matrix sample, classify its rows into two groups, and plot the result.

sample = unifrnd(-5, 5, 100, 2);
% Classify the sample using the nearest neighbor classification
c = knnclassify(sample, training, group);
gscatter(sample(:,1),sample(:,2),c,'mc'); hold on;
legend('Training group 1','Training group 2', ...
       'Data in group 1','Data in group 2');
hold off; 
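
Because sample is drawn uniformly at random, the split between the two groups varies from run to run. A small follow-up sketch that tallies the assignments:

% Count how many sample rows were assigned to each group
counts = [sum(c == 1) sum(c == 2)]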

Classifying Rows Using the Three Nearest Neighbors

The following example uses the same data as in Classifying Rows into One of Two Groups, but classifies the rows of sample using three nearest neighbors instead of one.

gscatter(training(:,1),training(:,2),group,'rb','+x');
hold on;
c3 = knnclassify(sample, training, group, 3);
gscatter(sample(:,1),sample(:,2),c3,'mc','o');
legend('Training group 1','Training group 2','Data in group 1','Data in group 2');

If you compare this plot with the one in Classifying Rows into One of Two Groups, you see that some of the data points are classified differently using three nearest neighbors.
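
To quantify the difference rather than eyeball it, compare the two label vectors directly (reusing c from the previous example and c3 from this one):

% Rows of sample that change class when k goes from 1 to 3
changed = find(c ~= c3);
fprintf('%d of %d points classified differently\n', numel(changed), numel(c));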

References

[1] Mitchell, T. (1997). Machine Learning (McGraw-Hill).
