knnimpute

Impute missing data using nearest-neighbor method

Syntax

knnimpute(Data)
knnimpute(Data, k)
knnimpute(..., 'Distance', DistanceValue, ...)
knnimpute(..., 'DistArgs', DistArgsValue, ...)
knnimpute(..., 'Weights', WeightsValues, ...)
knnimpute(..., 'Median', MedianValue, ...)

Arguments

Data    Matrix
k       Positive integer specifying the number of nearest neighbors used

Description

knnimpute(Data) replaces NaNs in Data with the corresponding value from the nearest-neighbor column. The nearest-neighbor column is the closest column in Euclidean distance. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used.
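The single-nearest-neighbor rule described above can be sketched in base MATLAB as follows. This is an illustrative sketch, not the toolbox implementation: it assumes that column distances are computed only over rows that contain no NaNs, and the variable names (completeRows, D, order) are chosen here for illustration.

```matlab
% Sketch of single-nearest-neighbor column imputation (base MATLAB only).
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];
B = A;                                  % imputed copy of the data

% Assumption: distances use only the rows that contain no NaNs.
completeRows = ~any(isnan(A), 2);
C = A(completeRows, :);

% Pairwise Euclidean distances between columns.
nCols = size(A, 2);
D = zeros(nCols);
for i = 1:nCols
    for j = 1:nCols
        D(i, j) = sqrt(sum((C(:, i) - C(:, j)).^2));
    end
end

% For each NaN, take the value from the nearest column that has one.
[rows, cols] = find(isnan(A));
for n = 1:numel(rows)
    r = rows(n);
    c = cols(n);
    d = D(c, :);
    d(c) = Inf;                         % a column is not its own neighbor
    [~, order] = sort(d);               % nearest column first, own column last
    for j = order(1:end-1)
        if ~isnan(A(r, j))              % otherwise fall through to the next nearest
            B(r, c) = A(r, j);
            break
        end
    end
end
disp(B)
```

For this matrix, column 2 is the nearest neighbor of column 1, so the sketch fills the missing entry with -1.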

knnimpute(Data, k) replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.
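As a rough illustration of inverse-distance weighting, the arithmetic for one missing entry with k = 2 might look like the sketch below. The neighbor values and distances are taken from the matrix [1 2 5; 4 5 7; NaN -1 8; 7 6 0], assuming distances are computed over the rows without NaNs. Normalizing the weights to sum to 1 is also an assumption of this sketch, and the result is not claimed to match the toolbox output exactly.

```matlab
% Values of the two neighboring columns in the row being imputed,
% and the Euclidean distances from the target column to those neighbors.
neighborValues = [-1 8];
distances      = [sqrt(3) sqrt(74)];

w = 1 ./ distances;                     % weights inversely proportional to distance
w = w / sum(w);                         % normalize so the weights sum to 1
imputedValue = sum(w .* neighborValues);
fprintf('%.4f\n', imputedValue);        % approximately 0.5084
```

The closer column dominates: it receives a weight of about 0.83, so the result lies much nearer to -1 than to 8.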

knnimpute(..., 'PropertyName', PropertyValue, ...) calls knnimpute with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:

knnimpute(..., 'Distance', DistanceValue, ...) computes nearest-neighbor columns using the distance metric specified by DistanceValue. The choices for DistanceValue are:

'euclidean'        Euclidean distance (default).
'seuclidean'       Standardized Euclidean distance. Each coordinate in the sum of squares is inversely weighted by the sample variance of that coordinate.
'cityblock'        City block distance.
'mahalanobis'      Mahalanobis distance.
'minkowski'        Minkowski distance with exponent 2.
'cosine'           One minus the cosine of the included angle.
'correlation'      One minus the sample correlation between observations, treated as sequences of values.
'hamming'          Hamming distance: the percentage of coordinates that differ.
'jaccard'          One minus the Jaccard coefficient: the percentage of nonzero coordinates that differ.
'chebychev'        Chebychev distance (maximum coordinate difference).
function handle    A handle to a distance function, specified using @, for example, @distfun.

See pdist for more details.
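A minimal usage sketch for the 'Distance' property, using a small matrix chosen here for illustration. It requires knnimpute and its dependencies to be installed, and no particular output is asserted.

```matlab
% Impute with the single nearest column under city block distance.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];
B = knnimpute(A, 1, 'Distance', 'cityblock');
disp(B)
```

Entries of A that are not NaN should appear unchanged in B. Only the missing entry is filled.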

knnimpute(..., 'DistArgs', DistArgsValue, ...) passes the arguments in DistArgsValue to the distance function. DistArgsValue can be a single value or a cell array of values.
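A hedged sketch of 'DistArgs'. It assumes the value is forwarded to the distance computation in the same way pdist accepts an exponent for 'minkowski'. That forwarding is an assumption here, so check it against your release before relying on it.

```matlab
% Assumption: DistArgsValue is forwarded as the Minkowski exponent (here 3).
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];
B = knnimpute(A, 1, 'Distance', 'minkowski', 'DistArgs', 3);
disp(B)
```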

knnimpute(..., 'Weights', WeightsValues, ...) lets you specify the weights used in the weighted mean calculation. WeightsValues should be a vector of length k.

knnimpute(..., 'Median', MedianValue, ...), when MedianValue is true, uses the median of the k nearest neighbors instead of the weighted mean.
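The 'Weights' and 'Median' properties might be used as in the sketch below. The weight vector [0.7 0.3] is an arbitrary illustration with one weight per neighbor for k = 2. Whether the first weight applies to the nearest neighbor is assumed here, not confirmed.

```matlab
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Weighted mean of the 2 nearest columns with user-supplied weights.
Bw = knnimpute(A, 2, 'Weights', [0.7 0.3]);

% Median of the 2 nearest columns instead of a weighted mean.
Bm = knnimpute(A, 2, 'Median', true);

disp(Bw)
disp(Bm)
```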

Examples

Example 1

A = [1 2 5;4 5 7;NaN -1 8;7 6 0]

A =

     1     2     5
     4     5     7
   NaN    -1     8
     7     6     0

Note that A(3,1) = NaN. Because column 2 is the closest column to column 1 in Euclidean distance, knnimpute replaces A(3,1) with the corresponding entry of column 2, which is -1.

knnimpute(A)

ans =

     1     2     5
     4     5     7
    -1    -1     8
     7     6     0

Example 2

The following example loads the data set yeastdata and imputes missing values in the array yeastvalues:

load yeastdata
% Remove data for empty spots
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
% Impute missing values
imputedValues = knnimpute(yeastvalues);
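One way to check the effect of the imputation is to count the NaNs before and after. The sketch below repeats the steps above so that it runs on its own. The variable names nanBefore and nanAfter are illustrative, and the counts depend on the data set, so none are quoted here.

```matlab
load yeastdata
% Remove data for empty spots
emptySpots = strcmp('EMPTY', genes);
yeastvalues(emptySpots, :) = [];
genes(emptySpots) = [];

% Count missing values before and after imputation
nanBefore = sum(isnan(yeastvalues(:)));
imputedValues = knnimpute(yeastvalues);
nanAfter = sum(isnan(imputedValues(:)));
fprintf('NaNs before: %d, after: %d\n', nanBefore, nanAfter);
```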

Introduced before R2006a

Was this topic helpful?