knnimpute

Impute missing data using nearest-neighbor method

Syntax

knnimpute(Data)
knnimpute(Data, k)

knnimpute(..., 'Distance', DistanceValue, ...)
knnimpute(..., 'DistArgs', DistArgsValue, ...)
knnimpute(..., 'Weights', WeightsValues, ...)
knnimpute(..., 'Median', MedianValue, ...)

Arguments

Data A matrix in which each missing value is represented by NaN.
k The number of nearest neighbors used. The default is 1.

Description

knnimpute(Data) replaces NaNs in Data with the corresponding value from the nearest-neighbor column. The nearest-neighbor column is the closest column in Euclidean distance. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used.

knnimpute(Data, k) replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.
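The inverse-distance weighting described above can be sketched by hand. This is only an illustration: the neighbor values and distances are made-up numbers, and the normalization shown (weights scaled to sum to 1) is an assumption, not a statement of how knnimpute is implemented internally.

```matlab
% Hypothetical inputs: values taken from the k = 2 nearest-neighbor columns
% at the missing position, and the distances to those columns.
neighborValues = [2 6];
distances      = [1 3];

% Weights inversely proportional to distance, scaled to sum to 1 (assumed).
w = 1 ./ distances;
w = w / sum(w);                      % [0.75 0.25]

% Weighted mean used as the imputed value.
imputed = sum(w .* neighborValues);  % 0.75*2 + 0.25*6 = 3
```

The closer neighbor (distance 1) contributes three times the weight of the farther one (distance 3), so the result lies nearer to its value.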

knnimpute(..., 'PropertyName', PropertyValue, ...) calls knnimpute with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:


knnimpute(..., 'Distance', DistanceValue, ...) computes nearest-neighbor columns using the distance metric specified by DistanceValue. The choices for DistanceValue are:

'euclidean'       Euclidean distance (default).
'seuclidean'      Standardized Euclidean distance: each coordinate in the sum of squares is inversely weighted by the sample variance of that coordinate.
'cityblock'       City block distance.
'mahalanobis'     Mahalanobis distance.
'minkowski'       Minkowski distance with exponent 2.
'cosine'          One minus the cosine of the included angle.
'correlation'     One minus the sample correlation between observations, treated as sequences of values.
'hamming'         Hamming distance: the percentage of coordinates that differ.
'jaccard'         One minus the Jaccard coefficient: the percentage of nonzero coordinates that differ.
'chebychev'       Chebychev distance (maximum coordinate difference).
function handle   A handle to a distance function, specified using @, for example, @distfun.

See pdist for more details.
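For the function handle option, a custom distance might look like the sketch below. It assumes the pdist convention for custom distance functions: the function receives one observation as a 1-by-n row vector and m other observations as an m-by-n matrix, and returns an m-by-1 vector of distances. The name cityblockSketch is hypothetical and used only for illustration.

```matlab
% Hypothetical custom distance following the pdist convention (assumed):
% xi is a 1-by-n observation, xj is an m-by-n matrix of observations,
% and the result is an m-by-1 vector of distances from xi to each row of xj.
cityblockSketch = @(xi, xj) sum(abs(bsxfun(@minus, xj, xi)), 2);

% Quick check on inputs that are easy to verify by hand.
d = cityblockSketch([0 0], [1 2; 3 -4]);   % [3; 7]
```

A handle like this would then be supplied as knnimpute(Data, k, 'Distance', cityblockSketch).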

knnimpute(..., 'DistArgs', DistArgsValue, ...) passes arguments (DistArgsValue) to the function distfun. DistArgsValue can be a single value or a cell array of values.

knnimpute(..., 'Weights', WeightsValues, ...) lets you specify the weights used in the weighted mean calculation. WeightsValues must be a vector of length k.

knnimpute(..., 'Median', MedianValue, ...) uses the median of the k nearest neighbors instead of the weighted mean when MedianValue is true.
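The property name/property value pairs above might be used as in the following sketch. It reuses the small matrix from Example 1 on this page and requires Bioinformatics Toolbox; the weight values are arbitrary choices for illustration, and no outputs are shown because they depend on the options.

```matlab
% Matrix from Example 1; A(3,1) is missing.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Two nearest neighbors, city block distance instead of Euclidean.
B1 = knnimpute(A, 2, 'Distance', 'cityblock');

% Two nearest neighbors with user-supplied weights (vector of length k = 2).
B2 = knnimpute(A, 2, 'Weights', [0.75 0.25]);

% Median of the two nearest neighbors instead of the weighted mean.
B3 = knnimpute(A, 2, 'Median', true);
```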

Examples

Example 1

A = [1 2 5;4 5 7;NaN -1 8;7 6 0]

A =

     1     2     5
     4     5     7
   NaN    -1     8
     7     6     0

Note that A(3,1) = NaN. Because column 2 is the closest column to column 1 in Euclidean distance, knnimpute replaces A(3,1) with the corresponding entry of column 2, A(3,2) = -1.

knnimpute(A)

ans =

     1     2     5
     4     5     7
    -1    -1     8
     7     6     0
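The claim that column 2 is the nearest neighbor of column 1 can be checked by hand. The sketch below assumes that distances between columns are measured only over rows that contain no NaN; that assumption is made here for illustration and is not stated on this page.

```matlab
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Assumption: compare columns only over rows that contain no NaN.
completeRows = ~any(isnan(A), 2);

% Euclidean distance from column 1 to columns 2 and 3.
d12 = norm(A(completeRows, 1) - A(completeRows, 2));   % sqrt(3),  about 1.73
d13 = norm(A(completeRows, 1) - A(completeRows, 3));   % sqrt(74), about 8.60
```

Because d12 is smaller than d13, column 2 supplies the replacement value.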

Example 2

The following example loads the data set yeastdata and imputes missing values in the array yeastvalues:

load yeastdata
% Remove data for empty spots
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
% Impute missing values
imputedValues = knnimpute(yeastvalues);

References

[1] Speed, T. (2003). Statistical Analysis of Gene Expression Microarray Data (Chapman & Hall/CRC).

[2] Hastie, T., Tibshirani, R., Sherlock, G., Eisen, M., Brown, P., and Botstein, D. (1999). Imputing missing data for gene expression arrays. Technical Report, Division of Biostatistics, Stanford University.

[3] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525.
