Bioinformatics Toolbox™
knnimpute(Data)
knnimpute(Data, k)
knnimpute(..., 'Distance', DistanceValue, ...)
knnimpute(..., 'DistArgs', DistArgsValue, ...)
knnimpute(..., 'Weights', WeightsValues, ...)
knnimpute(..., 'Median', MedianValue, ...)
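| Data | Matrix of data. Missing values are marked as NaN. |
| k | Number of nearest-neighbor columns used for imputation. When k is omitted, the single nearest column is used. |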
knnimpute(Data) replaces NaNs in Data with the corresponding value from the nearest-neighbor column. The nearest-neighbor column is the closest column in Euclidean distance. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used.
knnimpute(Data, k) replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.
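The inverse-distance weighting described above can be sketched by hand. This is a minimal illustration, not the toolbox's internal code: the variable names and the sample values are hypothetical, and normalizing the weights to sum to 1 is an assumption about how the weighted mean is formed.

```matlab
% Hypothetical inputs: the entries of the k = 2 nearest columns in the row
% being imputed, and the distances of those columns from the target column.
neighborValues = [-1 8];
neighborDists  = [sqrt(3) sqrt(74)];

% Assumed weighting: inversely proportional to distance, normalized to sum to 1.
w = 1 ./ neighborDists;
w = w / sum(w);

% Weighted mean of the neighbor values; closer columns contribute more.
imputed = sum(w .* neighborValues);
```

A neighbor at distance zero would produce an infinite weight in this sketch, so a real implementation needs to handle that case separately.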
knnimpute(..., 'PropertyName', PropertyValue, ...) calls knnimpute with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:
knnimpute(..., 'Distance', DistanceValue, ...) computes nearest-neighbor columns using the distance metric specified by DistanceValue.
The choices for DistanceValue are:
| 'euclidean' | Euclidean distance (default). |
| 'seuclidean' | Standardized Euclidean distance — each coordinate in the sum of squares is inversely weighted by the sample variance of that coordinate. |
| 'cityblock' | City block distance. |
| 'mahalanobis' | Mahalanobis distance. |
| 'minkowski' | Minkowski distance with exponent 2. |
| 'cosine' | One minus the cosine of the included angle. |
| 'correlation' | One minus the sample correlation between observations, treated as sequences of values. |
| 'hamming' | Hamming distance — the percentage of coordinates that differ. |
| 'jaccard' | One minus the Jaccard coefficient — the percentage of nonzero coordinates that differ. |
| 'chebychev' | Chebychev distance (maximum coordinate difference). |
| function handle | A handle to a distance function, specified using @, for example, @distfun. |
See pdist for more details.
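As a brief usage sketch, the call below selects one of the named metrics from the table. It reuses the small matrix from the example later on this page and passes k explicitly; it requires the Bioinformatics Toolbox.

```matlab
% Small matrix with one missing entry at (3,1)
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Nearest-neighbor imputation (k = 1) using city block distance
% instead of the default Euclidean distance
B = knnimpute(A, 1, 'Distance', 'cityblock');
```

Only the missing entry should change; the choice of metric affects which column is treated as nearest.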
knnimpute(..., 'DistArgs', DistArgsValue, ...) passes the arguments in DistArgsValue to the distance function specified by the 'Distance' property. DistArgsValue can be a single value or a cell array of values.
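A possible use of 'DistArgs' is to supply a parameter that the chosen metric accepts. The sketch below assumes that DistArgsValue is forwarded to the distance computation in the same way as the extra argument of pdist, so that a value of 3 acts as the Minkowski exponent. That forwarding behavior is an assumption and should be checked against your release.

```matlab
% Small matrix with one missing entry at (3,1)
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Assumption: the value 3 is passed through as the Minkowski exponent
B = knnimpute(A, 1, 'Distance', 'minkowski', 'DistArgs', 3);
```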
knnimpute(..., 'Weights', WeightsValues, ...) lets you specify the weights used in the weighted mean calculation. WeightsValues must be a vector of length k.
knnimpute(..., 'Median', MedianValue, ...) uses the median of the k nearest neighbors instead of the weighted mean when MedianValue is true.
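The 'Weights' and 'Median' properties can be tried on the same small matrix. This is a usage sketch only: the weight values are arbitrary, and the assumption that the weights apply to the neighbors in order from nearest to farthest is not stated in the description above.

```matlab
% Small matrix with one missing entry at (3,1)
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Weighted mean of the two nearest columns with user-supplied weights
% (assumed order: nearest neighbor first)
Bw = knnimpute(A, 2, 'Weights', [0.75 0.25]);

% Median of the two nearest columns instead of a weighted mean
Bm = knnimpute(A, 2, 'Median', true);
```

Because the median ignores the distances, it is less sensitive to a single outlying neighbor than the weighted mean.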
A = [1 2 5;4 5 7;NaN -1 8;7 6 0]
A =
1 2 5
4 5 7
NaN -1 8
7 6 0
Note that A(3,1) = NaN. Because column 2 is the closest column to column 1 in Euclidean distance, knnimpute imputes the (3,1) entry of column 1 to be the corresponding entry of column 2, which is -1.
knnimpute(A)
ans =
1 2 5
4 5 7
-1 -1 8
7 6 0
The following example loads the data set yeastdata and imputes missing values in the array yeastvalues:
load yeastdata
% Remove data for empty spots
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
% Impute missing values
imputedValues = knnimpute(yeastvalues);
[1] Speed, T. (2003). Statistical Analysis of Gene Expression Microarray Data (Chapman & Hall/CRC).
[2] Hastie, T., Tibshirani, R., Sherlock, G., Eisen, M., Brown, P., and Botstein, D. (1999). "Imputing missing data for gene expression arrays", Technical Report, Division of Biostatistics, Stanford University.
[3] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525.
Bioinformatics Toolbox™ function: knnclassify
MATLAB® function: isnan
Statistics Toolbox functions: nanmean, nanmedian, pdist
© 1984-2008 The MathWorks, Inc.