This example shows how to find, clean, and delete observations with missing data in a dataset array.
Import the data from the spreadsheet messy.xlsx.
messyData = dataset('XLSFile',fullfile(matlabroot,'help/toolbox/stats/examples','messy.xlsx'))
messyData =
var1 var2 var3 var4 var5
'afe1' '3' 'yes' '3' 3
'egh3' '.' 'no' '7' 7
'wth4' '3' 'yes' '3' 3
'atn2' '23' 'no' '23' 23
'arg1' '5' 'yes' '5' 5
'jre3' '34.6' 'yes' '34.6' 34.6
'wen9' '234' 'yes' '234' 234
'ple2' '2' 'no' '2' 2
'dbo8' '5' 'no' '5' 5
'oii4' '5' 'yes' '5' 5
'wnk3' '245' 'yes' '245' 245
'abk6' '563' '' '563' 563
'pnj5' '463' 'no' '463' 463
'wnn3' '6' 'no' '6' 6
'oks9' '23' 'yes' '23' 23
'wba3' '' 'yes' 'NaN' 14
'pkn4' '2' 'no' '2' 2
'adw3' '22' 'no' '22' 22
'poj2' '-99' 'yes' '-99' -99
'bas8' '23' 'no' '23' 23
'gry5' 'NA' 'yes' 'NaN' 21When you import data from a spreadsheet, dataset reads any
variables with nonnumeric elements as a cell array of character vectors. This is
why the variable var2 is a cell array of character vectors.
When importing data from a text file, you have more flexibility to specify which
nonnumeric expressions to treat as missing using the option
TreatAsEmpty.
There are many different missing data indicators in
messy.xlsx, such as:
Empty cells
A period (.)
NA
NaN
-99
Display the subset of observations that have at least one missing
value using ismissing.
ix = ismissing(messyData,'NumericTreatAsMissing',-99,... 'StringTreatAsMissing',{'NaN','.','NA'}); messyData(any(ix,2),:)
ans =
var1 var2 var3 var4 var5
'egh3' '.' 'no' '7' 7
'abk6' '563' '' '563' 563
'wba3' '' 'yes' 'NaN' 14
'poj2' '-99' 'yes' '-99' -99
'gry5' 'NA' 'yes' 'NaN' 21
By default, ismissing recognizes the following
missing value indicators:
NaN for numeric arrays
'' for character arrays
<undefined> for categorical
arrays
Use the NumericTreatAsMissing and StringTreatAsMissing options
to specify other values to treat as missing.
You can convert the char variables that should
be numeric using str2double.
messyData.var2 = str2double(messyData.var2); messyData.var4 = str2double(messyData.var4)
messyData =
var1 var2 var3 var4 var5
'afe1' 3 'yes' 3 3
'egh3' NaN 'no' 7 7
'wth4' 3 'yes' 3 3
'atn2' 23 'no' 23 23
'arg1' 5 'yes' 5 5
'jre3' 34.6 'yes' 34.6 34.6
'wen9' 234 'yes' 234 234
'ple2' 2 'no' 2 2
'dbo8' 5 'no' 5 5
'oii4' 5 'yes' 5 5
'wnk3' 245 'yes' 245 245
'abk6' 563 '' 563 563
'pnj5' 463 'no' 463 463
'wnn3' 6 'no' 6 6
'oks9' 23 'yes' 23 23
'wba3' NaN 'yes' NaN 14
'pkn4' 2 'no' 2 2
'adw3' 22 'no' 22 22
'poj2' -99 'yes' -99 -99
'bas8' 23 'no' 23 23
'gry5' NaN 'yes' NaN 21var2 and var4 are
numeric arrays. During the conversion, str2double replaces
the nonnumeric elements of the variables var2 and var4 with
the value NaN. However, there are no changes to
the numeric missing value indicator, -99.When applying the same function to many dataset array variables,
it can sometimes be more convenient to use datasetfun.
For example, to convert both var2 and var4 to
numeric arrays simultaneously, you can use:
messyData(:,[2,4]) = datasetfun(@str2double,messyData, ...
'DataVars',[2,4],'DatasetOutput',true);Clean the data so that the missing values indicated by the code -99 have
the standard MATLAB® numeric missing value indicator, NaN.
messyData = replaceWithMissing(messyData,'NumericValues',-99)messyData =
var1 var2 var3 var4 var5
'afe1' 3 'yes' 3 3
'egh3' NaN 'no' 7 7
'wth4' 3 'yes' 3 3
'atn2' 23 'no' 23 23
'arg1' 5 'yes' 5 5
'jre3' 34.6 'yes' 34.6 34.6
'wen9' 234 'yes' 234 234
'ple2' 2 'no' 2 2
'dbo8' 5 'no' 5 5
'oii4' 5 'yes' 5 5
'wnk3' 245 'yes' 245 245
'abk6' 563 '' 563 563
'pnj5' 463 'no' 463 463
'wnn3' 6 'no' 6 6
'oks9' 23 'yes' 23 23
'wba3' NaN 'yes' NaN 14
'pkn4' 2 'no' 2 2
'adw3' 22 'no' 22 22
'poj2' NaN 'yes' NaN NaN
'bas8' 23 'no' 23 23
'gry5' NaN 'yes' NaN 21Create a new dataset array that contains only the complete observations—those without missing data.
ix = ismissing(messyData); completeData = messyData(~any(ix,2),:)
completeData =
var1 var2 var3 var4 var5
'afe1' 3 'yes' 3 3
'wth4' 3 'yes' 3 3
'atn2' 23 'no' 23 23
'arg1' 5 'yes' 5 5
'jre3' 34.6 'yes' 34.6 34.6
'wen9' 234 'yes' 234 234
'ple2' 2 'no' 2 2
'dbo8' 5 'no' 5 5
'oii4' 5 'yes' 5 5
'wnk3' 245 'yes' 245 245
'pnj5' 463 'no' 463 463
'wnn3' 6 'no' 6 6
'oks9' 23 'yes' 23 23
'pkn4' 2 'no' 2 2
'adw3' 22 'no' 22 22
'bas8' 23 'no' 23 23dataset | ismissing | replaceWithMissing