Clean Messy and Missing Data in Tables

This example shows how to find, clean, and delete table rows with missing data.

Create and Load Sample Data

Create a comma-separated text file, messy.csv, that contains the following data.

afe1,3,yes,3,3
egh3,.,no,7,7
wth4,3,yes,3,3
atn2,23,no,23,23
arg1,5,yes,5,5
jre3,34.6,yes,34.6,34.6
wen9,234,yes,234,234
ple2,2,no,2,2
dbo8,5,no,5,5
oii4,5,yes,5,5
wnk3,245,yes,245,245
abk6,563,,563,563
pnj5,463,no,463,463
wnn3,6,no,6,6
oks9,23,yes,23,23
wba3,,yes,NaN,14
pkn4,2,no,2,2
adw3,22,no,22,22
poj2,-99,yes,-99,-99
bas8,23,no,23,23
gry5,NA,yes,NaN,21

There are many different missing data indicators in messy.csv.

  • Empty string ('')

  • period (.)

  • NA

  • NaN

  • -99

Create a table from the comma-separated text file and indicate that the first row of the file does not represent variable names. Use the 'TreatAsEmpty' name-value pair argument with readtable to specify strings to treat as empty values.

T = readtable('messy.csv',...
    'ReadVariableNames',false,...
    'TreatAsEmpty',{'.','NA'})
T = 

     Var1     Var2    Var3     Var4    Var5
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'poj2'     -99    'yes'     -99     -99
    'bas8'      23    'no'       23      23
    'gry5'     NaN    'yes'     NaN      21

T is a table with 21 rows and 5 variables. 'TreatAsEmpty' only applies to numeric columns in the file and cannot handle numeric literals, such as '-99'.

Summarize the Table

View the data type, description, units, and other descriptive statistics for each variable by using summary to summarize the table.

summary(T)
Variables:

    Var1: 21x1 cell string

    Var2: 21x1 double
        Values:

            min       -99   
            median     14   
            max       563   
            NaNs        3   

    Var3: 21x1 cell string

    Var4: 21x1 double
        Values:

            min       -99   
            median      7   
            max       563   
            NaNs        2   

    Var5: 21x1 double
        Values:

            min       -99   
            median     14   
            max       563   

When you import data from a file, the default is for readtable to read any variables with nonnumeric elements as a cell array of strings.

Find Rows with Missing Values

Display the subset of rows from the table, T, that have at least one missing value.

TF = ismissing(T,{'' '.' 'NA' NaN -99});
T(any(TF,2),:)
ans = 

     Var1     Var2    Var3     Var4    Var5
    ______    ____    _____    ____    ____

    'egh3'    NaN     'no'       7       7 
    'abk6'    563     ''       563     563 
    'wba3'    NaN     'yes'    NaN      14 
    'poj2'    -99     'yes'    -99     -99 
    'gry5'    NaN     'yes'    NaN      21 

readtable replaced '.' and 'NA' with NaN in the numeric variables, Var2, Var4, and Var5.

Replace Missing Value Indicators

Clean the data so that the missing values indicated by code -99 have the standard MATLAB® numeric missing value indicator, NaN.

T = standardizeMissing(T,-99)
T = 

     Var1     Var2    Var3     Var4    Var5
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'poj2'     NaN    'yes'     NaN     NaN
    'bas8'      23    'no'       23      23
    'gry5'     NaN    'yes'     NaN      21

standardizeMissing replaces three instances of -99 with NaN.

Create a Table with Complete Rows

Create a new table, T2, that contains only the complete rows—those without missing data.

TF = ismissing(T);
T2 = T(~any(TF,2),:)
T2 = 

     Var1     Var2    Var3     Var4    Var5
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'bas8'      23    'no'       23      23

T2 contains 16 rows and 5 variables.

Organize the Data

Sort the rows of T2 in descending order by Var3, and then in ascending order by Var1.

T2 = sortrows(T2,{'Var3','Var1'},{'descend','ascend'})
T2 = 

     Var1     Var2    Var3     Var4    Var5
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'oii4'       5    'yes'       5       5
    'oks9'      23    'yes'      23      23
    'wen9'     234    'yes'     234     234
    'wnk3'     245    'yes'     245     245
    'wth4'       3    'yes'       3       3
    'adw3'      22    'no'       22      22
    'atn2'      23    'no'       23      23
    'bas8'      23    'no'       23      23
    'dbo8'       5    'no'        5       5
    'pkn4'       2    'no'        2       2
    'ple2'       2    'no'        2       2
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6

In Var3, the rows are grouped first grouped by 'yes', followed by 'no'. Then in Var1, the rows are listed alphabetically.

See Also

| | |

Related Examples

Was this topic helpful?