Documentation

Clean Messy and Missing Data in Tables

This example shows how to find, clean, and delete table rows with missing data.

Create and Load Sample Data

Create a comma-separated text file, messy.csv, that contains the following data.

A,B,C,D,E
afe1,3,yes,3,3
egh3,.,no,7,7
wth4,3,yes,3,3
atn2,23,no,23,23
arg1,5,yes,5,5
jre3,34.6,yes,34.6,34.6
wen9,234,yes,234,234
ple2,2,no,2,2
dbo8,5,no,5,5
oii4,5,yes,5,5
wnk3,245,yes,245,245
abk6,563,,563,563
pnj5,463,no,463,463
wnn3,6,no,6,6
oks9,23,yes,23,23
wba3,,yes,NaN,14
pkn4,2,no,2,2
adw3,22,no,22,22
poj2,-99,yes,-99,-99
bas8,23,no,23,23
gry5,NA,yes,NaN,21

There are many different missing data indicators in messy.csv.

  • Empty string ('')

  • period (.)

  • NA

  • NaN

  • -99

Create a table from the comma-separated text file. To specify strings to treat as empty values, use the 'TreatAsEmpty' name-value pair argument with the readtable function.

T = readtable('messy.csv',...
    'TreatAsEmpty',{'.','NA'})
T = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'poj2'     -99    'yes'     -99     -99
    'bas8'      23    'no'       23      23
    'gry5'     NaN    'yes'     NaN      21

T is a table with 21 rows and five variables. 'TreatAsEmpty' only applies to numeric columns in the file and cannot handle numeric literals, such as '-99'.

Summarize Table

View the data type, description, units, and other descriptive statistics for each variable by creating a table summary using the summary function.

summary(T)
Variables:

    A: 21x1 cell string

    B: 21x1 double
        Values:

            min       -99
            median     14
            max       563
            NaNs        3

    C: 21x1 cell string

    D: 21x1 double
        Values:

            min       -99
            median      7
            max       563
            NaNs        2

    E: 21x1 double
        Values:

            min       -99
            median     14
            max       563

When you import data from a file, the default is for readtable to read any variables with nonnumeric elements as a cell array of strings.

Find Rows with Missing Values

Display the subset of rows from the table, T, that have at least one missing value.

TF = ismissing(T,{'' '.' 'NA' NaN -99});
T(any(TF,2),:)
ans = 

      A        B       C       D      E 
    ______    ___    _____    ___    ___

    'egh3'    NaN    'no'       7      7
    'abk6'    563    ''       563    563
    'wba3'    NaN    'yes'    NaN     14
    'poj2'    -99    'yes'    -99    -99
    'gry5'    NaN    'yes'    NaN     21

readtable replaced '.' and 'NA' with NaN in the numeric variables, B, D, and E.

Replace Missing Value Indicators

Clean the data so that the missing values indicated by code -99 have the standard MATLAB® numeric missing value indicator, NaN.

T = standardizeMissing(T,-99)
T = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'poj2'     NaN    'yes'     NaN     NaN
    'bas8'      23    'no'       23      23
    'gry5'     NaN    'yes'     NaN      21

standardizeMissing replaces three instances of -99 with NaN.

Create Table with Complete Rows

Create a new table, T2, that contains only the complete rows—those without missing data.

TF = ismissing(T);
T2 = T(~any(TF,2),:)
T2 = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'bas8'      23    'no'       23      23

T2 contains 16 rows and 5 variables.

Organize Data

Sort the rows of T2 in descending order by C, and then sort in ascending order by A.

T2 = sortrows(T2,{'C','A'},{'descend','ascend'})
T2 = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'oii4'       5    'yes'       5       5
    'oks9'      23    'yes'      23      23
    'wen9'     234    'yes'     234     234
    'wnk3'     245    'yes'     245     245
    'wth4'       3    'yes'       3       3
    'adw3'      22    'no'       22      22
    'atn2'      23    'no'       23      23
    'bas8'      23    'no'       23      23
    'dbo8'       5    'no'        5       5
    'pkn4'       2    'no'        2       2
    'ple2'       2    'no'        2       2
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6

In C, the rows are grouped first by 'yes', followed by 'no'. Then in A, the rows are listed alphabetically.

Reorder the table so that A and C are next to each other.

T2 = T2(:,{'A','C','B','D','E'})
T2 = 

      A         C       B       D       E  
    ______    _____    ____    ____    ____

    'afe1'    'yes'       3       3       3
    'arg1'    'yes'       5       5       5
    'jre3'    'yes'    34.6    34.6    34.6
    'oii4'    'yes'       5       5       5
    'oks9'    'yes'      23      23      23
    'wen9'    'yes'     234     234     234
    'wnk3'    'yes'     245     245     245
    'wth4'    'yes'       3       3       3
    'adw3'    'no'       22      22      22
    'atn2'    'no'       23      23      23
    'bas8'    'no'       23      23      23
    'dbo8'    'no'        5       5       5
    'pkn4'    'no'        2       2       2
    'ple2'    'no'        2       2       2
    'pnj5'    'no'      463     463     463
    'wnn3'    'no'        6       6       6

See Also

| | |

Related Examples

Was this topic helpful?