readtable (excel) behavior for string type with empty cell vs spaces

73 views (last 30 days)
I am trying to make an excel import function robust to treat either empty cells or cells containing only spaces in a specific way, namely to fill with the string
""
(I know, this may not seem useful, but it is for me...)
% define the columns of data variable type and name
Variables = [...
"double","Index";
"string","StringData";
"string","Comment"];
% create import options for a table structure that has
% 1st row as a column type descriptor
% 2nd row as the traditional "header" with column names
% 3rd + rows as data
opts = spreadsheetImportOptions(...
"NumVariables",size(Variables,1),...
"VariableNames",Variables(:,2)',...
"VariableTypes",Variables(:,1)',...
"DataRange","A3",...
"VariableNamesRange","A2",...
"VariableDescriptionsRange","A1",...
"RowNamesRange","A3",...
"ImportErrorRule","fill");
% set the variable options for the string data column
opts = setvaropts(opts,"StringData","FillValue","");
out = readtable("example_empty_cell.xlsx",opts)
For the attached excel file, this returns
out =
6×3 table
Index StringData Comment
_____ __________ ______________________
1 1 <missing> "single space"
2 2 <missing> "three spaces"
3 3 <missing> "empty"
4 4 <missing> "apostrophe and space"
5 5 <missing> "apostrophe only"
6 6 "hello" "non empty string"
It's not what I would expect...but if my replacement string is instead
% set the variable options for the string data column
opts = setvaropts(opts,"StringData","FillValue"," ");
I get the expected
out =
6×3 table
Index StringData Comment
_____ __________ ______________________
1 1 " " "single space"
2 2 " " "three spaces"
3 3 " " "empty"
4 4 " " "apostrophe and space"
5 5 " " "apostrophe only"
6 6 "hello" "non empty string"
Perhaps this issue is not specific to readtable/setvaropts, but rather some general internal conversion of "" into <missing>?
One clue is that if I set the fill value to strings(0), it produces an error saying
Expected a value which can be converted to a character vector.
In any case, I'm wondering if there is a one-shot way to achieve what I want without having to post-process the imported data one way or another?

Accepted Answer

J. Alex Lee
J. Alex Lee on 22 Dec 2020
Edited: J. Alex Lee on 22 Dec 2020
According to Jemima's answer, confirmed that the issue is an internal conversion of the "" to <missing> within readtable. Further, this behavior is different in 2019b and previous. Ultimately my issue stems from the fact that these are not equivalent in other contexts that I care about, i.e., that
""==string(missing) % is false
ismissing("") % is also false
My own two work-arounds, which seem more rational to me, are:
  1. Import using the settings I already have, then convert <missing> into "" with rowfun or loops
  2. Anywhere I need to check for "", also check for missing as an alternative: (var=="" || ismissing(var))

More Answers (2)

Jemima Pulipati
Jemima Pulipati on 21 Dec 2020
Edited: Jemima Pulipati on 21 Dec 2020
Hello,
From my understanding, you want to fill the missing/empty values in table with an empty character.
Starting from R2020a, the behaviour of readtable has been modified with respect to the usage of import options (Check Compatibility Considerations). By default readtable uses the import options which implies all missing values are represented using "<missing>". So in your case, when you are trying to set the missing values using
opts = setvaropts(opts,"StringData","FillValue","");
This considers that empty character to be a missing value and uses the standard missing value representation of string i.e '<missing>'. But when you use:
opts = setvaropts(opts,"StringData","FillValue"," ");
This considers the fill value as a valid string and just sets the missing values.
This above behaviour occurs because of import options being used by readtable.
The behaviour of readtable() upto R2019b was different i.e it used the empty character representation instead of '<missing>' to represent missing values in tables.
The suggested workarounds are:
  • You may have to use 'Fomat: auto' to use the empty character representation (specifying this option chooses the behaviour that was present upto R2019b and does not by default consider the import options).
Example:
out = readtable("example_empty_cell.xlsx","Format","auto")
  • readtable can use the import options and after reading the table you may set the column values explicitly to display according to the use case.
Example :
out = readtable("example_empty_cell.xlsx",opts);
out(1:end-1,2) = {''};
  • Modifying the variable type to 'char' instead of 'string' for the column.
Variables = [...
"double","Index";
"char","StringData";
"string","Comment"];
This shows the empty character as a representation for the missing values as this is the representation for the 'char' datatype.
NaN double, single, duration, and calendarDuration
NaT — datetime
<missing> string
<undefined> categorical
' ' char
{''} cell of character arrays
  1 Comment
J. Alex Lee
J. Alex Lee on 21 Dec 2020
Edited: J. Alex Lee on 21 Dec 2020
Hmm. Thanks for the exaplanation and potential workarounds...
But if I understand the first work around correctly, your advice to use "Format auto" introduces an ambiguity in code readability that has matlab version-dependence, which I believe is not desirable...
Second workaround suggests using char type instead of string for import, and then post-process later...but this is not any better to me than post-processing the <missing> string value into "", which seems cleaner...
And I'm probably barking up the wrong tree here, but I'm not sure I understand the logic behind this behavior change, especially since
string(missing)==""
ismissing("")
both return false...ultimately, I can live with a <missing> string value if I can test for missing instead of equal to empty character in all contexts.

Sign in to comment.


Julian Hapke
Julian Hapke on 3 Jun 2021
Edited: Julian Hapke on 3 Jun 2021
If you do not use the FillValue property of the VariableImportOptions, but set EmptyFieldRule to "auto", missing strings are imported as "" instead of <missing>. Since you can set this for every variable separately, this does not make the use of "Format", "auto" necessary.
Edit: unfortunately in my test case, not all missing values are replaced by an empty string, some still come back as missing. this is really frustrating.
  3 Comments
Julian Hapke
Julian Hapke on 3 Jun 2021
yes, after checking my real world data instead of a simple test case, I must agree (see edit), it's not really useful.
J. Alex Lee
J. Alex Lee on 3 Jun 2021
Edited: J. Alex Lee on 3 Jun 2021
Ah yes. i think maybe the core of the issue is not how it is imported, but rather how downstream of that we can treat the 2 things semantically...I guess you could argue the below results are logical/consistent...but could it also be argued that ismissing("") ought to return true?
ismissing("")
ans = logical
0
isempty("")
ans = logical
0
ismissing(missing)
ans = logical
1
isempty(missing)
ans = logical
0

Sign in to comment.

Categories

Find more on Data Type Conversion in Help Center and File Exchange

Products


Release

R2020b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!