How can I find numeric values (stored as strings) in table of strings and convert them to double?

Hello,
I have some experience with Matlab and I've found a lot of solutions online so far, but I can't find anything that would solve my current problem. As there are some threads that are similar but not quite IT, I will try to be as specific as possible.
Background:
  • We have a really huge tool that processes our entire data. It reads a lot of input data (via readtable from Excel-Files), combines all the data, compares, calculates, appends etc. etc. In the end, there is a single table (a literal Matlab table) and we write it to an Excel-File with writetable.
  • We experienced over the course of building this tool that it's easier to force the VariavleType to be string. So basically this is our readtable Code:
opts = detectImportOptions(xls,'Sheet',sheet);
opts = setvartype(opts, opts.SelectedVariableNames, 'string'); % force VariableType to be string (both ext and numbers)
data = readtable(xls,opts,'ReadVariableNames',false,'basic',true);
  • It makes comparisons easier, but of course, if we want to calculate something, we have to use str2double and string again:
data.tq = string(str2double(data.x) + str2double(data.y));
Problem:
So this table, we create, is a table full of strings, where also the numbers are stored as strings. With Matlab2018 writetable worked perfectly or let's say "forgivingly". After writetable, we open the Excel-File via ActiveX and do the formatting directly in Excel based on CustomProperties where we, e.g., defined the number of decimals. Even though, we have a table with numbers stored as strings, we've never had a problem to format them in Excel. They, somehow, were stored as numbers in Excel:
Now I'm testing our tool for Matlab2022 and I guess, Matlab has become more rigorous over time. So when we use writetable now, numbers stored as string in the Matlab table will be also be stored as string in Excel. Which is frustrating because now, the formatting via ActiveX doesn't work anymore .
I tried various things:
  1. PreserveFormat in writetable, but apparently it means preserving the Excel Format
  2. Converting the table to an array and/or cell, trying to convert strings to numbers where there are numbers. But converting a table into an array or cell will lead to numeric values stored as char - same problem.
  3. I tried this: How to extract the numerial values out of readtable output? but cellfun(@isnumeric,table2cell(data)) doesn't work because of issue in 2. Or maybe it does with some modifcations that I don't know yet.
  4. In one case, this workaround works. (But only because, in this case, I could set a CustomProperty that allows me to define the variables that are numbers and unfortunately, this is not always the case. So not a universal solution.)
idxDecNum = find(strcmp('num',data.Properties.CustomProperties.VarOutType)); % Find columns where the CustomProperty VarOutType is defined as 'num'
data = convertvars(out,[idxDecNum],'double'); % convert only those found to double
for i = 1:size(idxDecNum,2) % convert each column to cell and store them in table
data.(data.Properties.VariableNames{idxDecNum(i)}) = num2cell(data.(data.Properties.VariableNames{idxDecNum(i)}));
end
My question(s):
  • Is there a possibility to find numeric values that are stored as string/char within a cell array and to convert only them? To be clear, the columns either contain purely numeric values or purely text. There is no mix of numbers and text within a column. But, of course, entries might be empty. So it would not be possible to check the first row for numeric and text values as the respective value might just be empty in the first row and contain data in the second or any following row.
  • Or is there an ActiveX command that converts all numbers stored as text to numbers? (However, I'd rather fix this problem in Matlab.)
Any kind of ideas are appreciated! But no, I really would like to avoid changing the readtable-command. First, I want to try to find a solution for the end.
Thank you, Laura

3 Comments

"...I really would like to avoid changing the readtable-command."
I just knew the saga would end up with the above! :)
However, since you also say "There is no mix of numbers and text within a column" and "I'd rather fix this problem in Matlab" in the long run it will be a much more robust and maintainable solution to fix the problem where you created it by importing the data correctly to start with.
"The fastest and easiest code to debug is that which you don't have to write in the first place..." :)
Providing a sample file that illustrates the input file structure could help...
So you painted yourself into a corner by relying on some undocumented and very dubious behavior that was later fixed, all to avoid handling numeric data as numeric data? Data design that leads to complicated, inefficient (all of those type conversions, storing numeric data as text) and fragile (proof is in the question) code is perhaps a sign that the data design should be revised.
Although not the advice you requested, perhaps you should revise your overall approach and treat numeric data as ... numeric.
@dpb @Stephen23 Thank you for your opinion! :) Voss's solution worked and I will use it for now. However, you are both absolutely right! We should definitely take the time to revise our code completely! :)

Sign in to comment.

 Accepted Answer

"Is there a possibility to find numeric values that are stored as string/char within a cell array and to convert only them?"
Yes.
Example:
% a cell array containing mixed types:
C = {1 '2' "green" []; 'forest' "" 5 "6"; '7' 'tree' 9 NaN}
C = 3×4 cell array
{[ 1]} {'2' } {["green"]} {0×0 double} {'forest'} {["" ]} {[ 5]} {["6" ]} {'7' } {'tree'} {[ 9]} {[ NaN]}
% find the chars and strings:
is_text = cellfun(@(x)ischar(x) || isstring(x),C);
% convert the char/string elements to numeric.
% use cellfun(@str2double,_) rather than str2double alone, in order to
% handle mixed chars and strings (e.g., str2double({'4' "6"}) returns [4 NaN]):
C_numeric = NaN(size(C));
C_numeric(is_text) = cellfun(@str2double,C(is_text));
% take the non-NaN elements of C_numeric (i.e., those that were in fact
% numbers stored as chars/strings) and put them where they belong in C:
is_numeric_text = ~isnan(C_numeric);
C(is_numeric_text) = num2cell(C_numeric(is_numeric_text))
C = 3×4 cell array
{[ 1]} {[ 2]} {["green"]} {0×0 double} {'forest'} {["" ]} {[ 5]} {[ 6]} {[ 7]} {'tree'} {[ 9]} {[ NaN]}
A very similar (simpler) alternative:
% a cell array containing mixed types:
C = {1 '2' "green" []; 'forest' "" 5 "6"; '7' 'tree' 9 NaN};
% apply str2double on all cells (returns NaN for those already numeric):
C_numeric = cellfun(@str2double,C);
% take the non-NaN elements of C_numeric (i.e., those that were in fact
% numbers stored as chars/strings) and put them where they belong in C:
is_numeric_text = ~isnan(C_numeric);
C(is_numeric_text) = num2cell(C_numeric(is_numeric_text))
C = 3×4 cell array
{[ 1]} {[ 2]} {["green"]} {0×0 double} {'forest'} {["" ]} {[ 5]} {[ 6]} {[ 7]} {'tree'} {[ 9]} {[ NaN]}

More Answers (1)

Well, against better judgement, if are adamant about not fixing the problem at its core, then given the previously stated condition that the table variables are all either string or numeric (and a MATLAB table can't mix data types in a variable anyway, other than using a cell array), then I'd do something more like--
isNumberColumn=varfun(@(v)all(isfinite(str2double(v))),tYourTable,'OutputFormat','uniform');
tYourTable=convertvars(tYourTable,tYourTable.Properties.VariableNames(isNumberColumn),'double');
The above will attempt to convert all columns first, but only those that are determined to be able to be will be put into the table...of course, you could have determined this when reading the file by using detectImportOptions that would have found out which were numeric and handled it right off the bat.
With a well-formed input file, using detectImportOptions is probably not required; the builtin scanning inside readtable and friends will almost certainly get it right unless the input file format isn't kosher.
It's certainly not clear what issue you may have run into originally to have chosen this route; it would still be helpful to see the input file format; it just doesn't seem like one would choose to do the force-to-string thing unless there's an issue in the file format.

3 Comments

Thank you very much, I'll try that too!
I'm going off topic now: Unfortunately, it's not that simple to give you an example. We have more than 10 different product inputs of different kinds, different product cycle stages and sources. Meaning, that formatting of products is different in each input file, headers are different (sometimes we need more than one row to make the variables unique). The only thing they have in common is that they are Excel files. All these data need to be matched and converted to the same format so that a product info from cycle stage 1 is matched with data from stage 2 and 3. We have a lot of code (around 20 scripts and many more helper functions) that basically just compares/copies/moves data around given various criteria, ... And my guess is, that two years ago, when we started developing this tool, it was easier (with our knowledge back then) to compare just strings.
I have to mention, coding just makes up 10% of our time and we are just a small group of engineers who have a "thing" for coding. Sometimes we are lucky that maybe one person has more capacity to dive deeper into the code. Each of our knowledge level is completely different but rather than one person coding everything, our main requirement has always been just that it needs to work. Two years ago we didn't know anything about tables but we needed a code that worked asap. Over time, we've gained more experience and it's happened quite often that I found code from two years ago and revised it because since then I've learned a way to do it without looping, e.g. And it will be like this again. However, changing the very core of such a grown tool is more time-consumig than just replacing a loop with a better solution. So please bear with me if I'm going for the quick solution first :)
@Laura V. this is a well known issue and very common, even in commercial SW developement.
There is often a point where the effort required to retain and maintain some code is more than the cost of rewriting it in a better way. Even big apps and online services have to do this sometimes:
Whether you expected it or not at the start of the project, you are now co-authors and developers of code: it is worth acknowledging that, because there is a large body of knowledge and experience that you can benefit from, including the importance of versioning, backups, code helper tools, branching, (backwards/forwards) compatibility, and countless other things that countless developers have learned through trial, error, pain and lots of tears. Learning from the mistakes of others is faster than making all mistakes yourself :)
No one here can tell you if/when code refactoring would be the best for your project... but I recommend that you keep an open mind and stay well-informed :)
If you're not careful, that 10% will keep growing to counteract the proliferation of tools and datasets...@Stephen23's points are well taken to heart.
One suggestion I'd make going forward in the interim to help would be to build a library of import options objects for a given product/version and save the library of those. Then you can have your main code detect the version/product and use the prebuilt options object for that particular file on import. Ideally, your product design built in a product ID/release that you can read; if not that's a second immediate thing to add going forward.

Sign in to comment.

Products

Release

R2022b

Asked:

on 23 Nov 2022

Commented:

dpb
on 24 Nov 2022

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!