Quickest way to convert numeric array to a cell array of strings

In general terms I need to process many text files with mixed data types, both numeric and text. For the numeric data, I need to process them for quality since I deal with many sources of data and there is no standard I can strictly enforce. The data could be in integer, floating point, or some complex format, and I need to clean these files.
The input data is converted to numbers quickly with textscan. However, converting the newly created numbers back to strings is taking too long (6 times longer than converting the input text to numbers).
I'm currently using a combination of sprintf and textscan to convert a numeric array of doubles into a cell array of strings without losing precision.
numstr = sprintf(sprntffrmtstr, num);
vlnmcllnst = textscan(numstr,txtscnfrmtstr,'delimiter',' ');
vlcll = vlnmcllnst{1};
Lines 1 and 2 of the snippet above are taking up a significant amount of time in the profiler. I use this scheme in a for loop to process output from the textscan that converted the input text to numbers. Each loop iteration handles one column vector of numbers, and both numstr and vlnmcllnst were pre-allocated before the loop.
Can someone speed this up?
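For readers who want to reproduce the scheme, here is a minimal sketch of the sprintf/textscan round trip described above. The question does not show the actual format strings, so '%.17g ' and '%s' are assumptions, and the sample vector is made up for illustration.

```matlab
% Hypothetical sample data; the real input comes from an earlier textscan call.
num = [1; 2.5; pi; 1e-10; 123456789.123];

% Assumed format: 17 significant digits is enough to round-trip a double.
% One sprintf call renders the whole vector, one space after each value.
numstr = strtrim(sprintf('%.17g ', num));   % strtrim drops the trailing space

% Split the long string back into one string per value.
parts = textscan(numstr, '%s', 'Delimiter', ' ');
vlcll = parts{1};                           % N-by-1 cell array of strings
```

Calling sprintf once on the whole vector, rather than once per element, is what keeps this scheme vectorized. The cost that remains is building and re-splitting one long string.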
  6 Comments
Cedric on 11 May 2013
If I understand correctly, you get this type of CSV file from somewhere and you convert it into cell arrays using TEXTSCAN, holding both numeric and alphanumeric types, and then you want to convert the numeric values back into strings? Is the purpose to export back to new/updated CSV files?
You want to be flexible, but it seems to me that all columns (channels?) in your CSV file have the same structure. Do you determine dynamically (and how?) which columns are numeric and which are not? And are you building the formatspec dynamically for SPRINTF/TEXTSCAN (as you seem to have 189 columns)?
Will on 11 May 2013
Edited: Will on 11 May 2013
It sounds like you understand exactly what I've been saying.
The main reason to convert the text to numbers and back to text is to clean/convert any numeric expressions to a standard format before I process the text for character analysis to determine the number of digits, etc. While there are none of these expressions in this data, I have no guarantee that tomorrow it wouldn't be there because of the various sources this data is coming from. This is a robust feature and I have a mode to skip this step if there is ever a need for these numeric expressions to be left as the source text.
I detect whether each column is numeric by attempting a text-to-number conversion and catching any errors. I build the format spec from these results, and I've been tinkering with the idea of varying the size of the format string to improve performance. A couple of tests have indicated to me that the performance of textscan and sprintf improves with a larger format string, up to a point. It looks like 4000 values at once with delimiters may be degrading the performance. I will need to test this theory out more.
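One way such a test could be set up is sketched below. This is not the code used in the thread: the data, the chunk sizes, and the '%.17g ' format are all assumptions picked for illustration, and no timings are claimed here.

```matlab
% Hypothetical test data and chunk sizes.
num = rand(20000, 1);
chunkSizes = [100 1000 4000 20000];
elapsed = zeros(size(chunkSizes));

for k = 1:numel(chunkSizes)
    n = chunkSizes(k);
    out = cell(numel(num), 1);              % pre-allocate the result
    tic;
    for first = 1:n:numel(num)
        last = min(first + n - 1, numel(num));
        % Convert one chunk of values to a single delimited string ...
        s = strtrim(sprintf('%.17g ', num(first:last)));
        % ... and split it back into one string per value.
        c = textscan(s, '%s', 'Delimiter', ' ');
        out(first:last) = c{1};
    end
    elapsed(k) = toc;                       % seconds spent at this chunk size
end
```

Comparing the entries of elapsed across chunk sizes (ideally averaged over several runs) would show whether very large chunks really degrade performance on a given machine.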

Answers (2)

Cedric on 12 May 2013
Edited: Cedric on 12 May 2013
Ok, to be honest, my issue at this point is that this approach is not that common, and I can't figure out whether you are an experienced programmer in other languages and you know that it is the way to go - in which case I should focus on optimizing just a few lines of your code - or if you are less experienced - in which case I should discuss the general approach.
As a typical Swiss guy, I'll just take the central path ;-) and propose to discuss some simple code, so we have something concrete for brainstorming.
In the following, I read your CSV file, multiply all numeric values by 2, and export the outcome (including numeric and text values). I try to keep it simple at this stage, so I am using regexp to split the first line of data instead of doing some more complicated f/text-scan/f analysis.
fname_in = 'exampledata.csv' ;
fname_out = 'exampledata_new.csv' ;
% - Open input/output files.
fid_in = fopen(fname_in, 'r') ;
fid_out = fopen(fname_out, 'w') ;
% - Copy header (fgetl strips the newline, so write it back explicitly).
line = fgetl(fid_in) ;
fprintf(fid_out, '%s\n', line) ;
% - Analyse first line of data, define # of columns
%   and which ones are numeric.
line = fgetl(fid_in) ;
buffer = regexp(line, ',', 'split') ;
nCol = numel(buffer) ;
data = str2double(buffer) ;
isnum = ~isnan(data) ;          % Vector, flag numeric columns.
% - Build export format.
fmt = cell(1, nCol) ;
fmt(:) = {'%s,'} ;
fmt(isnum) = {'%g,'} ;          % Default format for numeric
fmt = [fmt{:}] ;                % data is %g at this point.
fmt = [fmt(1:end-1), '\n'] ;    % Replace trailing comma with a newline.
% - Process rest of the file.
while true
    % Process numeric values.
    data_new = 2 * data(isnum) ;
    % Export modified line.
    buffer(isnum) = num2cell(data_new) ;
    fprintf(fid_out, fmt, buffer{:}) ;
    % Exit if end of file.
    if feof(fid_in), break ; end
    % Read line and extract numeric data.
    buffer = regexp(fgetl(fid_in), ',', 'split') ;
    data = str2double(buffer) ;
end
% - Close files, free resources, etc.
fclose(fid_in) ;
fclose(fid_out) ;
While this code is not robust, it has some flexibility in the sense that the number of columns could vary and the nature of columns (numeric/text) is detected.
Now, if I understand correctly, you want to process the first line a bit more in order to get more information about the format of each column (so you can reproduce it exactly in the output)?
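To make that question concrete, here is a rough sketch of what such a first-line analysis might look like: it counts the decimal places of each numeric token and builds a matching '%.Nf' format per column. The tokens are hypothetical, and the digit-counting rule is an assumption that only covers plain fixed-point notation (no exponents).

```matlab
% Hypothetical first data line, already split on commas.
tokens = {'12.50', 'abc', '3', '0.125'};

vals  = str2double(tokens);
isnum = ~isnan(vals);                       % flag numeric columns

fmt = repmat({'%s'}, size(tokens));         % default: text column
for k = find(isnum)
    % Digits after the decimal point; empty when there is no fractional part.
    dec = regexp(tokens{k}, '(?<=\.)\d+$', 'match', 'once');
    fmt{k} = sprintf('%%.%df', numel(dec)); % e.g. '%.2f' for '12.50'
end
fmtLine = [strjoin(fmt, ','), '\n'];        % one format string per output line
```

With these tokens, sprintf(fmtLine, 12.5, 'abc', 3, 0.125) reproduces the original line, trailing zero included. Tokens such as '1e5' would need an extra rule.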

Will on 16 May 2013
To answer your question, I'm an engineer, definitely not a seasoned coder by any means, but not necessarily a beginner at MATLAB coding. That is why I was inquiring here about the speed of various number-to-text schemes. I have working code that is as flexible as I need it to be, but it just doesn't quite cut the mustard for speed.
I will have to try your regexp scheme as far as converting the text to numbers and compare, but from what I've researched, regexp doesn't compare with textscan or fscanf/sscanf in the speed department.
For now I have moved on to optimizing other parts of the code, and I created a purely numeric way to extract almost all of the information I need from the numeric data, which is extremely fast. I have also found some other numeric ways of, say, extracting the nth digit from a number. I'll need to test these out more to see if they provide enough capability to work with the data.
Thanks for the suggestions.
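For anyone reading later, a purely numeric nth-digit extraction can look like the sketch below. It is only an illustration of the general idea, not the code mentioned in this post, and it assumes integer-valued inputs small enough to be represented exactly as doubles; the variable names and sample values are made up.

```matlab
% Hypothetical integer-valued inputs.
x = [98765; 120];
n = 3;                                      % 1 = ones digit, 2 = tens, 3 = hundreds

% Shift the wanted digit into the ones place, drop the fraction,
% then keep only the last digit. Works element-wise on arrays.
digit = mod(floor(abs(x) ./ 10.^(n - 1)), 10);
```

Here digit comes out as [7; 1]. Digits after the decimal point are less reliable with this approach because of binary floating-point rounding.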
  1 Comment
Cedric on 16 May 2013
Edited: Cedric on 16 May 2013
Regexp was not the point of the code above; it was a simple way to achieve data split/extraction until I fully understand what you want(ed) to achieve, especially on the part that builds the character string to output (precisely the "number to text scheme" that you refer to).
