Fast String to Double Conversion

version 1.8 (16.7 KB) by

str2doubleq converts text to double like MATLAB's str2double, but up to 400x faster, with optional multithreading.

Rating: 4.64286 (17 ratings), 33 downloads

str2doubleq is equivalent to the MATLAB built-in str2double function, which converts char or cellstr arrays to the corresponding double arrays. The drawback of the built-in str2double is that it becomes very slow as the dataset grows.

str2doubleq exploits the fast string handling of C++. If your compiler supports the C++11 standard, or you have the Boost libraries installed, you can also use the multithreaded algorithm, which scales very well when the data set is sufficiently large.
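To make the multithreading idea concrete, here is a minimal sketch of chunked parallel parsing with C++11 std::thread. It is not the code behind USE_PARALLEL_ALGORITHM, which is not shown on this page; the names parse_range and parse_parallel are hypothetical, and std::strtod stands in for the actual parser.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper (not from str2doubleq.cpp): parse one contiguous
// slice [begin, end) of the input into the matching slice of the output.
static void parse_range(const std::vector<std::string>& in,
                        std::vector<double>& out,
                        std::size_t begin, std::size_t end)
{
    assert(begin <= end && end <= in.size() && out.size() == in.size());
    for (std::size_t i = begin; i < end; ++i)
        out[i] = std::strtod(in[i].c_str(), nullptr);  // stand-in parser
}

// Split the input into one chunk per worker thread. Every thread writes a
// disjoint slice of `out`, so no locking is needed.
std::vector<double> parse_parallel(const std::vector<std::string>& in,
                                   unsigned n_threads)
{
    std::vector<double> out(in.size());
    if (n_threads == 0) n_threads = 1;
    const std::size_t chunk = (in.size() + n_threads - 1) / n_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(in.size(), t * chunk);
        const std::size_t end   = std::min(in.size(), begin + chunk);
        if (begin == end) break;  // nothing left for this worker
        workers.emplace_back(parse_range, std::cref(in), std::ref(out),
                             begin, end);
    }
    for (std::thread& w : workers) w.join();
    return out;
}
```

Starting threads has a fixed cost, so a split like this only pays off when each chunk holds many strings, which matches the remark that the multithreaded path needs a sufficiently large data set.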

The function has been programmed to behave exactly like str2double.

The function arose from market data parsing problems that had to be solved in real time. With it, MATLAB can be as fast as traditional programming languages for this kind of string parsing.

Installation:

* Copy the file str2doubleq.cpp somewhere on your hard drive (example: C:\Test\str2doubleq.cpp).

* Launch MATLAB and compile the source file to generate a machine-dependent binary. If you have not selected a compiler yet, do that first (run mex -setup in the command window).

* Compile the source by typing mex <c-source folder>
(example: mex C:\Test\str2doubleq.cpp)

* Place the generated str2doubleq.mexw32 (32-bit) or str2doubleq.mexw64 (64-bit) in a folder on MATLAB's search path (Set Path).

* To increase performance even more, uncomment line 35 of str2doubleq.cpp (containing #define USE_PARALLEL_ALGORITHM). Remember that you need a sufficiently modern compiler or Boost (http://www.boost.org/) installed.

Now you can use the function in the normal MATLAB fashion. Run the test script test_str_to_double_performance.m (included in the zip file).

Comments and Ratings (28)

User12399

3. performance was even higher with real() wrapped around str2doubleq Ó.ó wtf?

User12399

1. Since I used the output to convert some numbers to logicals, and MATLAB does not allow that with complex doubles, I had to wrap real() around str2doubleq.
No complaint of course, only mentioning it in case somebody else encounters the problem too.

2. After noticing that I did not get any performance increase in my script, I ran the evaluation function several times with a steadily increasing input size.
Here is the funny thing: for input sizes of 2 and 5 the performance gain is "only" 2x or less.
A gain of more than 10x kicks in once inputSize > 6.

But like Quant Guy said, str2doubleq is designed for large arrays.

SUTHEP PINROJ

Peter Fraser

Thanks a million. My initial file parsing time has gone down from 20 seconds to 2 seconds.

H. Homann

worden235

Ingrid

I was getting bored of waiting for str2double (with repeated calls, 15 s of the 19 s my code needed were spent in str2double!). This was reduced to 0.093 seconds with this function, so now I only have to wait 3 seconds. THANKS

@Jonathan
I think you're right. A hacky fix is to change the call to mxArrayToString.

char *freeme = mxArrayToString(mxStr);
const char *s = freeme;
...
mxFree(freeme);

The code modifies the pointer s, so calling mxFree on s causes a segfault! This little dance gets around it.

For anyone that requires str2doubleq to return NaN rather than 0 when called on a blank string (eg. str2doubleq('')), you can add the line:
if (!(*s)) return false;

in the function parse_to_double right after the line "if (!s) return false;"
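The owner/cursor split described in this comment can be tried outside MATLAB. The sketch below is only an illustration: std::malloc and std::free stand in for mxArrayToString and mxFree, and the names duplicate_string and digits_in are hypothetical, not taken from str2doubleq.cpp.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <cstring>

// Stand-in for mxArrayToString: returns a heap copy the caller must release.
// (In a real MEX file the matching pair is mxArrayToString / mxFree.)
static char* duplicate_string(const char* src)
{
    assert(src != nullptr);
    char* copy = static_cast<char*>(std::malloc(std::strlen(src) + 1));
    if (copy) std::strcpy(copy, src);
    return copy;
}

// Counts the digits in `text` while walking a cursor through a private copy.
// Returns -1 if the copy could not be allocated.
int digits_in(const char* text)
{
    char* freeme = duplicate_string(text);  // owner: never modified
    if (!freeme) return -1;

    const char* s = freeme;                 // cursor: free to move
    int n = 0;
    for (; *s; ++s)                         // `s` leaves the allocation start
        if (std::isdigit(static_cast<unsigned char>(*s))) ++n;

    std::free(freeme);                      // release via the original pointer
    return n;
}
```

After the loop, s points at the terminating NUL rather than at the start of the allocation, which is why the release has to go through freeme.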

Jonathan

Upon further testing, this function leaks tons of memory. The function calls mxArrayToString() but does not call mxFree(), as required to release memory allocated to the array. In no time at all, repeated calls quickly exceed my machine's 72GB of RAM.

Jonathan

When Quant Guy said it was faster than Matlab's str2double, he wasn't joking!!!

@Matthias,

This is by design of the isreal function.

From the documentation:
If A has a stored imaginary part of value 0, isreal(A) returns logical 0 (false).

You may however expect that the returned number is not complex when the imaginary part is 0.

Jan Simon

The idea is very good, but the results are not reliable. I cannot recommend using this for production work. A fair rating is not easy in this case, so I have hesitated for some years now.

Matthias

Hello Lauri,
there are still some differences:
isreal(str2doubleq('1')) % 0 instead of 1
str2double('2.236')-str2doubleq('2.236') % is not 0 ('2.235' is fine)
str2double('1,1')-str2doubleq('1,1') % 9.9 instead of 0
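The '1,1' difference is consistent with str2double treating a comma as a thousands separator, so that it reads 11 where str2doubleq reads 1.1. A minimal sketch of one way to imitate that, assuming it is enough to drop every comma before parsing; the name parse_ignoring_commas is hypothetical and std::strtod stands in for the real parser.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical pre-processing step (not part of str2doubleq): remove every
// comma, then hand the cleaned text to a standard parser.
double parse_ignoring_commas(const char* s)
{
    assert(s != nullptr);
    std::string cleaned(s);
    cleaned.erase(std::remove(cleaned.begin(), cleaned.end(), ','),
                  cleaned.end());
    return std::strtod(cleaned.c_str(), nullptr);
}
```

The copy costs an allocation per string, so a fast parser would more likely skip commas inside its digit loop; the sketch only shows the intended result.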

Matthias

@Lauri, good work...
Some fine tuning is still required in your function:
str2doubleq('') % NaN instead of 0

Zhibo

It'd be superb if you could fix the memory leak problem mentioned by Rob, caused by invoking 'mxArrayToString' without calling 'mxFree' to free the memory later on.

Jan Simon

@Lauri: An excellent speedup! Thanks for this update.
The main time of my suggested one-liner in pure M (which looks less cryptic when it is expanded to 3 lines) is wasted by SPRINTF. Using CStr2String (see my FEX page), a smarter pre-allocation allows a faster creation of the long string. This is 20% slower than your str2doubleq.

Isn't it surprising that your parsing is so much faster than ATOF or STRTOD? What could the compiler manufacturers hide in their implementations? The excellent and fast parsing of Google's V8 engine is worth inspecting: http://code.google.com/p/double-conversion . Rounding problems are handled smartly and reliably, which is a very hard and complicated job.

Some fine tuning is required in your function:
str2doubleq('Inf') % NaN instead of Inf
str2doubleq('.i5') % 5 instead of NaN
str2doubleq('i') % 0 instead of 0 + 1i
str2doubleq('1e1.4') % 0.4 instead of NaN
str2doubleq('--1') % -1 instead of NaN
s = '12345678901234567890';
str2doubleq(s) - str2double(s) % 2048
s = '123.123e40';
str2doubleq(s) - str2double(s) % 1.547e26

Malformed input is an evil test, I know. But it would be very nice if your very efficient implementation were as reliable as MATLAB's STR2DOUBLE.
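Several of the malformed cases listed above ('1e1.4', '--1', '.i5') can be rejected by checking where a standard parser stopped. The sketch below uses std::strtod with its end pointer; it is not the parser in str2doubleq, the name strict_parse is hypothetical, and it does not cover complex input such as 'i'. Note also that strtod accepts some forms, such as hexadecimal literals, that str2double may not.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <limits>

// Returns the parsed value only if the whole string (apart from surrounding
// whitespace) was consumed as a number; otherwise returns NaN.
double strict_parse(const char* s)
{
    assert(s != nullptr);
    const double nan = std::numeric_limits<double>::quiet_NaN();

    char* end = nullptr;
    const double value = std::strtod(s, &end);
    if (end == s) return nan;               // no conversion was performed

    while (std::isspace(static_cast<unsigned char>(*end))) ++end;
    return (*end == '\0') ? value : nan;    // reject trailing characters
}
```

The same end-pointer check is what separates this from plain atof, which would return 2.2 for the input "2.2a".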

Zhibo

Modifying line 105 to
dval *= pow(10, exp);
seems to work. Great job and thanks again!
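For readers without the source at hand, the following sketch shows where a scaling step like dval *= pow(10, exp) sits in a hand-written parser. It is a simplified illustration, not the code around line 105 of str2doubleq.cpp; the name parse_decimal and its structure are assumptions, it handles only [sign]digits[.digits][e[sign]digits], and it is not guaranteed to be correctly rounded for long inputs.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>

// Hypothetical hand-rolled decimal parser. Returns false on malformed input
// and leaves `out` untouched in that case.
bool parse_decimal(const char* s, double& out)
{
    assert(s != nullptr);
    bool negative = false;
    if (*s == '+' || *s == '-') negative = (*s++ == '-');

    double mantissa = 0.0;
    int frac_digits = 0;
    bool any_digit = false;
    for (; std::isdigit(static_cast<unsigned char>(*s)); ++s) {
        mantissa = mantissa * 10.0 + (*s - '0');
        any_digit = true;
    }
    if (*s == '.') {
        ++s;
        for (; std::isdigit(static_cast<unsigned char>(*s)); ++s) {
            mantissa = mantissa * 10.0 + (*s - '0');
            ++frac_digits;              // each fraction digit shifts by 10^-1
            any_digit = true;
        }
    }
    if (!any_digit) return false;

    int exponent = 0;
    if (*s == 'e' || *s == 'E') {
        ++s;
        bool exp_negative = false;
        if (*s == '+' || *s == '-') exp_negative = (*s++ == '-');
        if (!std::isdigit(static_cast<unsigned char>(*s))) return false;
        for (; std::isdigit(static_cast<unsigned char>(*s)); ++s)
            exponent = exponent * 10 + (*s - '0');  // no overflow guard here
        if (exp_negative) exponent = -exponent;
    }
    if (*s != '\0') return false;       // trailing characters: malformed

    // Combine the written exponent with the shift implied by the fraction
    // digits, then scale the mantissa once.
    const double value = mantissa * std::pow(10.0, exponent - frac_digits);
    out = negative ? -value : value;
    return true;
}
```

For '123.45e7' the mantissa is 12345 with two fraction digits, so the single scaling step multiplies by 10^(7-2).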

Zhibo

Thanks for this very useful utility to convert large amounts of text into double arrays. However, the latest (06 Oct 2012) submission no longer works for inputs like str2doubleq('123.45e7').

Quant Guy

I submitted a new version of the function with a much more efficient algorithm and neater code.

I think that the new version (after the review process) is close to the fastest way string to double conversion can be done. Performance gains have risen from about 20x to about 80x-100x!

Also for Jan: the new version is much faster than your (cryptic) one-liner!

Jan Simon

I still want to stress that:
Num = reshape(sscanf(sprintf('%s#', CStr{:}), '%g#'), size(CStr))
is about 2.4 times faster than the C-mex approach. I think this is surprising, because the creation of the large string needs a lot of temporary memory. Obviously MATLAB's SSCANF is extremely fast. I guess it avoids the time-consuming conversion from mxChar to char. We could do this in C also!

As long as the one-liner in MATLAB is faster, I do not find this submission useful.
But it is written nicely and the approach is logical. Therefore I do not give a low rating.
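The join-then-scan idea in the one-liner above can also be expressed in C++: concatenate all strings with a '#' delimiter, then walk the buffer once. This is a sketch of the approach only, with the hypothetical name parse_joined; it makes no claim about matching the timings quoted in the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Join every cell with a trailing '#', then parse the long buffer in a
// single pass. Parsing stops at the first item that is not a number
// followed directly by '#'.
std::vector<double> parse_joined(const std::vector<std::string>& cells)
{
    std::size_t total = 0;
    for (const std::string& c : cells) total += c.size() + 1;

    std::string joined;
    joined.reserve(total);              // one allocation for the long string
    for (const std::string& c : cells) {
        joined += c;
        joined += '#';
    }

    std::vector<double> out;
    out.reserve(cells.size());
    const char* p = joined.c_str();
    while (*p) {
        char* end = nullptr;
        const double v = std::strtod(p, &end);
        if (end == p || *end != '#') break;  // malformed item: stop here
        out.push_back(v);
        p = end + 1;                         // continue after the delimiter
    }
    assert(out.size() <= cells.size());
    return out;
}
```

Because scanning stops at the first malformed item, the result can be shorter than the input. That is why this approach suits input that is known to be purely numeric, while str2double returns NaN for each bad element and keeps going.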

Matlab2010

just to be clear, for a string < 130 char, str2double is quicker.

Matlab2010

As per Rob's suggestion:

// X = STR2DOUBLEQ(S) converts the string S, which should be an
// ASCII character representation of a real value, to MATLAB's double
// representation. The string may contain digits, a decimal point,
// a leading + or - sign and 'e' preceding a power of 10 scale factor.
//
// X = STR2DOUBLEQ(C) converts the strings in the cell array of strings C
// to double. The matrix X returned will be the same size as C. NaN will
// be returned for any cell which is not a string representing a valid
// scalar value. NaN will be returned for individual cells in C which are
// cell arrays.

// Examples
// str2doubleq('123.45e7')
// str2doubleq('3.14159')
// str2doubleq({'2.71' '3.1415'})
// str2doubleq({'2.71' '3.1415'; 'abc','123.45e7'})

// NOTE ABOUT ATOF:
// For ultimate performance, the C function atof is the fastest option.
// A word of caution: atof behaves differently from MATLAB's str2double
// when s cannot be interpreted as a number.
// For example, the input "2.2a" produces the double 2.2 instead of NaN.
// When you know your input always resembles a true numeric value, it is "safe" to use atof.
// This is the case, for example, when you use regexp to capture tokens that are
// numeric by construction, e.g. (\d+)

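#include "mex.h"
#include <cstdlib>   // atof, for the optional fast path below
#include <string>
#include <sstream>

double string_to_double( const char *s )
{
    // If you uncomment this, comment out the rest of the code in this
    // function. Please read the note above about atof usage.
    // return atof(s);

    // Reuse a single stream across calls to avoid repeated construction.
    static std::istringstream iss;
    iss.clear(); iss.str(s);
    double x;
    iss >> x;
    // Accept only if a number was parsed and nothing but whitespace follows.
    if (!(iss && (iss >> std::ws).eof()))
    {
        return mxGetNaN();
    }
    return x;
}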

void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[] )
{
    double *writePtr;
    char *strPtr;

    if ( nrhs == 0 )
    {
        mexErrMsgTxt("Too few input arguments.");
    }
    else if ( nrhs >= 2 )
    {
        mexErrMsgTxt("Too many input arguments.");
    }

    if ( mxIsChar(prhs[0]) )
    {
        // Branch to handle chars: copy the char array into a C string
        strPtr = mxArrayToString(prhs[0]);
        // allocate memory for the output
        plhs[0] = mxCreateDoubleMatrix(1, 1, mxREAL);
        // set pointer to the beginning of the output memory
        writePtr = mxGetPr(plhs[0]);

        // mxArrayToString returns NULL on failure; report NaN in that case
        if (strPtr != 0)
        {
            *(writePtr) = string_to_double(strPtr);
        }
        else
        {
            *(writePtr) = mxGetNaN();
        }
        // release the copy made by mxArrayToString (mxFree accepts NULL)
        mxFree(strPtr);
    }
    else if ( mxIsCell(prhs[0]) )
    {
        mwSize mrows, ncols, i;
        mrows = mxGetM( prhs[0] );
        ncols = mxGetN( prhs[0] );
        // allocate memory for the results
        plhs[0] = mxCreateDoubleMatrix(mrows, ncols, mxREAL);
        // get pointer to the beginning of the output array
        writePtr = mxGetPr(plhs[0]);

        for (i = 0; i < mrows*ncols; i++)
        {
            mxArray *Context = mxGetCell(prhs[0], i);
            // uninitialized cells come back as NULL; non-char cells give NaN
            if ( Context == 0 || !mxIsChar(Context) )
            {
                *(writePtr+i) = mxGetNaN();
            }
            else
            {
                char *strPtr = mxArrayToString(Context);
                if (strPtr != 0)
                {
                    *(writePtr+i) = string_to_double(strPtr);
                }
                else
                {
                    *(writePtr+i) = mxGetNaN();
                }
                // release the per-cell copy to avoid leaking memory
                mxFree(strPtr);
            }
        }
    }
    else if ( mxIsDouble(prhs[0]) )
    {
        // return a matrix of NaNs with the same size as the input
        mwSize mrows, ncols, i;
        mrows = mxGetM( prhs[0] );
        ncols = mxGetN( prhs[0] );
        if (mrows == 0 && ncols == 0)
        {
            // An empty input array must return a scalar NaN
            mrows = 1; ncols = 1;
        }
        plhs[0] = mxCreateDoubleMatrix(mrows, ncols, mxREAL);
        writePtr = mxGetPr(plhs[0]);
        for (i = 0; i < mrows*ncols; i++)
        {
            *(writePtr+i) = mxGetNaN();
        }
    }
    else
    {
        // Any other input (e.g. an object of some class): return scalar NaN
        plhs[0] = mxCreateDoubleMatrix(1, 1, mxREAL);
        writePtr = mxGetPr(plhs[0]);
        writePtr[0] = mxGetNaN();
    }
}

Rob Ewalds

Excellent utility, thanks!
However, frequent calls to 'str2doubleq' revealed a memory leak:

'mxArrayToString' does not free the dynamic memory that the char pointer points to. Consequently, you should typically free the string (using mxFree) immediately after you have finished using it:
http://amath.colorado.edu/computing/Matlab/OldTechDocs/apiref/mxarraytostring.html

Your code features 2 calls to 'mxArrayToString' (lines 68, 98).
Adding the statement 'mxFree(strPtr);' on lines 75 and 107 and recompiling resolves the leak.

We stumbled upon this while reading a 150,000-line ASCII file and calling 'str2doubleq' for every line, which heavily drained MATLAB's available memory.

Now it works fine, thanks again for this highly useful routine.

Brian Emery

More than an order of magnitude faster! Very useful for reading large amounts of text. Clear instructions as well. Thanks for posting this!

Jan Simon

Some further tests with other parsers in your program:
strtod: 0.13 sec
sscanf: 0.16 sec

Another remark: "str2double('2.7i - 3-14')" is confusing as an example: this does not work with str2doubleq.

Jan Simon

Good idea and fast. Therefore it is really useful.

Some remarks:
1. The examples do not use your function, but Matlab's STR2DOUBLE.
2. Why do you treat DOUBLE input differently from other invalid inputs: DOUBLE=>NaN-matrix, SINGLE=>Scalar NaN?
3. Calling the function with uninitialized cell elements causes a NULL pointer exception: str2doubleq(cell(1, 3)). Strange, but it is helpful to always check for NULL after mxGetCell.
4. The conversion from mxChar (Unicode) to C strings wastes time. Is there a C++ function which can parse a Unicode string directly?
5. Please mention in the help section that input cells with >2 dimensions return a matrix. Or let the function return an array with the same dimensions as the input cell.
6. str2doubleq('Inf') returns NaN.
7. If you restrict the input to real values, you can parse a cell string in a different way:
d = reshape(sscanf(sprintf('%s#', c{:}), '%g#'), size(c));
For a {1 x 1000} cell string filled by sprintf('%.15g', rand) I get these timings on Matlab2009a, 1.5GHz PentiumM:
STR2DOUBLE: 2.03 sec
STR2DOUBLEQ: 0.44 sec
SSCANF(SPRINTF): 0.13 sec
And if you let CStr2String create the long string, it takes just 0.06 sec. Surprising! Your function looks so much more efficient looking at the code. So I assume the istringstream of my MSVC2008 must be a wreck. I'll try to use the old sscanf in C.
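On remark 4, one way to avoid a heap-allocated C string per element is to narrow the 16-bit code units into a small stack buffer and parse from there. The sketch below is an assumption about how that could look, not code from the submission: parse_utf16 is a hypothetical name, char16_t stands in for mxChar, and the 64-byte limit is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>

// Parse a number stored as `n` UTF-16 code units without any heap
// allocation. Returns NaN for empty, overlong, non-ASCII or malformed input.
double parse_utf16(const char16_t* units, std::size_t n)
{
    assert(units != nullptr || n == 0);
    const double nan = std::numeric_limits<double>::quiet_NaN();

    char buf[64];                                // fixed buffer of the sketch
    if (n == 0 || n >= sizeof buf) return nan;

    for (std::size_t i = 0; i < n; ++i) {
        if (units[i] > 0x7F) return nan;         // not an ASCII character
        buf[i] = static_cast<char>(units[i]);
    }
    buf[n] = '\0';

    char* end = nullptr;
    const double value = std::strtod(buf, &end);
    return (end == buf + n) ? value : nan;       // whole input must be used
}
```

In a MEX file the code units would come from mxGetChars and the length from mxGetNumberOfElements; whether this is actually faster than mxArrayToString would need to be measured.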

Updates

1.8

*Fixed a bug with scientific notation. Thanks for the feedback.

1.7

* Implemented a "hand massaged", highly efficient parser.
* Added support for parsing complex numbers.
* Restructured the code to be neater.

1.3

* Thanks to Jan Simon's feedback, fixed some bugs and documentation. Also gained about a 35% performance boost compared to the earlier implementation.

MATLAB Release
MATLAB 7.11 (R2010b)
Acknowledgements

Inspired: Faster alternative to builtin str2double
