Code covered by the BSD License  

Fast String to Double Conversion

Rating: 4.6 (8 ratings) | File Size: 16.7 KB | File ID: #28893


29 Sep 2010 (Updated )

str2doubleq converts text to double like MATLAB's str2double, but up to 400x faster, and it can run multithreaded.


File Information
Description

str2doubleq is equivalent to the MATLAB built-in str2double function, which converts a char array or a cellstr array to the corresponding double array. The drawback of the built-in str2double is that it becomes very slow as the dataset grows.

str2doubleq exploits the fast string handling of C++. If you have a compiler that supports the C++11 standard, or you have the Boost libraries installed, you can also use the multithreaded algorithm, which scales very well when the dataset is sufficiently large.
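To illustrate the idea of a multithreaded conversion, here is a minimal standalone sketch using C++11 std::thread. It is not the code shipped in str2doubleq.cpp: the names parse_or_nan and parse_parallel are hypothetical, std::strtod stands in for the submission's own parser, and the MEX interface is left out so that the sketch compiles without MATLAB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <limits>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: parse one string, NaN unless the whole string is a number.
static double parse_or_nan(const std::string& s) {
    const char* begin = s.c_str();
    char* end = nullptr;
    const double v = std::strtod(begin, &end);
    const bool ok = (end != begin) && (*end == '\0');
    return ok ? v : std::numeric_limits<double>::quiet_NaN();
}

// Hypothetical helper: convert all strings using at most n_threads workers.
// Each worker writes to its own disjoint slice of the output, so no locking
// is needed.
std::vector<double> parse_parallel(const std::vector<std::string>& in,
                                   unsigned n_threads) {
    assert(n_threads > 0);
    std::vector<double> out(in.size());
    const std::size_t chunk = (in.size() + n_threads - 1) / n_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t lo = std::min(in.size(), t * chunk);
        const std::size_t hi = std::min(in.size(), lo + chunk);
        if (lo == hi) break;  // no work left for this or later workers
        workers.emplace_back([&in, &out, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) out[i] = parse_or_nan(in[i]);
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Starting threads has a fixed cost, which is one plausible reason why a parallel version only pays off for sufficiently large inputs.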

The function has been programmed to behave exactly like str2double.

The original need for the function arose from market data parsing problems that had to be solved in real time. With it, MATLAB can be as fast as traditional programming languages for this kind of string parsing.

Installation:

* Copy the file str2doubleq.cpp somewhere on your hard drive (example: C:\Test\str2doubleq.cpp).

* Launch MATLAB and compile the source file to generate the machine-dependent binary. If you have not selected a compiler yet, do that first (run mex -setup in the command window).

* Compile the source by typing mex <path to the source file>
(example: mex C:\Test\str2doubleq.cpp)

* Place the generated str2doubleq.mexw32 (32-bit) or str2doubleq.mexw64 (64-bit) in a folder on MATLAB's search path (for example via Set Path).

* If you want to increase performance even more, uncomment line 35 of str2doubleq.cpp (the one containing #define USE_PARALLEL_ALGORITHM). Remember that you need a sufficiently modern compiler or Boost (http://www.boost.org/) installed.

Now you can use the function in the normal MATLAB fashion. To check it, run the test script test_str_to_double_performance.m (included in the zip file).

MATLAB release MATLAB 7.11 (R2010b)
Comments and Ratings (16)
28 Jan 2014 Christophe Trefois

@Matthias,

This is by design of the isreal function.

From the documentation:
If A has a stored imaginary part of value 0, isreal(A) returns logical 0 (false).

You may however expect that the returned number is not complex when the imaginary part is 0.

22 Apr 2013 Jan Simon

The idea is very good, but the results are not reliable. I cannot recommend using this for production work. A fair rating is not easy in this case, which is why I have hesitated for some years now.

07 Feb 2013 Matthias

Hello Lauri,
there are still some differences:
isreal(str2doubleq('1')) % 0 instead of 1
str2double('2.236')-str2doubleq('2.236') % is not 0 ('2.235' is fine)
str2double('1,1')-str2doubleq('1,1') % 9.9 instead of 0
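One plausible explanation for a tiny nonzero difference such as the '2.236' case is accumulated rounding in a hand-written parser. This is an assumption; the actual cause inside str2doubleq is not shown on this page. The sketch below compares a deliberately naive parser (hypothetical name naive_parse) against the C library's strtod, which typical implementations round correctly.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <cstdlib>

// Hypothetical naive parser for unsigned "digits.digits" input only.
double naive_parse(const char* s) {
    assert(s != nullptr);
    double value = 0.0;
    while (std::isdigit(static_cast<unsigned char>(*s)))
        value = value * 10.0 + (*s++ - '0');
    if (*s == '.') {
        ++s;
        double scale = 0.1;  // 0.1 has no exact binary representation
        while (std::isdigit(static_cast<unsigned char>(*s))) {
            value += (*s++ - '0') * scale;  // every step may round
            scale *= 0.1;
        }
    }
    return value;
}

// Difference between the naive result and strtod for the same text.
double parse_error(const char* s) {
    return naive_parse(s) - std::strtod(s, nullptr);
}
```

Calling parse_error on a few decimal strings shows whether the two approaches agree bit for bit on a given platform; any difference should be on the order of a few units in the last place.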

29 Jan 2013 Matthias

@Lauri, good work...
Some fine tuning is still required in your function:
str2doubleq('') % NaN instead of 0

31 Oct 2012 Zhibo

It'd be superb if you could fix the memory leak problem mentioned by Rob, caused by invoking 'mxArrayToString' without calling 'mxFree' to free the memory later on.

09 Oct 2012 Jan Simon

@Lauri: An excellent speedup! Thanks for this update.
Most of the time in my suggested one-liner in pure M (which looks less cryptic when it is expanded to 3 lines) is wasted by SPRINTF. Using CStr2String (see my FEX page), a smarter pre-allocation allows a faster creation of the long string. This is 20% slower than your str2doubleq.

Isn't it surprising that your parsing is so much faster than ATOF or STRTOD? What could the compiler manufacturers hide in their implementations? The excellent and fast parsing of Google's V8 engine is worth inspecting: http://code.google.com/p/double-conversion . Rounding problems are handled smartly and reliably, which is a very hard and complicated job.

Some fine tuning is required in your function:
str2doubleq('Inf') % NaN instead of Inf
str2doubleq('.i5') % 5 instead of NaN
str2doubleq('i') % 0 instead of 0 + 1i
str2doubleq('1e1.4') % 0.4 instead of NaN
str2doubleq('--1') % -1 instead of NaN
s = '12345678901234567890';
str2doubleq(s) - str2double(s) % 2048
s = '123.123e40';
str2doubleq(s) - str2double(s) % 1.547e26

Malformed input is an evil test, I know. But it would be very good if your very efficient implementation were as reliable as Matlab's STR2DOUBLE.
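The malformed-input cases listed above can be rejected with a small amount of checking around a library parser. The following is a sketch under stated assumptions, not the submission's code: strict_parse is a hypothetical name, it wraps std::strtod, and it handles real values only. It does not cover complex input such as 'i', and strtod also accepts forms like hexadecimal floats that str2double may treat differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>

// Hypothetical strict wrapper: NaN unless the whole string is one real number.
double strict_parse(const char* s) {
    assert(s != nullptr);
    const double nan_value = std::numeric_limits<double>::quiet_NaN();
    char* end = nullptr;
    const double v = std::strtod(s, &end);
    if (end == s) return nan_value;             // nothing parsed: "--1", ".i5"
    while (*end == ' ' || *end == '\t') ++end;  // tolerate trailing blanks
    if (*end != '\0') return nan_value;         // leftover text: "1e1.4"
    return v;
}
```

With this check, 'Inf' yields infinity, while '.i5', '1e1.4' and '--1' all yield NaN.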

09 Oct 2012 Zhibo

Modifying line 105 to
dval *= pow(10, exp);
seems to work. Great job and thanks again!
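For context, a statement like the one above typically sits at the end of a hand-written parser, after the mantissa and the exponent have been read. The sketch below is a guess at that general shape and not the actual source of str2doubleq.cpp. The name parse_sci is hypothetical, the variable is called exponent rather than exp, and no input validation is done.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>

// Hypothetical minimal parser: [sign]digits[.digits][e[sign]digits].
double parse_sci(const char* s) {
    assert(s != nullptr);
    double sign = 1.0;
    if (*s == '+' || *s == '-') { if (*s == '-') sign = -1.0; ++s; }

    // Read all mantissa digits as one integer and count the fractional ones.
    double dval = 0.0;
    int frac_digits = 0;
    while (std::isdigit(static_cast<unsigned char>(*s)))
        dval = dval * 10.0 + (*s++ - '0');
    if (*s == '.') {
        ++s;
        while (std::isdigit(static_cast<unsigned char>(*s))) {
            dval = dval * 10.0 + (*s++ - '0');
            ++frac_digits;
        }
    }

    // Read the optional signed decimal exponent.
    int exponent = 0;
    if (*s == 'e' || *s == 'E') {
        ++s;
        int exp_sign = 1;
        if (*s == '+' || *s == '-') { if (*s == '-') exp_sign = -1; ++s; }
        while (std::isdigit(static_cast<unsigned char>(*s)))
            exponent = exponent * 10 + (*s++ - '0');
        exponent *= exp_sign;
    }

    // Scale once by the combined power of ten.
    dval *= std::pow(10.0, exponent - frac_digits);
    return sign * dval;
}
```

For '123.45e7' the mantissa is read as 12345 with two fractional digits, so the final scale factor is 10^(7-2).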

09 Oct 2012 Zhibo

Thanks for this very useful utility to convert large amounts of text into double arrays. However, the latest (06 Oct 2012) submission no longer works for inputs like str2doubleq('123.45e7').

06 Oct 2012 Quant Guy

I submitted a new version of the function with a much more efficient algorithm and neater code.

I think that the new version (after the review process) is the fastest way string to double conversion can be done under any circumstances. Performance gains have risen from about 20x to about 80x-100x!

Also for Jan: the new version is much faster than your (cryptic) one-liner!

05 Oct 2012 Jan Simon

I still want to stress, that:
Num = reshape(sscanf(sprintf('%s#', CStr{:}), '%g#'), size(CStr))
is about 2.4 times faster than the C-mex approach. I think this is surprising, because creating the large string needs a lot of temporary memory. Obviously Matlab's SSCANF is extremely fast. I guess it avoids the time-consuming conversion from mxChar to char. We could do this in C as well!
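The join-then-scan idea behind the one-liner can be sketched outside MATLAB as well. The version below is only an illustration: parse_joined is a hypothetical name, std::strtod plays the role of SSCANF, and, like the one-liner, it assumes every element is a well-formed real number.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Join all strings with '#' and parse the resulting buffer in a single pass.
std::vector<double> parse_joined(const std::vector<std::string>& cells) {
    std::string buf;
    for (const auto& c : cells) {
        assert(c.find('#') == std::string::npos);  // '#' is the separator
        buf += c;
        buf += '#';
    }

    std::vector<double> out;
    out.reserve(cells.size());
    const char* p = buf.c_str();
    while (*p != '\0') {
        char* end = nullptr;
        const double v = std::strtod(p, &end);
        if (end == p || *end != '#') break;  // malformed element: stop early
        out.push_back(v);
        p = end + 1;                         // step over the '#'
    }
    return out;
}
```

A caller can compare the size of the result with the size of the input to detect that parsing stopped at a malformed element.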

As long as the one-liner in Matlab is faster, I do not find this submission useful.
But it is written nicely and the approach is logical. Therefore I do not give a low rating.

10 Jul 2012 none

just to be clear, for a string < 130 char, str2double is quicker.

10 Jul 2012 none

As per Rob's suggestion above:

// X = STR2DOUBLEQ(S) converts the string S, which should be an
// ASCII character representation of a real value, to MATLAB's double
// representation. The string may contain digits, a decimal point,
// a leading + or - sign and 'e' preceding a power of 10 scale factor
//
// X = STR2DOUBLEQ(C) converts the strings in the cell array of strings C
// to double. The matrix X returned will be the same size as C. NaN will
// be returned for any cell which is not a string representing a valid
// scalar value. NaN will be returned for individual cells in C which are
// cell arrays.

// Examples
// str2doubleq('123.45e7')
// str2doubleq('3.14159')
// str2doubleq({'2.71' '3.1415'})
// str2doubleq({'2.71' '3.1415'; 'abc','123.45e7'})

// NOTE ABOUT ATOF:
// For ultimate performance, the C function atof is the fastest option.
// A word of caution: atof behaves differently from Matlab's str2double when s
// cannot be interpreted as a number.
// For example, the input "2.2a" produces the double 2.2.
// When you know your input always resembles a true numeric value, it is "safe" to use atof.
// This is the case, for example, when you use regexp to capture tokens that are always
// numeric by construction, e.g. (\d+)

#include "mex.h"
#include <string>
#include <sstream>

double string_to_double( const char *s )
{
    // If you uncomment this, make the rest of the code in this function
    // block commented. Please read the note above about atof usage.
    // return atof(s);

    static std::istringstream iss;
    iss.clear(); iss.str(s);
    double x;
    iss >> x;
    if (!(iss && (iss >> std::ws).eof()))
    {
        return mxGetNaN();
    }
    return x;
}

void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[] )
{
    double *writePtr;
    char *strPtr;

    if ( nrhs == 0 )
    {
        mexErrMsgTxt("Too few input arguments");
    }
    else if ( nrhs >= 2 )
    {
        mexErrMsgTxt("Too many input arguments.");
    }

    if ( mxIsChar(prhs[0]) )
    {
        // branch to handle chars
        // get pointer to the beginning of the char
        strPtr = mxArrayToString(prhs[0]);
        // allocate memory to output
        plhs[0] = mxCreateDoubleMatrix(1, 1, mxREAL);
        // set pointer to beginning of the memory
        writePtr = mxGetPr(plhs[0]);

        *(writePtr) = string_to_double(strPtr);
        mxFree(strPtr);
    }
    else if ( mxIsCell(prhs[0]) )
    {
        mwSize mrows, ncols, i;
        mrows = mxGetM( prhs[0] );
        ncols = mxGetN( prhs[0] );
        // allocate memory to results
        plhs[0] = mxCreateDoubleMatrix(mrows, ncols, mxREAL);

        // get pointer to the beginning of array
        writePtr = mxGetPr(plhs[0]);

        for (i = 0; i < mrows*ncols; i++)
        {
            mxArray *Context = mxGetCell(prhs[0], i);
            if ( Context == 0 || !mxIsChar(Context) )
            {
                *(writePtr+i) = mxGetNaN();
            }
            else
            {
                char *strPtr = mxArrayToString(Context);
                if (strPtr != 0)
                {
                    *(writePtr+i) = string_to_double(strPtr);
                }
                else
                {
                    *(writePtr+i) = mxGetNaN();
                }
                mxFree(strPtr);
            }
        }
    }
    else if ( mxIsDouble(prhs[0]) )
    {
        // return vector of NaN's
        mwSize mrows, ncols, i;
        mrows = mxGetM( prhs[0] );
        ncols = mxGetN( prhs[0] );
        if (mrows == 0 && ncols == 0)
        {
            // Case where input is empty array must return NaN value
            mrows = 1; ncols = 1;
        }
        plhs[0] = mxCreateDoubleMatrix(mrows, ncols, mxREAL);
        writePtr = mxGetPr(plhs[0]);
        for (i = 0; i < mrows*ncols; i++)
        {
            *(writePtr+i) = mxGetNaN();
        }
    }
    else
    {
        // case to handle other situations, eg input is a class etc....
        // allocate memory to output
        plhs[0] = mxCreateDoubleMatrix(1, 1, mxREAL);
        // get pointer to the beginning of the allocated memory
        writePtr = mxGetPr(plhs[0]);
        // write NaN to the first element of it
        writePtr[0] = mxGetNaN();
    }
}
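The atof caveat from the header comment of this listing can be checked without MATLAB. The sketch below repeats the stream-based test used in string_to_double, with std::numeric_limits standing in for mxGetNaN, next to a plain atof call. The names stream_parse and lenient_parse are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>
#include <sstream>

// Same acceptance test as string_to_double, minus the MEX dependency:
// the stream must still be valid and nothing but whitespace may remain.
double stream_parse(const char* s) {
    assert(s != nullptr);
    std::istringstream iss(s);
    double x;
    iss >> x;
    if (!(iss && (iss >> std::ws).eof()))
        return std::numeric_limits<double>::quiet_NaN();
    return x;
}

// atof stops at the first character it cannot use and reports no error.
double lenient_parse(const char* s) {
    assert(s != nullptr);
    return std::atof(s);
}
```

For the input "2.2a", lenient_parse returns 2.2 while stream_parse returns NaN, which is the difference the note warns about.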

25 Jul 2011 Rob Ewalds

Excellent utility, thanks!
However, frequent calls to 'str2doubleq' revealed a memory leak:

'mxArrayToString' does not free the dynamic memory that the char pointer points to. Consequently, you should typically free the string (using mxFree) immediately after you have finished using it:
http://amath.colorado.edu/computing/Matlab/OldTechDocs/apiref/mxarraytostring.html

Your code features 2 calls to 'mxArrayToString' (lines 68, 98).
Adding the statement 'mxFree(strPtr);' on lines 75 and 107 and recompiling resolves the leak.

We stumbled upon this while reading a 150.000 lines ASCII file, calling 'str2doubleq' for every line: heavily draining MATLAB's available memory.

Now it works fine, thanks again for this highly useful routine.
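One way to make this kind of leak harder to reintroduce is to tie the buffer to an owning handle, so that it is released on every exit path. The sketch below shows the pattern with std::unique_ptr and a custom deleter. Because mex.h is not available outside MATLAB, fake_array_to_string and fake_free are hypothetical stand-ins for mxArrayToString and mxFree that merely count live allocations; in a real MEX file the deleter would call mxFree instead.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>

// Hypothetical stand-ins so the sketch compiles without MATLAB.
static int g_live_allocations = 0;

char* fake_array_to_string(const char* text) {
    assert(text != nullptr);
    char* copy = static_cast<char*>(std::malloc(std::strlen(text) + 1));
    if (copy != nullptr) { std::strcpy(copy, text); ++g_live_allocations; }
    return copy;
}

void fake_free(char* p) {
    if (p != nullptr) { std::free(p); --g_live_allocations; }
}

// Owning handle: the deleter runs whenever the handle goes out of scope.
struct FakeFree { void operator()(char* p) const { fake_free(p); } };
using OwnedString = std::unique_ptr<char, FakeFree>;

double convert_once(const char* text) {
    OwnedString s(fake_array_to_string(text));
    if (!s) return 0.0;         // allocation failed, nothing to free
    return std::atof(s.get());  // buffer is released when s is destroyed
}

int live_allocations() { return g_live_allocations; }
```

After any number of calls to convert_once, live_allocations() should report zero, which is the property that was missing in the leaking version.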

30 Nov 2010 Brian Emery

More than an order of magnitude faster! Very useful for reading large amounts of text. Clear instructions as well. Thanks for posting this!

30 Sep 2010 Jan Simon

Some further tests with other parsers in your program:
strtod: 0.13 sec
sscanf: 0.16 sec

Another remark: "str2double('2.7i - 3-14')" is confusing as an example: this does not work with str2doubleq.

30 Sep 2010 Jan Simon

Good idea and fast. Therefore it is really useful.

Some remarks:
1. The examples do not use your function, but Matlab's STR2DOUBLE.
2. Why do you treat DOUBLE input differently from other invalid inputs: DOUBLE=>NaN-matrix, SINGLE=>Scalar NaN?
3. Calling the function with uninitialized cell elements causes a NULL-pointer exception: str2double(cell(1, 3)). Strange, but it is helpful to always check for NULL after mxGetCell.
4. The conversion from mxChar (Unicode) to C strings wastes time. Is there a C++ function that can also parse a Unicode string?
5. Please mention in the help section that input cells with >2 dimensions return a matrix. Or let the function return an array with the same dimensions as the input cell.
6. str2doubleq('Inf') returns NaN.
7. If you restrict the input to real values, you can parse a cell string in a different way:
d = reshape(sscanf(sprintf('%s#', c{:}), '%g#'), size(c));
For a {1 x 1000} cell string filled by sprintf('%.15g', rand) I get these timings on Matlab2009a, 1.5GHz PentiumM:
STR2DOUBLE: 2.03 sec
STR2DOUBLEQ: 0.44 sec
SSCANF(SPRINTF()): 0.13 sec
And if you let CStr2String create the long string, it takes just 0.06 sec. Surprising! Your function looks so much more efficient looking at the code. So I assume, the istringstream of my MSVC2008 must be a wreck. I'll try to use the old sscanf in C.

Updates
01 Oct 2010

* Thanks to Jan Simon's feedback, fixed some bugs and documentation. Also gained about a 35% performance boost compared to the earlier implementation.

08 Oct 2012

- Implemented a "hand massaged", highly efficient parser.
- Added support for parsing complex numbers.
- Restructured the code to make it neater.

10 Oct 2012

* Fixed a bug with scientific notation. Thanks for the feedback.
