Thread Subject: String parsing

Subject: String parsing

From: Heywood

Date: 4 Feb, 2009 18:03:02

Message: 1 of 4

I've run into a parsing task that is driving me nuts. I have a string like this:

   line = '125746.100,A,010.0600,N,01000.31,E,0.00,0.00,020506,,,A'

and would like to parse it into a vector like this:

   [125746.1 65 10.06 78 1000.31 69 0 0 20506 NaN NaN 65]

that is, floats and integers get parsed directly, letters get parsed as their ASCII values, and null fields (consecutive delimiters) get parsed as NaNs. It's perfectly OK for the result vector to be all doubles -- so no problem rendering integers as floats. But because this task will be parsing huge files, this needs to be as fast as possible.

The tricky thing is that nulls can appear in fields that, when populated, are either floats or characters. So simply replacing ',,' with something like ',NaN,' using STRREP won't work, since the parsing will stop at the first place where a %c specifier encounters a NaN.

The fastest almost-working solution I've found so far is along the lines of

   sscanf(line,'%f,%c,%f,%c,%f,%c,%f,%f,%f,%f,%c,%c')'

but that stops parsing at the first null field (after 020506). Replacing the %c specifiers with %s doesn't help.
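To see the early stop concretely, the second output of SSCANF reports how many fields were actually converted. This is a minimal sketch using the same sample string (stored here in a variable called str, a name chosen only to avoid shadowing the built-in LINE function):

```matlab
% Same sample string as above.
str = '125746.100,A,010.0600,N,01000.31,E,0.00,0.00,020506,,,A';

% The tenth specifier (%f) meets a comma instead of a number, so the
% scan ends there. The characters read by %c come back as their numeric
% codes because the format mixes %c with %f.
fmt = '%f,%c,%f,%c,%f,%c,%f,%f,%f,%f,%c,%c';
[vals, count] = sscanf(str, fmt);

% count is the number of fields converted before the mismatch.
fprintf('Converted %d of 12 fields\n', count);
disp(vals.');
```

Only the nine fields up to and including 020506 are returned; the two empty fields and the final A are never reached.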

The only fully-working solution I've found so far is:

   dummy = textscan(regexprep(line(8:end-3),',',char(1)),'%s','delimiter',char(1));
   ind = find(~strcmpi(dummy{:}','') & isnan(str2double(dummy{:}')));
   result = str2double(dummy{:})'; result(ind) = double(cell2mat(dummy{1}(ind)));

... but a tic/toc timing test shows this to be about 60X slower than attempts using just SSCANF and/or TEXTSCAN.

Can anyone suggest a faster way to accomplish the above, without all the find/str2double/cell2mat gymnastics?

Gratefully,

HJ

Subject: String parsing

From: Jos

Date: 4 Feb, 2009 18:39:02

Message: 2 of 4

"Heywood " <heywoodj123@yahoo.com> wrote in message <gmcl8m$5mq$1@fred.mathworks.com>...
> I've run into a parsing task that is driving me nuts. I have a string like this:
>
> line = '125746.100,A,010.0600,N,01000.31,E,0.00,0.00,020506,,,A'
>
> and would like to parse it into a vector like this:
>
> [125746.1 65 10.06 78 1000.31 69 0 0 20506 NaN NaN 65]
>
> that is, floats and integers get parsed directly, letters get parsed as their ASCII values, and null fields (consecutive delimiters) get parsed as NaNs. It's perfectly OK for the result vector to be all doubles -- so no problem rendering integers as floats. But because this task will be parsing huge files, this needs to be as fast as possible.
>
> The tricky thing is that nulls can appear in fields that, when populated, are either floats or characters. So simply replacing ',,' with something like ',NaN,' using STRREP won't work, since the parsing will stop at the first place where a %c specifier encounters a NaN.
>
> The fastest almost-working solution I've found so far is along the lines of
>
> sscanf(line,'%f,%c,%f,%c,%f,%c,%f,%f,%f,%f,%c,%c')'
>
> but that stops parsing at the first null field (after 020506). Replacing the %c specifiers with %s doesn't help.
>
> The only fully-working solution I've found so far is:
>
> dummy = textscan(regexprep(line(8:end-3),',',char(1)),'%s','delimiter',char(1));
> ind = find(~strcmpi(dummy{:}','') & isnan(str2double(dummy{:}')));
> result = str2double(dummy{:})'; result(ind) = double(cell2mat(dummy{1}(ind)));
>
> ... but a tic/toc timing test shows this to be about 60X slower than attempts using just SSCANF and/or TEXTSCAN.
>
> Can anyone suggest a faster way to accomplish the above, without all the find/str2double/cell2mat gymnastics?
>
> Gratefully,
>
> HJ

Does this give you something to work on?


STR = '1.100,A,010.0600,N,010.31,E,0.00,0.00,0206,,,A' ;
s = strread(STR,'%s','delimiter',',') ; % read in as strings (using e.g., textread)

f = str2double(s) ; % retrieve floats and integers
q = isnan(f) & ~cellfun('isempty',s) ; % position of ascii characters
f(q) = [s{q}] ;

% result
f.'

hth
Jos

Subject: String parsing

From: Matt Fig

Date: 4 Feb, 2009 18:50:17

Message: 3 of 4

line = '125746.100,A,010.0600,N,01000.31,E,0.00,0.00,020506,,,A'
OUT = sscanf(line,'%f,%c,%f,%c,%f,%c,%f,%f,%f,%c%c%c,%c')';
OUT(OUT==44)=NaN
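
A commented sketch of how this one-liner appears to work, as best I can tell from tracing the format against the sample string (the variable names str and isComma are illustrative only). The literal comma after the ninth field consumes the first of the three consecutive commas, the next two %c specifiers read the remaining two commas as characters (code 44), and the third %c reads the final A:

```matlab
str = '125746.100,A,010.0600,N,01000.31,E,0.00,0.00,020506,,,A';

% Mixed %f/%c formats return one numeric vector, with each character
% converted to its character code.
OUT = sscanf(str, '%f,%c,%f,%c,%f,%c,%f,%f,%f,%c%c%c,%c').';

% The two empty fields show up as the code for ',' (44); mark them as NaN.
isComma = (OUT == double(','));
OUT(isComma) = NaN;

disp(OUT);
```

Two assumptions are worth stating: the trick depends on both of those fields being empty (populated fields would be consumed one character at a time by the %c specifiers), and any genuine numeric value of exactly 44 elsewhere in the record would also be turned into NaN.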






Subject: String parsing

From: Heywood

Date: 5 Feb, 2009 19:38:02

Message: 4 of 4

"Jos " <#10584@fileexchange.com> wrote :

> Does this give you something to work on?
>
>
> STR = '1.100,A,010.0600,N,010.31,E,0.00,0.00,0206,,,A' ;
> s = strread(STR,'%s','delimiter',',') ; % read in as strings (using e.g., textread)
>
> f = str2double(s) ; % retrieve floats and integers
> q = isnan(f) & ~cellfun('isempty',s) ; % position of ascii characters
> f(q) = [s{q}] ;

Yes, that's exactly what I was trying to do -- thanks! I informally timed STRREAD against TEXTSCAN, and the latter is about 10-15% faster, even though you have to do s = s{:} afterward to get the same result. I hadn't known about CELLFUN before -- that's very helpful too.
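
For reference, a minimal sketch of the TEXTSCAN variant described here, applied to the original sample string. The variable names str, c and result are illustrative, and the sketch assumes every non-numeric field is exactly one character long:

```matlab
str = '125746.100,A,010.0600,N,01000.31,E,0.00,0.00,020506,,,A';

% textscan returns a 1x1 cell wrapping the cell array of field strings,
% hence the extra unwrapping step compared with strread.
c = textscan(str, '%s', 'Delimiter', ',');
s = c{1};

f = str2double(s);                       % numbers; NaN for letters and empty fields
q = isnan(f) & ~cellfun('isempty', s);   % non-empty fields that are not numbers
f(q) = double([s{q}]);                   % single letters -> character codes

result = f.';
disp(result);
```

Empty fields stay NaN because str2double returns NaN for them and the isempty test keeps them out of q.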

Cheers,

/HJ
