# Improve the performance of a function based on str2double

Hi all, I have a function that takes a line of text read from a TXT file. The lines contain information of this type:

LINE1: N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT

LINE2: N3 A3 X1.45 ;TEXT

...

After the ;TEXT there may be more information of the same type that should be ignored, for example:

LINE3: N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44

I give two example lines to try to show that not all lines always contain the same information.

What I want to obtain is a matrix (call it A) holding the values that appear after X, Y or Z, with NaN where a line does not contain that value. For the example, A should be:

A = [5.45 4.45 -10.25

1.45 NaN NaN

-5.5 9.35 -1.5];

The function I am using is coordinatesCHAR, shown below. tline is the line of text in question and matchWords is a cell array, which for this case would be: matchWords = {'X','Y','Z'};

Even when the number of lines is low, the processing time is relatively high, and the text files I am working with have several thousand lines, so this is not practical.

I was able to verify that the slowest calls were str2double and regexp. Does anyone know how I can improve this?

function XYZ = coordinatesCHAR(tline,matchWords)

% Regular expression to find the start (a) and end (b) indices of every number in the line.

[a,b] = regexp(tline,'[+-]?\d+(\.\d+)?');

XYZ = NaN(1,length(matchWords));

for ii = 1:length(matchWords)

isfind = strfind(tline,matchWords{ii});

if ~isempty(isfind) && ~isempty(a) && ~isempty(b)

% If isfind has more than one component take the first position

strPos = find(a == isfind(1)+1);

if isempty(strPos)

XYZ(1,ii) = NaN;

else

XYZ(1,ii) = str2double(tline(a(strPos):b(strPos))); % Get the value up to the next non-numeric character

end

end

end

I searched in different forums and tried using the "str2doubleq" function, but the improvement was minimal.

Thank you so much for all.

### Accepted Answer

dpb
on 30 Jan 2021

Edited: dpb
on 30 Jan 2021

"Deadahead" solution without any attempt to use anything fancy...regular expressions are known to be expensive; I've never compared/timed relative to the new string functions to know where they stack up...

function ret=coordinatesCHAR(tline,vars)

% for input line beginning with text, may have trailing comments after semicolon

ret=nan(1,numel(vars));

if contains(tline,';'), tline=extractBefore(tline,';'); end

t=split(tline);

t=t(contains(t,vars));

v=cellfun(@(s)sscanf(s(2:end),'%f'),t);

ix=contains(vars,cellfun(@(s)s(1),t,'uni',0));

ret(ix)=v;

end

For the sample

>> A=[];

>> for i=1:numel(txt),A=[A;coordinatesCHAR(txt(i),vars)];end

>> A

A =

5.4500 4.4500 -10.2500

1.4500 NaN NaN

-5.5000 9.3500 -1.5000

>>

Revised above tested with

>> txt

txt =

3×1 cell array

{'LINE1: N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT' }

{'LINE2: N3 A3 X1.45 ' }

{'LINE3: N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44'}

>>

i.e., w/o the trailing semicolon on the second line. The leading "LINE" label is immaterial, actually; it just makes the string a little longer, but the logic still works.

>> txt

txt =

3×1 cell array

{'N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT' }

{'N3 A3 X1.45' }

{'N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44'}

>> A=[];

>> for i=1:numel(txt),A=[A;coordinatesCHAR(txt(i),vars)];end

>> A

A =

5.4500 4.4500 -10.2500

1.4500 NaN NaN

-5.5000 9.3500 -1.5000

>>

##### 10 Comments

dpb
on 30 Jan 2021

I knew the cellfun probably would be something; I'm surprised the one w/ sscanf is the top dog there, though.

One can look at how to break those down some; as noted it was definitely all at the highest level first.

### More Answers (1)

dpb
on 31 Jan 2021

Edited: dpb
on 31 Jan 2021

Variations upon a theme -- this is almost 2X as fast as my previous...here it times out as just a fraction ahead of the original; not sure can beat that by much without mex after this experiment; at least nothing comes to me that would be markedly faster.

The high-level overhead of cellfun and the user-friendly string data type functions is all taken out of the following; str2double calls sscanf to do the work, so using it is going backwards (but by surprisingly little) by adding the calling overhead.

regexp pulling tokens turns out to be essentially as fast as using the builtin strfind on each sequentially; that did surprise me somewhat; I wasn't surprised the first try with user-friendly stuff wasn't a performance demon but I expected that getting rid of regexp would show more benefit.

function ret=coordinatesCHAR4(tline,vars)

ret=nan(1,numel(vars));

if contains(tline,';'), tline=extractBefore(tline,';'); end

tline=char(tline);

for i=1:numel(vars)

i1=strfind(tline,vars(i))+1;

if isempty(i1), continue, end

i2=i1+strfind(tline(i1+1:end),' ')-1;

if isempty(i2), i2=length(tline); end

ret(i)=sscanf(tline(i1:i2),'%f');

end

end
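
For comparison with the strfind loop above, the regexp-with-tokens variant mentioned earlier might look roughly like the following. This is only a sketch, not the code that was timed: the pattern, the sample line and the cell form of `vars` are assumptions for illustration, and it assumes each variable is a single letter immediately followed by its number.

```matlab
% Sketch only: sample line and vars are assumed for illustration
tline = 'N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44';
vars  = {'X','Y','Z'};

% drop the trailing comment, if any
k = find(tline == ';', 1);
if ~isempty(k), tline = tline(1:k-1); end

% build the pattern '([XYZ])([+-]?\d+(?:\.\d+)?)' from vars
pat = ['([' vars{:} '])([+-]?\d+(?:\.\d+)?)'];

ret = nan(1, numel(vars));
tok = regexp(tline, pat, 'tokens');   % cell array of {letter, number} pairs
for n = 1:numel(tok)
    ix = strcmp(vars, tok{n}{1});     % column belonging to this letter
    ret(ix) = sscanf(tok{n}{2}, '%f');
end
disp(ret)
```

Note one difference in behavior: if a letter occurs more than once before the semicolon, this sketch keeps the last value, whereas the original function keeps the first.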

One can make just a couple of refinements to the original --

function XYZ = coordinatesCHAR(tline,matchWords)

% Regular expression to find the start (a) and end (b) indices of every number in the line.

XYZ=nan(1,length(matchWords));

[a,b] = regexp(tline,'[+-]?\d+(\.\d+)?');

if isempty(a), return, end % no tokens found; return

for ii = 1:length(matchWords)

isfind = strfind(tline,matchWords{ii});

if isempty(isfind), continue, end

% If isfind has more than one component take the first position

strPos = find(a==isfind(1)+1);

if isempty(strPos), continue, end

XYZ(1,ii)=sscanf(tline(a(strPos):b(strPos)),'%f'); % Get the value up to the next non-numeric character

end

end

The above just rearranges the logical tests a little and eliminates the duplicate storing of a NaN for a missing variable that was in the else clause, since the array has already been initialized.

>> tic;for n=1:10000;for i=1:numel(txt),A=coordinatesCHAR0(txt{i},vars);end;end;toc

Elapsed time is 1.976065 seconds.

>> tic;for n=1:10000;for i=1:numel(txt),A=coordinatesCHAR4(txt{i},vars);end;end;toc

Elapsed time is 1.797489 seconds.

>>

"0" is the modified original above (saved under the name coordinatesCHAR0); "4" is my last submittal, coordinatesCHAR4...
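
The per-call difference between str2double and sscanf mentioned above can be checked in isolation with timeit. A minimal sketch, assuming a sample token of the kind cut out of these lines; absolute numbers depend on the machine and MATLAB release, so no timings are quoted here.

```matlab
% Hypothetical token, as it would be cut out of a line such as 'N1 A2 Z-10.25'
tok = '-10.25';

v1 = str2double(tok);     % general-purpose converter; returns NaN on failure
v2 = sscanf(tok, '%f');   % lower-level formatted read; returns [] on failure

% Rough per-call timings of the two conversions
tStr2double = timeit(@() str2double(tok));
tSscanf     = timeit(@() sscanf(tok, '%f'));
fprintf('str2double: %.3g s   sscanf: %.3g s\n', tStr2double, tSscanf);
```

Because sscanf returns an empty array rather than NaN when nothing can be read, a caller that switches to it may need to guard against that case.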

##### 7 Comments

dpb
on 1 Feb 2021

i2=i1+strfind(tline(i1+1:end),' ')-1;

if isempty(i2), i2=length(tline); end

If you can assure there is at least one blank at the end of the line, the above test/fixup could be eliminated. Whether it would speed up the result much or not I don't know, didn't try it.

I debated adding a blank just to be sure but didn't try that, either...
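
The blank-padding idea could be tried along these lines. This is a sketch that has not been timed: the sample line and the string-array form of `vars` are assumptions, any comment stripping is assumed to have been done already, and taking only the first match `i1(1)` is an addition for safety. Whether the concatenation costs more than the isempty test it removes would have to be measured.

```matlab
% Sketch only: sample line with no trailing blank, vars assumed to be a string array
tline = 'N3 A3 X1.45';
vars  = ["X","Y","Z"];

tline = [char(tline) ' '];   % pad once so every token is followed by a blank
ret = nan(1, numel(vars));
for i = 1:numel(vars)
    i1 = strfind(tline, vars(i)) + 1;    % first character after the letter
    if isempty(i1), continue, end
    i1 = i1(1);                          % use the first occurrence only
    i2 = i1 + strfind(tline(i1+1:end), ' ') - 1;   % never empty thanks to the pad
    ret(i) = sscanf(tline(i1:i2(1)), '%f');
end
disp(ret)
```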
