Why is isfield faster than intersect for strings?

Question

Sebastian Kraemer on 9 Mar 2018

0
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/387324-why-is-isfield-faster-than-intersect-for-strings

Edited: Jan on 10 Mar 2018

I need to intersect moderately sized cells containing strings, say length 5. Using Matlabs function intersect (with or without 'stable') turns out to be up to 10 times slower than the following, quite weird lines of code. And even for large cells, this code is faster (using Matlab 2017a).

 a = {'a1','c2','e3','g4','i5'};
 b = {'a1','b2','c2','d2','e3'};
 % for i = 1:1000
    b2 = [b;b];
    c = struct(b2{:}); 
    cisfieldina = isfield(c,a);
    v = a(cisfieldina); % = intersect(a,b,'stable')
    w = a(~cisfieldina); % = setdiff(a,b,'stable') (basically for free) 
 % end

Intersect can not have such grave overhead. I tried out different ways to use struct, but this seems to be the fastest way, and it is stable as well.

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Jan on 9 Mar 2018

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/387324-why-is-isfield-faster-than-intersect-for-strings#answer_309254

Edited: Jan on 9 Mar 2018

Open in MATLAB Online

As you have found out already, intersect has a remarkable overhead. It checks, if the inputs are sorted, and if they aren't, it sorts them for a faster binary search. For 5 strings this is much slower than a linear search.

For a faster implementation see e.g. the C-mex function FEX: CStrAinBP. This replies the indices of all strings occurring in both cell strings. Here is an equivalent M-code:

function [AI, BI] = CStrAinBP(A, B)
nA = numel(A);
nB = numel(B);
if nA <= nB
   % Collect the index of the first occurrence in B for every A:
   M = zeros(1, nA);
   for iA = 1:nA
      Ind = find(strcmp(A{iA}, B), 1);
      if isempty(Ind) == 0
         M(iA) = Ind(1);
      end
   end
else  % nB <= nA, B is smaller, so it is cheaper to loop over B:
   % Mark every A which equal the current B
   M = zeros(1, nA);
   for iB = nB:-1:1
      M(strcmp(B{iB}, A)) = iB;  % Same as M(find(strcmp()))
   end
end
AI = find(M);  % If any occurrence was found, this A exists...
BI = M(AI);    % at this index in B
end

And if the cell of intersecting objects is needed, use this in the caller:

C = A(AI);

The code can be simplified, if you only need the common elements:

function AB = CStrCommon(A, B)
nA = numel(A);
nB = numel(B);
M  = false(1, nA);
if nA <= nB
   for iA = 1:nA
      M(iA) = any(strcmp(A{iA}, B));
   end     
else  % nB <= nA, B is smaller, so it is cheaper to loop over B:
   for iB = 1:nB
      M(strcmp(A, B{iB})) = true;
   end
end
AB = A(M);
end

UNTESTED CODE Both versions have been adjusted for the publication in the forum, which might have inserted bugs.

Matlab's function for handling sets are very powerful and optimized for huge data sets and handling different input types. But a specific code can be much faster for a small inputs.

6 Comments
Show 4 older commentsHide 4 older comments

Sebastian Kraemer on 10 Mar 2018

Open in MATLAB Online

Sorry Jan if I was not making myself clear (working Friday afternoon is my excuse I guess). I kind of figured out my needs in the process (considering that the original post was just about intersect). I needed a function that takes two cells of strings a and b and outputs the same as intersect(a,b), setdiff(a,b) and setdiff(b,a).

Regarding my previous question, I was just wondering if I could actually overtake your codes exactly as they are, being implemented as mex files. Since that does not seem to be the case, as far as I can see, I indeed modified your code to suit my needs. I settled for this now:

 function [acapb,awob,bwoa,IA,IB] = str_intersect(a,b)
% STR_INTERSECT  Equivalent to intersect(a,b), setdiff(a,b)
% and setdiff(b,a) for string cells a,b
%
%   [acapb,awob,bwoa] = STR_INTERSECT(a,b) returns
%   acapb = intersect(a,b)
%   awob = setdiff(a,b) % if asked for
%   bwoa = setdiff(b,a) % if asked for
%   but without overhead (and relying on correct input).
%
%   [acapb,awob,bwoa,IA,IB] = STR_INTERSECT(a,b) also returns
%   the logical indices IA, IB for which acapb = a(IA) = b(IB).
%
%   See also: INTERSECT, SETDIFF, INT_INTERSECT
    IA = false(size(a));
    IB = false(size(b));
    for j = 1:numel(a)
        cmp = strcmp(a{j},b);
        IA(j) = any(cmp);
        IB = IB | cmp;
    end
    acapb = a(IA);
    if nargout > 1
        awob = a(~IA);
    end
    if nargout > 2
        bwoa = b(~IB);
    end
 end

This is a basic and sufficient implementation for me. Thank you for your input on the problem! (Of course I could ask as well which cell is smaller, but they use to have same size.)

Jan on 10 Mar 2018

Edited: Jan on 10 Mar 2018

@Sebastian: You are welcome. The final code looks straight and tight. I think it is very efficient for small cell strings, where "small" is maybe 100 strings.

There is no need for excuses. I tried to help you efficiently and you have been the one, who needs the solution. I'm glad, if I could help you.

Sign in to comment.

Why is isfield faster than intersect for strings?

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

6 Comments
Show 4 older commentsHide 4 older comments

More Answers (0)

See Also

Categories

Tags

Products

Community Treasure Hunt

Why is isfield faster than intersect for strings?

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

6 Comments Show 4 older commentsHide 4 older comments

More Answers (0)

See Also

Categories

Tags

Products

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

6 Comments
Show 4 older commentsHide 4 older comments