Why is isfield faster than intersect for strings?
2 views (last 30 days)
Show older comments
I need to intersect moderately sized cells containing strings, say length 5. Using Matlabs function intersect (with or without 'stable') turns out to be up to 10 times slower than the following, quite weird lines of code. And even for large cells, this code is faster (using Matlab 2017a).
a = {'a1','c2','e3','g4','i5'};
b = {'a1','b2','c2','d2','e3'};
% for i = 1:1000
b2 = [b;b];
c = struct(b2{:});
cisfieldina = isfield(c,a);
v = a(cisfieldina); % = intersect(a,b,'stable')
w = a(~cisfieldina); % = setdiff(a,b,'stable') (basically for free)
% end
Intersect can not have such grave overhead. I tried out different ways to use struct, but this seems to be the fastest way, and it is stable as well.
0 Comments
Accepted Answer
Jan
on 9 Mar 2018
Edited: Jan
on 9 Mar 2018
As you have found out already, intersect has a remarkable overhead. It checks, if the inputs are sorted, and if they aren't, it sorts them for a faster binary search. For 5 strings this is much slower than a linear search.
For a faster implementation see e.g. the C-mex function FEX: CStrAinBP. This replies the indices of all strings occurring in both cell strings. Here is an equivalent M-code:
function [AI, BI] = CStrAinBP(A, B)
nA = numel(A);
nB = numel(B);
if nA <= nB
% Collect the index of the first occurrence in B for every A:
M = zeros(1, nA);
for iA = 1:nA
Ind = find(strcmp(A{iA}, B), 1);
if isempty(Ind) == 0
M(iA) = Ind(1);
end
end
else % nB <= nA, B is smaller, so it is cheaper to loop over B:
% Mark every A which equal the current B
M = zeros(1, nA);
for iB = nB:-1:1
M(strcmp(B{iB}, A)) = iB; % Same as M(find(strcmp()))
end
end
AI = find(M); % If any occurrence was found, this A exists...
BI = M(AI); % at this index in B
end
And if the cell of intersecting objects is needed, use this in the caller:
C = A(AI);
The code can be simplified, if you only need the common elements:
function AB = CStrCommon(A, B)
nA = numel(A);
nB = numel(B);
M = false(1, nA);
if nA <= nB
for iA = 1:nA
M(iA) = any(strcmp(A{iA}, B));
end
else % nB <= nA, B is smaller, so it is cheaper to loop over B:
for iB = 1:nB
M(strcmp(A, B{iB})) = true;
end
end
AB = A(M);
end
UNTESTED CODE Both versions have been adjusted for the publication in the forum, which might have inserted bugs.
Matlab's function for handling sets are very powerful and optimized for huge data sets and handling different input types. But a specific code can be much faster for a small inputs.
6 Comments
Jan
on 10 Mar 2018
Edited: Jan
on 10 Mar 2018
@Sebastian: You are welcome. The final code looks straight and tight. I think it is very efficient for small cell strings, where "small" is maybe 100 strings.
There is no need for excuses. I tried to help you efficiently and you have been the one, who needs the solution. I'm glad, if I could help you.
More Answers (0)
See Also
Categories
Find more on Characters and Strings in Help Center and File Exchange
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!