Vectorizing multiple string comparison

Paolo Binetti on 26 Jan 2017
Commented: Paolo Binetti on 28 Jan 2017
Is there a way to significantly speed up this loop, perhaps by vectorizing it? Inputs in attachment. I do not have a Matlab version with "string" functions.
d = a';
for i = 1:numel(a)
d{i} = c(strcmp(a{i}, b), :);
end
I tried working outward from the inner part with cellfun, but either I am not getting it right or it is not the right approach:
aux = cellfun(@strcmp, a, b); % does not work
Paolo Binetti on 27 Jan 2017
You are right. R2016 does not run on the PC I mostly use, an old beast which still works perfectly, but on XP. So until I buy a new computer, I am stuck with either a much older version of Matlab or Octave, which does run on XP. I could have generated the input with my older Matlab. And your answer below gives me one more motivation to buy a new computer soon!

Guillaume on 26 Jan 2017
One obvious minor speed-up is to get rid of the find that serves absolutely no purpose. You can directly use the logical vector returned by strcmp:
d{i} = c(strcmp(a{i}, b), :);
For some reason, I cannot load your mat file. I'm going to assume that a is a cell array of strings, and so is b (otherwise the loop would not be needed). Assuming that there are no repeated strings in b:
assert(numel(unique(b)) == numel(b), 'This code does not work when there are duplicate values in b');
d = cell(size(a))';
[isfound, loc] = ismember(a, b);
d(isfound) = num2cell(c(loc(isfound), :), 2); % one row of c per cell (assumes c is a matrix with one row per element of b)
If it's guaranteed that all elements of a are found in b, then you can simplify even further to:
assert(numel(unique(b)) == numel(b), 'This code does not work when there are duplicate values in b');
[isfound, loc] = ismember(a, b);
assert(all(isfound), 'The next line only works if all elements of a are in b');
d = num2cell(c(loc, :), 2);
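As a small illustration of the ismember approach, here is a sketch on made-up data (the variable contents are hypothetical, not the poster's attachment). It assumes c is a char matrix with one row per element of b and that b has no duplicates:

```matlab
% Hypothetical sample data; b contains no duplicates.
a = {'AAG', 'CTA', 'GAT'};
b = {'GAT', 'AAG', 'TTC', 'CTA'};
c = ['ATT'; 'AGA'; 'TCT'; 'TAA'];   % row k of c belongs to b{k}

% loc(i) is the position of a{i} in b, or 0 when a{i} is not in b.
[isfound, loc] = ismember(a, b);

% Start with empty cells, then fill the found entries with one row of c each.
d = cell(numel(a), 1);
d(isfound) = num2cell(c(loc(isfound), :), 2);
```

Here loc comes out as [2 4 1], so d holds 'AGA', 'TAA' and 'ATT' in that order. Any element of a that is missing from b would simply be left as an empty cell.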
Guillaume on 27 Jan 2017
Edited: Guillaume on 27 Jan 2017
According to Walter, your mat file is an octave file that matlab can't open.
If there are duplicate values in b, then you don't have a choice but to use a loop, either explicitly as you have done or with cellfun:
d = cellfun(@(aa) c(strcmp(aa, b), :), a, 'UniformOutput', false);
It's very possible that the cellfun may be slower than the explicit loop (due to the anonymous function call).
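If you want to check which of the two is faster on your machine, a rough tic/toc comparison along these lines should do. The data here is made up (a short b with a duplicate, repeated 500 times) and the timings will vary between Matlab and Octave versions, so treat it only as a sketch:

```matlab
% Hypothetical data: b contains duplicates and is repeated to make timing measurable.
a = {'AAG', 'AGA', 'TCT'};
b = repmat({'AAG', 'TCT', 'AGA', 'TCT'}, 1, 500);
c = repmat(['AGA'; 'CTC'; 'GAT'; 'CTA'], 500, 1);   % row k of c belongs to b{k}

% Explicit loop.
tic;
d_loop = cell(numel(a), 1);
for i = 1:numel(a)
    d_loop{i} = c(strcmp(a{i}, b), :);
end
t_loop = toc;

% Same computation with cellfun and an anonymous function.
tic;
d_cf = cellfun(@(aa) c(strcmp(aa, b), :), a(:), 'UniformOutput', false);
t_cf = toc;

fprintf('loop: %.4f s, cellfun: %.4f s\n', t_loop, t_cf);
```

Both versions should produce identical cell arrays; only the elapsed times differ. A single tic/toc pass is noisy, so run it a few times before drawing conclusions.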
edit: in matlab R2016b there is an extremely easy way to vectorise the string comparison, using the new string class:
string(a) == string(b)'
but you'd still need a loop or cellfun afterward to create the d cell array:
d = cellfun(@(r) c(r, :), num2cell(string(a) == string(b)', 1), 'UniformOutput', false)

Walter Roberson on 27 Jan 2017
ismember can be used between cell arrays of strings. The two-output version can be used to find the indices, which you can then use to index into c.
Paolo Binetti on 28 Jan 2017
I had a feeling I was missing an obvious point. Thank you for pointing it out! The modified code, below, runs much faster. I tried to vectorize the remainder of the loop, to no avail, but the costly string comparison at least is out of the loop.
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];
[temp, idx] = ismember(b, a);
d = a';
for i = 1:numel(a)
d{i} = c(i == idx, :);
end
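One possible way to remove the remaining loop, offered only as a sketch: sort the rows of c by their index into a, count how many rows belong to each element of a, and let mat2cell cut the sorted matrix into blocks. The helper names (tf, rows, grp, order, counts) are mine. It relies on sort being stable, so that rows sharing an index keep their original order. I have not timed it against the loop, so whether it is actually faster needs checking.

```matlab
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];

[tf, idx] = ismember(b, a);       % idx(k) is the position of b{k} in a (0 if absent)
rows = find(tf);                  % rows of c whose b entry occurs in a
[grp, order] = sort(idx(tf));     % group those rows by their index into a
counts = accumarray(grp(:), 1, [numel(a) 1]);   % number of rows per element of a

% Split the reordered rows of c into one block per element of a.
d = mat2cell(c(rows(order), :), counts, size(c, 2));
```

On this data d{1} is ['AGA';'AGA'] and d{8} is ['CTC';'CTA'], the same as the loop gives. If some element of a never occurs in b, its count is zero; as far as I know mat2cell then returns an empty block for it, but that is worth verifying on your version.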