Asked by Michal Kvasnicka
on 9 May 2018

I am trying to find vectorized matlab function ind = item2ind(item,t) to solve following problem: I have a list of row vectors

item = [2 3 1; 2 1 2; 3 1 1; 1 3 3]

and vector of all possible item elements at each item row vector

t = [1 1 2 2 2 3 3];

I need to find indexes of of separate item rows elements corresponding to the vector t in this way:

ind = [3 6 1; 3 1 4; 6 1 2; 1 6 7]

But

item = [1 1 1]

does not correspond to the vector t, because there are 3 "1" elements, and t contains only 2 "1" elements.

Note: Serial version is inefficient for large item (10000 x 100) and t (1 x 200).

function ind = item2ind(item,t)

[nlp,N] = size(item);

ind = zeros(nlp,N);

for i = 1:nlp

auxitem = item(i,:);

auxt = t;

for j = 1:N

I = find(auxitem(j) == auxt,1,'first');

if ~isempty(I)

auxt(I) = 0;

ind(i,j) = I;

else

error('Incompatible content of item and t.');

end

end

end

end

Additional remarks:

Most of the time is spent on the line:

I = find(auxitem(j) == auxt,1,'first');

Is there any clever trick how to speed up this line of code? I tried this, for example, but without any speedup:

I = ipos(auxitem(j) == auxt); I = I(1);

where ipos is preallocated as:

ipos = 1:length(t);

Thanks in advance for any help ...

Answer by Michal Kvasnicka
on 11 May 2018

Edited by Michal Kvasnicka
on 11 May 2018

Accepted Answer

So far best solution:

function ind = item2ind_new(item,t)

t = t(:);

[m,n] = size(item);

mct = max(accumarray(t,1));

G = accumarray(t,1:length(t),[],@(x) {sort(x)});

G = cellfun(@(x) padarray(x.',[0 mct-length(x)],0,'post'), G, 'UniformOutput', false);

G = vertcat(G{:});

C = cumsum(reshape(item,m,1,n)==item,3);

ia = C(sub2ind(size(C),repelem((1:m).',1,n),repelem(1:n,m,1),repelem(1:n,m,1)));

ind = G(sub2ind(size(G),item,ia));

Any idea how to improve it?

Sign in to comment.

Answer by Jan
on 9 May 2018

Edited by Jan
on 9 May 2018

function ind = item2ind(item, t);

maxRun = length(t) + 1;

[T , TI] = accumsort(t, maxRun);

ind = zeros(size(item));

for k = 1:size(item, 1)

[aItem, aItemI] = accumsort(item(k, :), maxRun);

% [m, index] = ismember(aItem, T);

% Faster with undocumented function:

[m, index] = builtin('_ismemberhelper', aItem, T);

if all(m)

ind(k, aItemI) = TI(index);

else

error('Incompatible item.');

end

end

end

function [T, TI] = accumsort(t, maxRun)

[sortedT, TI] = sort(t);

T = sortedT * maxRun;

c = -1;

for k = 1:numel(T)

if T(k) ~= c

d = 0;

c = T(k);

else

d = d + 1;

end

T(k) = T(k) + d;

end

end

For some test data of size [10'000 x 100] I get a runtime of 0.21 sec instead of 1.3 sec of the original version.

With calling ismember the runtime is 0.41 sec. Internally ismember calls the helper function builtin('_ismemberhelper') for sorted data of type double. If it is known already, that the input is sorted, calling the internal function avoids the overhead.

If you have a C compiler, converting accumsort to a C-mex would be useful.

maxRun must a any number greater than the highest number of repetitions in t. length(t)+1 is guaranteed to be larger.

Michal Kvasnicka
on 10 May 2018

Intel® Core™ i7-6800K Processor + X99 Asus motherboard + 64GB RAM + 240GB SSD Intel Pro disk (system) + 2TB HDD (Data) + GTX TITAN (CUDA) GPU

Ubuntu Linux 16.04 (64bit)

Wick
on 10 May 2018

Michal Kvasnicka
on 10 May 2018

@Jan any progress or new ideas on your site?

Sign in to comment.

Answer by Wick
on 9 May 2018

Edited by Wick
on 9 May 2018

Here you go. At the sizes you suggested, this shouldn't take too long. It has a single 'for' loop that cycles through the unique values of 't'.

I'm using logical indexing to identify all the elements in 'item' that match the given unique 't' and summing across the row. If the sum exceeds the number of times that value showed up in 't' you get your error. Otherwise I'm using 'cumsum' in a creative fashion (in my ever so humble opinion) to provide the indexes back to the location of the unique value in the original vector 't'.

Good Luck!

function ind = item2ind(item,t)

unique_t = unique(t);

ind = zeros(size(item));

try

% a single 'for' loop as long as the unique elements of t

for jj = 1:length(unique_t)

O = zeros(size(item));

O(item == unique_t(jj)) = 1;

positions_of_t = [0 find(t == unique_t(jj))];

% adding zero so sub_index call below will always reference a non-zero element

sub_index = cumsum(O,2) .* O + 1;

ind = ind + positions_of_t(sub_index);

% this is why we needed the 0 in positions_of_t above

end

catch

error('Incompatible content of item and t.');

end

Jan
on 9 May 2018

For your test data:

t = 1:50;

tbig = repmat(t,1,5);

[~,p] = sort(rand(100000,length(tbig)),2);

item = tbig(p);

Your code from the question stops with the error "Incompatible content of item and t." So how can you measure the timings?!

The same for:

t = 1:10;

tbig = repmat(t,1,25);

[~,p] = sort(rand(100000,length(tbig)),2);

item = tbig(p);

I tried it with:

t = repmat(1:50, 1, 5);

t = t(randperm(length(t)));

item = zeros(10000, 100);

for k = 1:10000; item(k, :) = t(randperm(length(t), 100)); end

This let your (and my) item2ind() run successfully.

Do I miss a point?!

Michal Kvasnicka
on 9 May 2018

This is strange, my code perfectly works with both data case examples you mentioned above … ??!!

Wick
on 9 May 2018

Jan,

My code is faster for small length 't' and much, much slower for large 't'. You vectorized in a completely different way than I did (and used an undocumented function but we won't use that against you). My question is, is there some rule of thumb my snippet of code didn't follow that I should change how I code things? I've always felt I was pretty good at vectorizing my MATLAB code but I've been coming here to learn how to be better. Obviously, you know some tricks I don't.

Thanks.

Sign in to comment.

Opportunities for recent engineering grads.

Apply Today
## 6 Comments

## Wick (view profile)

## Direct link to this comment

https://ch.mathworks.com/matlabcentral/answers/399901-find-indices-of-row-subsets#comment_566073

## Michal Kvasnicka (view profile)

## Direct link to this comment

https://ch.mathworks.com/matlabcentral/answers/399901-find-indices-of-row-subsets#comment_566077

## Wick (view profile)

## Direct link to this comment

https://ch.mathworks.com/matlabcentral/answers/399901-find-indices-of-row-subsets#comment_566081

## Michal Kvasnicka (view profile)

## Direct link to this comment

https://ch.mathworks.com/matlabcentral/answers/399901-find-indices-of-row-subsets#comment_566089

## Jan (view profile)

## Direct link to this comment

https://ch.mathworks.com/matlabcentral/answers/399901-find-indices-of-row-subsets#comment_566178

## Michal Kvasnicka (view profile)

## Direct link to this comment

https://ch.mathworks.com/matlabcentral/answers/399901-find-indices-of-row-subsets#comment_566196

Sign in to comment.