Its hard to explain but an example of what I need is:
[3 7 2 2 3 100 7 56] => [1 2 3 3 1 4 2 5]
I have a large vector containing many numbers including duplicates.
There is no order (well ... actually ordering them might be possible
if that helps but its not the default). Not all the numbers between
the minimum and the maximum are necessarily present. I need to map the
numbers in the vector to 1, 2, 3 etc.
The vector is very large, say 2e4 to 2e6 elements. I cant figure out
how to vectorize it. The best I have been able to do is:
% input is vec
uniques = unique(vec); % the mapping
mapped_vec = zeros(length(uniques), 1); % pre-allocate
for i = 1:length(uniques);
mapped_vec(vec == uniques(i)) = i; % replace each
element with its new value from the map
end;
The loop body is accounting for 95% of my execution time. Need to get
some speed. HELP!
Pietro <pietromas@gmail.com> wrote in message <5b74607b-c527-4c9b-
833a-b08d4be5d48a@s13g2000prd.googlegroups.com>...
> Its hard to explain but an example of what I need is:
>
> [3 7 2 2 3 100 7 56] => [1 2 3 3 1 4 2 5]
>
> I have a large vector containing many numbers including duplicates.
> There is no order (well ... actually ordering them might be possible
> if that helps but its not the default). Not all the numbers between
> the minimum and the maximum are necessarily present. I need to map the
> numbers in the vector to 1, 2, 3 etc.
>
> The vector is very large, say 2e4 to 2e6 elements. I cant figure out
> how to vectorize it. The best I have been able to do is:
>
> % input is vec
> uniques = unique(vec); % the mapping
> mapped_vec = zeros(length(uniques), 1); % pre-allocate
> for i = 1:length(uniques);
> mapped_vec(vec == uniques(i)) = i; % replace each
> element with its new value from the map
> end;
>
> The loop body is accounting for 95% of my execution time. Need to get
> some speed. HELP!
help unique
I wonder if its possible that the other returned
arguments might be of interest. The third one
maybe?
"us " <us@neurol.unizh.ch> wrote in message
<fogepf$a4$1@fred.mathworks.com>...
> "John D'Errico":
> <SNIP has a bad hair day...
>
> > I wonder if its possible that the other returned
> > arguments might be of interest. The third one
> > maybe...
>
> hmmmmmm... <john>...
>
> % the user wants
> vu=[3 7 2 2 3 100 7 56];
> ru=[1 2 3 3 1 4 2 5];
>
> % now, we both know that none of <unique>'s outputs
> % will achieve this... in particular: #3 is useless
No. The author stated that the specific order
was not really necessary. Only the behavior.
To refresh your memory...
>> There is no order (well ... actually ordering them might be possible
>> if that helps but its not the default). Not all the numbers between
>> the minimum and the maximum are necessarily present. I need to
>> map the numbers in the vector to 1, 2, 3 etc.
a=[3 7 2 2 3 100 7 56];
[b,I,J] = unique(a);
J =
2 3 1 1 2 5 3 4
Note that the ordering is different, but the
result is exactly as requested. The elements
are mapped to a list of indexes starting with 1
as requested. And, it takes only one function
call.
John D'Errico <woodchips@rochester.rr.com> wrote:
> "us " <us@neurol.unizh.ch> wrote in message
> <fogepf$a4$1@fred.mathworks.com>...
>> "John D'Errico":
>> <SNIP has a bad hair day...
[snip]
Some interesting discussion. Assuming that John is on target here, then
the following is an alternative which works a bit faster for me:
>> vec = ceil(rand(1e6,1)*1e6); % Some random numbers
>> x = zeros(max(vec)-min(vec)+1,1) + nan; % A mapping vector
>> x(vec) = 1; % Find the unique values
>> nu = sum(~isnan(x)); % Count them
>> x(~isnan(x)) = 1:nu; % Index them
>> y = x(vec); % Use the index in reverse
This does rely on the values all being integers, and does take more than
one line of code, but depending on the distribution of the values in vec,
it may give you some improvement in speed.
On 8 Feb, 10:21, tristram.sc...@ntlworld.com (Tristram Scott) wrote:
> John D'Errico <woodch...@rochester.rr.com> wrote:
> > "us " <u...@neurol.unizh.ch> wrote in message
> > <fogepf$a...@fred.mathworks.com>...
> >> "John D'Errico":
> >> <SNIP has a bad hair day...
>
> [snip]
>
> Some interesting discussion. Assuming that John is on target here, then
> the following is an alternative which works a bit faster for me:
>
> >> vec = ceil(rand(1e6,1)*1e6); % Some random numbers
> >> x = zeros(max(vec)-min(vec)+1,1) + nan; % A mapping vector
> >> x(vec) = 1; % Find the unique values
> >> nu = sum(~isnan(x));% Count them
> >> x(~isnan(x)) = 1:nu; % Index them
> >> y = x(vec); % Use the index in reverse
>
> This does rely on the values all being integers, and does take more than
> one line of code, but depending on the distribution of the values in vec,
> it may give you some improvement in speed.
>
> --
> Dr Tristram J. Scott
> Energy Consultant
Outstanding. Thank you both for two fine solutions. Johns is obviously
the nicer but I agree with Dr. Scott that his is significantly faster,
if pig ugly.
With 500 unique values in vec drawn from the range [1, 1e6], the
unique solution runs in about 0.5 and 9 seconds on a vec of size 1e6
and 1e7 respectively. The ugly solution does it in 0.1 and 0.7
respectively. Nice.
tristram.scott@ntlworld.com (Tristram Scott) wrote in message <33Wqj.5343
$j95.4025@newsfe3-win.ntli.net>...
> Some interesting discussion. Assuming that John is on target here, then
> the following is an alternative which works a bit faster for me:
>
> >> vec = ceil(rand(1e6,1)*1e6); % Some random numbers
> >> x = zeros(max(vec)-min(vec)+1,1) + nan; % A mapping vector
> >> x(vec) = 1; % Find the unique values
> >> nu = sum(~isnan(x)); % Count them
> >> x(~isnan(x)) = 1:nu; % Index them
> >> y = x(vec); % Use the index in reverse
>
> This does rely on the values all being integers, and does take more than
> one line of code, but depending on the distribution of the values in vec,
> it may give you some improvement in speed.
>
> Dr Tristram J. Scott
---------
Tristram, I think it should be emphasized that, as you indicated, the
efficiency of the code you suggested depends heavily on how densely the
numbers occur within their range of possible integer values. For example, if
you were to start out with a much more sparsely distributed
vec = ceil(rand(1e6,1)*1e10);
it would require considerably more execution time because of the excessive
length of your vector x, (not to mention presenting some formidable memory
problems,) even though the length of vec is no particular problem.
I still like the use of 'unique' for this kind of problem. It is more robust in
that it doesn't depend on the how concentrated the values are, nor whether
they are integer-valued.
well, well...
since the order does not matter, one might look a bit
further and come up with an even faster (2-line) solution...
the code snippet below shows its performance AND uncovers
that <tc> solutions may get very slow and sometimes be
incorrect(!?)
the code (copy/pasted) from previous posts
function t=foo
fmt={
'%5d %5d = %3.0f'
'%11.3f %11.3f '
'isequal: %-1d %-1d'
};
nt=10;
t=zeros(nt,7);
for k=1:nt
a=ceil(rand(10^4,1)*10^k).'; % integers
% ***** jd *****
tic;
[b,J,J]=unique(a);
t(k,3)=toc;
try % look out for memory problems
% ***** tc *****
vec=a;
tic;
x = zeros(max(vec)-min(vec)+1,1) + nan;
x(vec) = 1;
nu = sum(~isnan(x));
x(~isnan(x)) = 1:nu;
y = x(vec);
t(k,4)=toc;
catch
t(k,4)=nan;
y=J.';
end
% ***** us *****
tic;
[as,ax]=sort(a);
as(ax)=cumsum([1,diff(as)]~=0);
t(k,5)=toc;
% run-time result
t(k,1:2)=[numel(a),numel(b)];
t(k,6:7)=[isequal(J,y.'),isequal(J,as)];
t(k,3:5)=100*t(k,3:5)./t(k,3);
disp(sprintf([fmt{:}],t(k,:)));
end
end
output on a wintel system c2.2*2.4/2gb/vista/r2007b (look
at it in the original version for proper formatting...)
us wrote:
> Pietro:
> <SNIP fast re-indexing...
>
> well, well...
> since the order does not matter, one might look a bit
> further and come up with an even faster (2-line) solution...
> the code snippet below shows its performance AND uncovers
> that <tc> solutions may get very slow and sometimes be
> incorrect(!?)
>
> the code (copy/pasted) from previous posts
>
> function t=foo
> fmt={
> '%5d %5d = %3.0f'
> '%11.3f %11.3f '
> 'isequal: %-1d %-1d'
> };
> nt=10;
> t=zeros(nt,7);
> for k=1:nt
> a=ceil(rand(10^4,1)*10^k).'; % integers
> % ***** jd *****
> tic;
> [b,J,J]=unique(a);
> t(k,3)=toc;
> try % look out for memory problems
> % ***** tc *****
> vec=a;
> tic;
> x = zeros(max(vec)-min(vec)+1,1) + nan;
> x(vec) = 1;
> nu = sum(~isnan(x));
> x(~isnan(x)) = 1:nu;
> y = x(vec);
> t(k,4)=toc;
> catch
> t(k,4)=nan;
> y=J.';
> end
> % ***** us *****
> tic;
> [as,ax]=sort(a);
> as(ax)=cumsum([1,diff(as)]~=0);
> t(k,5)=toc;
> % run-time result
> t(k,1:2)=[numel(a),numel(b)];
> t(k,6:7)=[isequal(J,y.'),isequal(J,as)];
> t(k,3:5)=100*t(k,3:5)./t(k,3);
> disp(sprintf([fmt{:}],t(k,:)));
> end
> end
>
> output on a wintel system c2.2*2.4/2gb/vista/r2007b (look
> at it in the original version for proper formatting...)
>
> 1) using the ~original input
> a=[3 7 2 2 3 100 7 7 56];
> veclen uniquelen jd=100% tc% us% equal(jd,tc) equal(jd,us)
> 9 5 = 100 46.076 13.029 isequal: 1 1
>
> 2) using the loop, however
>
> 10000 10 = 100 43.648 96.747 isequal: 1 1
> 10000 100 = 100 30.893 76.001 isequal: 1 1
> 10000 1000 = 100 38.042 77.902 isequal: 1 1
> 10000 6366 = 100 108.570 70.773 isequal: 1 1
> 10000 9521 = 100 1261.356 68.274 isequal: 0 1 !
> 10000 9950 = 100 12899.902 66.274 isequal: 0 1 !
> 10000 9997 = 100 125642.159 64.475 isequal: 0 1 !
> 10000 10000 = 100 NaN 75.425 isequal: 1 1
> 10000 10000 = 100 NaN 69.354 isequal: 1 1
> 10000 10000 = 100 NaN 73.868 isequal: 1 1
>
> just a few thoughts...
> us
On 8 Feb, 22:46, "us " <u...@neurol.unizh.ch> wrote:
> Pietro:
> <SNIP fast re-indexing...
>
> well, well...
> since the order does not matter, one might look a bit
> further and come up with an even faster (2-line) solution...
> the code snippet below shows its performance AND uncovers
> that <tc> solutions may get very slow and sometimes be
> incorrect(!?)
>
> the code (copy/pasted) from previous posts
>
> function t=foo
> fmt={
> '%5d %5d = %3.0f'
> '%11.3f %11.3f '
> 'isequal: %-1d %-1d'
> };
> nt=10;
> t=zeros(nt,7);
> for k=1:nt
> a=ceil(rand(10^4,1)*10^k).'; % integers
> % ***** jd *****
> tic;
> [b,J,J]=unique(a);
> t(k,3)=toc;
> try % look out for memory problems
> % ***** tc *****
> vec=a;
> tic;
> x = zeros(max(vec)-min(vec)+1,1) + nan;
> x(vec) = 1;
> nu = sum(~isnan(x));
> x(~isnan(x)) = 1:nu;
> y = x(vec);
> t(k,4)=toc;
> catch
> t(k,4)=nan;
> y=J.';
> end
> % ***** us *****
> tic;
> [as,ax]=sort(a);
> as(ax)=cumsum([1,diff(as)]~=0);
> t(k,5)=toc;
> % run-time result
> t(k,1:2)=[numel(a),numel(b)];
> t(k,6:7)=[isequal(J,y.'),isequal(J,as)];
> t(k,3:5)=100*t(k,3:5)./t(k,3);
> disp(sprintf([fmt{:}],t(k,:)));
> end
> end
>
> output on a wintel system c2.2*2.4/2gb/vista/r2007b (look
> at it in the original version for proper formatting...)
>
> 1) using the ~original input
> a=[3 7 2 2 3 100 7 7 56];
> veclen uniquelen jd=100% tc% us% equal(jd,tc) equal(jd,us)
> 9 5 = 100 46.076 13.029 isequal: 1 1
>
> 2) using the loop, however
>
> 10000 10 = 100 43.648 96.747 isequal: 1 1
> 10000 100 = 100 30.893 76.001 isequal: 1 1
> 10000 1000 = 100 38.042 77.902 isequal: 1 1
> 10000 6366 = 100 108.570 70.773 isequal: 1 1
> 10000 9521 = 100 1261.356 68.274 isequal: 0 1 !
> 10000 9950 = 100 12899.902 66.274 isequal: 0 1 !
> 10000 9997 = 100 125642.159 64.475 isequal: 0 1 !
> 10000 10000 = 100 NaN 75.425 isequal: 1 1
> 10000 10000 = 100 NaN 69.354 isequal: 1 1
> 10000 10000 = 100 NaN 73.868 isequal: 1 1
>
> just a few thoughts...
> us
Thanks. That was helpful.
In my particular case I think the dominant factor is the size of the
input vector. I expect the values to be distributed between 1 and 1e7
give or take so this is the range I am testing with. By varying the
length of the vector I got these results (real time shown in brackets)
a = round(rand(10 ^ k, 1) * (1e4 - 1)) + 1; % generate 10^k elements,
with frequency of 10^(k-4)
tmp = round(rand(1, 1e4) * 10 ^ 7) + 1; % distributed between 1 and
10^ 7
a = tmp(a);
clear tmp;
Its hard to tell what to do. My vectors could be a million or two
long, but they could also be much much smaller. I suppose I should
dynamically select a method based on the length of the input.
us wrote:
> Pietro:
> <SNIP fast re-indexing...
>
> well, well...
> since the order does not matter, one might look a bit
> further and come up with an even faster (2-line) solution...
> the code snippet below shows its performance AND uncovers
> that <tc> solutions may get very slow and sometimes be
> incorrect(!?)
>
> the code (copy/pasted) from previous posts
>
> function t=foo
> fmt={
> '%5d %5d = %3.0f'
> '%11.3f %11.3f '
> 'isequal: %-1d %-1d'
> };
> nt=10;
> t=zeros(nt,7);
> for k=1:nt
> a=ceil(rand(10^4,1)*10^k).'; % integers
> % ***** jd *****
> tic;
> [b,J,J]=unique(a);
> t(k,3)=toc;
> try % look out for memory problems
> % ***** tc *****
> vec=a;
> tic;
> x = zeros(max(vec)-min(vec)+1,1) + nan;
> x(vec) = 1;
> nu = sum(~isnan(x));
> x(~isnan(x)) = 1:nu;
> y = x(vec);
> t(k,4)=toc;
> catch
> t(k,4)=nan;
> y=J.';
> end
> % ***** us *****
> tic;
> [as,ax]=sort(a);
> as(ax)=cumsum([1,diff(as)]~=0);
> t(k,5)=toc;
> % run-time result
> t(k,1:2)=[numel(a),numel(b)];
> t(k,6:7)=[isequal(J,y.'),isequal(J,as)];
> t(k,3:5)=100*t(k,3:5)./t(k,3);
> disp(sprintf([fmt{:}],t(k,:)));
> end
> end
>
> output on a wintel system c2.2*2.4/2gb/vista/r2007b (look
> at it in the original version for proper formatting...)
>
> 1) using the ~original input
> a=[3 7 2 2 3 100 7 7 56];
> veclen uniquelen jd=100% tc% us% equal(jd,tc) equal(jd,us)
> 9 5 = 100 46.076 13.029 isequal: 1 1
>
> 2) using the loop, however
>
> 10000 10 = 100 43.648 96.747 isequal: 1 1
> 10000 100 = 100 30.893 76.001 isequal: 1 1
> 10000 1000 = 100 38.042 77.902 isequal: 1 1
> 10000 6366 = 100 108.570 70.773 isequal: 1 1
> 10000 9521 = 100 1261.356 68.274 isequal: 0 1 !
> 10000 9950 = 100 12899.902 66.274 isequal: 0 1 !
> 10000 9997 = 100 125642.159 64.475 isequal: 0 1 !
> 10000 10000 = 100 NaN 75.425 isequal: 1 1
> 10000 10000 = 100 NaN 69.354 isequal: 1 1
> 10000 10000 = 100 NaN 73.868 isequal: 1 1
>
> just a few thoughts...
> us
On 8 Feb, 22:46, "us " <u...@neurol.unizh.ch> wrote:
> Pietro:
> <SNIP fast re-indexing...
>
> well, well...
> since the order does not matter, one might look a bit
> further and come up with an even faster (2-line) solution...
> the code snippet below shows its performance AND uncovers
> that <tc> solutions may get very slow and sometimes be
> incorrect(!?)
>
> the code (copy/pasted) from previous posts
>
> function t=foo
> fmt={
> '%5d %5d = %3.0f'
> '%11.3f %11.3f '
> 'isequal: %-1d %-1d'
> };
> nt=10;
> t=zeros(nt,7);
> for k=1:nt
> a=ceil(rand(10^4,1)*10^k).'; % integers
> % ***** jd *****
> tic;
> [b,J,J]=unique(a);
> t(k,3)=toc;
> try % look out for memory problems
> % ***** tc *****
> vec=a;
> tic;
> x = zeros(max(vec)-min(vec)+1,1) + nan;
> x(vec) = 1;
> nu = sum(~isnan(x));
> x(~isnan(x)) = 1:nu;
> y = x(vec);
> t(k,4)=toc;
> catch
> t(k,4)=nan;
> y=J.';
> end
> % ***** us *****
> tic;
> [as,ax]=sort(a);
> as(ax)=cumsum([1,diff(as)]~=0);
> t(k,5)=toc;
> % run-time result
> t(k,1:2)=[numel(a),numel(b)];
> t(k,6:7)=[isequal(J,y.'),isequal(J,as)];
> t(k,3:5)=100*t(k,3:5)./t(k,3);
> disp(sprintf([fmt{:}],t(k,:)));
> end
> end
>
> output on a wintel system c2.2*2.4/2gb/vista/r2007b (look
> at it in the original version for proper formatting...)
>
> 1) using the ~original input
> a=[3 7 2 2 3 100 7 7 56];
> veclen uniquelen jd=100% tc% us% equal(jd,tc) equal(jd,us)
> 9 5 = 100 46.076 13.029 isequal: 1 1
>
> 2) using the loop, however
>
> 10000 10 = 100 43.648 96.747 isequal: 1 1
> 10000 100 = 100 30.893 76.001 isequal: 1 1
> 10000 1000 = 100 38.042 77.902 isequal: 1 1
> 10000 6366 = 100 108.570 70.773 isequal: 1 1
> 10000 9521 = 100 1261.356 68.274 isequal: 0 1 !
> 10000 9950 = 100 12899.902 66.274 isequal: 0 1 !
> 10000 9997 = 100 125642.159 64.475 isequal: 0 1 !
> 10000 10000 = 100 NaN 75.425 isequal: 1 1
> 10000 10000 = 100 NaN 69.354 isequal: 1 1
> 10000 10000 = 100 NaN 73.868 isequal: 1 1
>
> just a few thoughts...
> us
Thanks. That was helpful.
In my particular case I think the dominant factor is the size of the
input vector. I expect the values to be distributed between 1 and 1e7
give or take so this is the range I am testing with. By varying the
length of the vector I got these results (real time shown in brackets)
a = round(rand(10 ^ k, 1) * (1e4 - 1)) + 1; % generate 10^k elements,
with frequency of 10^(k-4)
tmp = round(rand(1, 1e4) * 10 ^ 7) + 1; % distributed between 1 and
10^ 7
a = tmp(a);
clear tmp;
Its hard to tell what to do. My vectors could be a million or two
long, but they could also be much much smaller. I suppose I should
dynamically select a method based on the length of the input.
Public Submission Policy
NOTICE: Any content you submit to MATLAB Central, including personal information, is not subject to the protections which may be afforded information collected under other sections of The MathWorks, Inc. Web site. You are entirely responsible for
all content that you upload, post, e-mail, transmit or otherwise make available via MATLAB Central. The MathWorks does not control the content posted by visitors to MATLAB Central and, does not guarantee the accuracy, integrity, or quality of such content.
Under no circumstances will The MathWorks be liable in any way for any content not authored by The MathWorks, or any loss or damage of any kind incurred as a result of the use of any content posted, e-mailed, transmitted or otherwise made available
via MATLAB Central. Read the complete Disclaimer prior to use.