Discover MakerZone

MATLAB and Simulink resources for Arduino, LEGO, and Raspberry Pi

Learn more

Discover what MATLAB® can do for your career.

Opportunities for recent engineering grads.

Apply Today

Thread Subject:
Getting indexes of rows of matrix with more than n repetitions

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Sudheer Tumu

Date: 5 Jan, 2010 03:57:02

Message: 1 of 56

Hi All,

I have two dimensional matrix with more than 10 million rows.
I want to get the indexes of rows which are repeted more than n times. n should be passed as argument.

Eg: A=[1 2
           3 4
           5 6
           1 2
           1 2
           3 4
           1 2
           5 6
           7 8
           1 2
           3 4]
I want result like this by passing n=3 as parameter
Result=[1 2 4 5 6 7 10 11] Since [ 1 2] and [3 4] are the only rows which are repeated more than 3 times. Is there any way to do this without using loops.
I am using loops which is taking nearly 30 mins to get the solution.

This is what I am doing:
for i=min(A(:,1)):max(A(:,1))
    a1=find(A(:,1)==i);
    for j=min(A(:,2)):max(A(:,2))
        a2=find(A(a1(:),2)== j);
        
        if size(a2,1) >= n
        end
   end
end


Thank you,
Sudheer Tumu.

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 5 Jan, 2010 08:34:04

Message: 2 of 56

"Sudheer Tumu" <tumusudheer@gmail.com> wrote in message <hhudae$4p$1@fred.mathworks.com>...
> Hi All,
>
> I have two dimensional matrix with more than 10 million rows.
> I want to get the indexes of rows which are repeted more than n times. n should be passed as argument.
>
> Eg: A=[1 2
> 3 4
> 5 6
> 1 2
> 1 2
> 3 4
> 1 2
> 5 6
> 7 8
> 1 2
> 3 4]
> I want result like this by passing n=3 as parameter
> Result=[1 2 4 5 6 7 10 11] Since [ 1 2] and [3 4] are the only rows which are repeated more than 3 times. Is there any way to do this without using loops.
> I am using loops which is taking nearly 30 mins to get the solution.
>
> This is what I am doing:
> for i=min(A(:,1)):max(A(:,1))
> a1=find(A(:,1)==i);
> for j=min(A(:,2)):max(A(:,2))
> a2=find(A(a1(:),2)== j);
>
> if size(a2,1) >= n
> end
> end
> end
>
>
> Thank you,
> Sudheer Tumu.

one of the solutions

     n=3;
     v=[
          1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
          2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
     ].';
     [vu,vx,vx]=unique(v,'rows');
     vn=histc(vx,1:max(vx));
     ix=find(ismember(vx,find(vn>=n))).'
% ix = 1 2 4 5 6 7 10 11

us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Sudheer Tumu

Date: 5 Jan, 2010 18:58:05

Message: 3 of 56

"us " <us@neurol.unizh.ch> wrote in message <hhuths$led$1@fred.mathworks.com>...
> "Sudheer Tumu" <tumusudheer@gmail.com> wrote in message <hhudae$4p$1@fred.mathworks.com>...
> > Hi All,
> >
> > I have two dimensional matrix with more than 10 million rows.
> > I want to get the indexes of rows which are repeted more than n times. n should be passed as argument.
> >
> > Eg: A=[1 2
> > 3 4
> > 5 6
> > 1 2
> > 1 2
> > 3 4
> > 1 2
> > 5 6
> > 7 8
> > 1 2
> > 3 4]
> > I want result like this by passing n=3 as parameter
> > Result=[1 2 4 5 6 7 10 11] Since [ 1 2] and [3 4] are the only rows which are repeated more than 3 times. Is there any way to do this without using loops.
> > I am using loops which is taking nearly 30 mins to get the solution.
> >
> > This is what I am doing:
> > for i=min(A(:,1)):max(A(:,1))
> > a1=find(A(:,1)==i);
> > for j=min(A(:,2)):max(A(:,2))
> > a2=find(A(a1(:),2)== j);
> >
> > if size(a2,1) >= n
> > end
> > end
> > end
> >
> >
> > Thank you,
> > Sudheer Tumu.
>
> one of the solutions
>
> n=3;
> v=[
> 1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
> 2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
> ].';
> [vu,vx,vx]=unique(v,'rows');
> vn=histc(vx,1:max(vx));
> ix=find(ismember(vx,find(vn>=n))).'
> % ix = 1 2 4 5 6 7 10 11
>
> us

Hi us,

Thank you very very much, Your code is working great.

Now It is taking 30 sec . It is great when It is compared to 1300 sec. Is there any fast way to do than taking 30 sec.
One more doubt, how did you know all these fast way doing things on Mtalab. Did you know all these by practice or is there any other way to learn things on own, like any material like etc ..

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 6 Jan, 2010 00:12:05

Message: 4 of 56

Dear Sudheer!

> > n=3;
> > v=[
> > 1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
> > 2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
> > ].';
> > [vu,vx,vx]=unique(v,'rows');
> > vn=histc(vx,1:max(vx));
> > ix=find(ismember(vx,find(vn>=n))).'
> > % ix = 1 2 4 5 6 7 10 11
> >
> > us

> Now It is taking 30 sec . It is great when It is compared to 1300 sec. Is there any fast way to do than taking 30 sec.

If the values are integers e.g. between 0 and 16, it might be an advantage to combine the 2 columns before running the method posted by us:
  v2 = v(:,1) + 17 * v(:, 2);
Then UNIQUE does not need the 'rows' flag and sorting might become faster.
There is some potential in us' method: UNIQUE and ISMEMBER sort their input (and perhaps HISTC does also?). So it would be faster to sort the large array once and operate with the sorted index.

Kind regards, Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 07:19:06

Message: 5 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hi0kgl$c56$1@fred.mathworks.com>...
unning the method posted by us:
> v2 = v(:,1) + 17 * v(:, 2);
> Then UNIQUE does not need the 'rows' flag and sorting might become faster.
> There is some potential in us' method: UNIQUE and ISMEMBER sort their input (and perhaps HISTC does also?). So it would be faster to sort the large array once and operate with the sorted index.

Jan,

HISCT is a dichotomy insertion method, which would be O(N*log(M)) in complexity where N and M are respectively the number of data and edges. So HISTC is faster than SORT when M<N in theory.
 
A better way for large array is ACCUMARRAY, see a thread here: http://www.mathworks.co.uk/matlabcentral/newsreader/view_thread/252639

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 07:48:02

Message: 6 of 56

"Sudheer Tumu"
> Now It is taking 30 sec . It is great when It is compared to 1300 sec. Is there any fast way to do than taking 30 sec.

one of the solutions
- replace ISMEMBER by ISMEMBC...

     n=3;
     v=[
          1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
          2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
     ].';
     [vu,vx,vx]=unique(v,'rows');
     vn=histc(vx,1:max(vx));
% ix=find(ismember(vx,find(vn>=n))).' % <- ISMEMBER
     ix=find(ismembc(vx,find(vn>=n))).' % <- ISMEMBC
% ix = 1 2 4 5 6 7 10 11

% timing on a wintel sys ic2/2*2.6gzh/2gb/winxp.sp3.32/r2009b
% using the previous IX
tic; for i=1:10000; ix=find(ismember(vx,find(vn>=n))); end; t1=toc
tic; for i=1:10000; ix=find(ismembc(vx,find(vn>=n))); end; t2=toc
% t1 = 0.3857
% t2 = 0.082618

the big bottleneck, though, is the FIND, which you need given your task...

us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 07:49:06

Message: 7 of 56

"Jan Simon"
> > > n=3;
> > > v=[
> > > 1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
> > > 2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
> > > ].';
> > > [vu,vx,vx]=unique(v,'rows');
> > > vn=histc(vx,1:max(vx));
> > > ix=find(ismember(vx,find(vn>=n))).'
> > > % ix = 1 2 4 5 6 7 10 11

> Then UNIQUE does not need the 'rows' flag and sorting might become faster.
> There is some potential in us' method: UNIQUE and ISMEMBER sort their input (and perhaps HISTC does also?). So it would be faster to sort the large array once and operate with the sorted index.

??
us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jos (10584)

Date: 6 Jan, 2010 08:04:07

Message: 8 of 56

"us " <us@neurol.unizh.ch> wrote in message <hi1f7i$plv$1@fred.mathworks.com>...
> "Sudheer Tumu"
> > Now It is taking 30 sec . It is great when It is compared to 1300 sec. Is there any fast way to do than taking 30 sec.
>
> one of the solutions
> - replace ISMEMBER by ISMEMBC...
>
> n=3;
> v=[
> 1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
> 2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
> ].';
> [vu,vx,vx]=unique(v,'rows');
> vn=histc(vx,1:max(vx));
> % ix=find(ismember(vx,find(vn>=n))).' % <- ISMEMBER
> ix=find(ismembc(vx,find(vn>=n))).' % <- ISMEMBC
> % ix = 1 2 4 5 6 7 10 11
>
> % timing on a wintel sys ic2/2*2.6gzh/2gb/winxp.sp3.32/r2009b
> % using the previous IX
> tic; for i=1:10000; ix=find(ismember(vx,find(vn>=n))); end; t1=toc
> tic; for i=1:10000; ix=find(ismembc(vx,find(vn>=n))); end; t2=toc
> % t1 = 0.3857
> % t2 = 0.082618
>
> the big bottleneck, though, is the FIND, which you need given your task...
>
> us

.. avoid FIND ...

n=3;
v=[
    1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
    2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
].';
[vu,vx,vx]=unique(v,'rows');

idx1 = 1:size(vu,1) ;
vn=histc(vx,idx1)
idx2 = 1:size(v,1)
t3 = idx2(ismembc(vx, idx1(vn>=n)))

I leave the accurate testing for speed to others ...

Jos

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 09:02:06

Message: 9 of 56

"Jos (10584) "
> .. avoid FIND ...
> idx1 = 1:size(vu,1) ;
> vn=histc(vx,idx1)
> idx2 = 1:size(v,1)
> t3 = idx2(ismembc(vx, idx1(vn>=n)))
> I leave the accurate testing for speed to others ...
> Jos

well, it does not look too different...
- and even a bit better for FIND...
- on a wintel sys ic2/2*2.6ghz/2gb/winxp.sp3.32/r2009b
  the test function below yields these result

     7.1716 % <- JOS: logical indexing
     6.9788 % <- US : find
     102.76 % <- %gain

function t=foo
% the data
     nt=100000; % <- #runs
     nr=100; % <- #repetitions of original V
     n=3;
     v=[
          1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
          2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
     ].';
     t=nan(3,1);
     v=repmat(v,nr,1);
     [vu,vx,vx]=unique(v,'rows');
% JOS
     tic;
for i=1:nt
     r1=f1(n,v,vu,vx);
end
     t(1)=toc;
% US
     tic;
for i=1:nt
     r2=f2(n,v,vu,vx);
end
     t(2)=toc;
     t(3)=100*t(1)/t(2);
     isequal(r1(:),r2(:)); % <- TRUE
     disp(t);
end
% JOS
function ix=f1(n,v,vu,vx)
     idx1=1:size(vu,1) ;
     idx2=1:size(v,1);
     vn=histc(vx,idx1);
     ix=idx2(ismembc(vx, idx1(vn>=n)));
end
% US
function ix=f2(n,v,vu,vx)
     vn=histc(vx,1:max(vx));
     ix=find(ismembc(vx,find(vn>=n)));
end

just a thought...
us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jos (10584)

Date: 6 Jan, 2010 09:36:23

Message: 10 of 56

"us " <us@neurol.unizh.ch> wrote in message <hi1jie$4bp$1@fred.mathworks.com>...
> "Jos (10584) "
> > .. avoid FIND ...
> > idx1 = 1:size(vu,1) ;
> > vn=histc(vx,idx1)
> > idx2 = 1:size(v,1)
> > t3 = idx2(ismembc(vx, idx1(vn>=n)))
> > I leave the accurate testing for speed to others ...
> > Jos
>
> well, it does not look too different...
> - and even a bit better for FIND...

I already thought so ...

Jos

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 6 Jan, 2010 12:05:21

Message: 11 of 56

Dear Sudheer!

> > > Now It is taking 30 sec . It is great when It is compared to 1300 sec. Is there any fast way to do than taking 30 sec.

> > [vu,vx,vx]=unique(v,'rows');
> > vn=histc(vx,1:max(vx));
> > ix=find(ismembc(vx,find(vn>=n))).' % <- ISMEMBC

For some test data this is faster than UNIQUE/HISTC/ISMEMBC:

% ---------------------------------------- 8< ---------------------
function Index = test2(x, Len)
[m, n] = size(x);

% Inlined SORTROWS:
[v, ind1] = sort(x(:, 2));
[v, ind2] = sort(x(ind1, 1));
ind = ind1(ind2);
y = x(ind, :);

% Logical vector, TRUE for new values, FALSE for repeated values:
d(2:m) = any(diff(y, 1, 1), 2);
d(1) = true;

% Determine distance of TRUE values:
Dist = diff([find(d), m + 1]);
List(ind) = Dist(cumsum(d));
Index = find(List >= Len);
% ------------- >8 ------------------------------------------------

Perhaps this is faster for the real data.
But again: 1. If the values of the 2 columns could be combined, the sorting is much faster. 2. If the function is called multiple times for the same data, sorting it once promisses more speed than any advanced tricks inside the routine.

It would be nice to have a function SORTIND with replying just the sorting index. This would save time in many of my functions. Is this implemented already?

Kind regards, Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jos (10584)

Date: 6 Jan, 2010 13:25:20

Message: 12 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message
* snip *
> It would be nice to have a function SORTIND with replying just the sorting index. This would save time in many of my functions. Is this implemented already?


Hi Jan,

Do you mean the equivalent of the following function

function SortIndex = sortind(varargin)
[Dummy, SortIndex] = sort(varargin{:})

Jos

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 13:48:05

Message: 13 of 56

"Jos (10584) " <#10584@fileexchange.com> wrote in message <hi2300$6dk$1@fred.mathworks.com>...
> "Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message
> * snip *
> > It would be nice to have a function SORTIND with replying just the sorting index. This would save time in many of my functions. Is this implemented already?
>
>
> Hi Jan,
>
> Do you mean the equivalent of the following function
>
> function SortIndex = sortind(varargin)
> [Dummy, SortIndex] = sort(varargin{:})
>
> Jos

Of course, newer Matlab has the "titled" syntax for optional input/output arguments, and this can be applied for all functions.

I tend not to use it because of the backward incompatible.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 6 Jan, 2010 14:12:04

Message: 14 of 56

Dear Jos!

> Do you mean the equivalent of the following function
>
> function SortIndex = sortind(varargin)
> [Dummy, SortIndex] = sort(varargin{:})
>
> Jos

Exactly. Creating [Dummy] wastes time in many of my functions.
SORT was a MEX in Matlab 6.5 (with C-source code), so it should be easy to omit the output. Nevertheless I'm unsure ever, if this collides with the license conditions...

Kind regards, Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jos (10584)

Date: 6 Jan, 2010 15:08:02

Message: 15 of 56

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hi24al$3ig$1@fred.mathworks.com>...
> "Jos (10584) " <#10584@fileexchange.com> wrote in message <hi2300$6dk$1@fred.mathworks.com>...
> > "Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message
> > * snip *
> > > It would be nice to have a function SORTIND with replying just the sorting index. This would save time in many of my functions. Is this implemented already?
> >
> >
> > Hi Jan,
> >
> > Do you mean the equivalent of the following function
> >
> > function SortIndex = sortind(varargin)
> > [Dummy, SortIndex] = sort(varargin{:})
> >
> > Jos
>
> Of course, newer Matlab has the "titled" syntax for optional input/output arguments, and this can be applied for all functions.
>
> I tend not to use it because of the backward incompatible.
>
> Bruno

I agree with you on this, Bruno. I'll wait till I have no more access to old versions before using the tilde.

However, I doubt using the tilde is faster than using the dummy variable, since the memory for this variable is created within sort itself. I do not think that memory is actually being copied to the dummy variable.

Jos

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Matt Fig

Date: 6 Jan, 2010 17:00:21

Message: 16 of 56

"Jos (10584) " <#10584@fileexchange.com> wrote in message <hi290i$dsp$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hi24al$3ig$1@fred.mathworks.com>...
> > "Jos (10584) " <#10584@fileexchange.com> wrote in message <hi2300$6dk$1@fred.mathworks.com>...
> > > "Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message
> > > * snip *
> > > > It would be nice to have a function SORTIND with replying just the sorting index. This would save time in many of my functions. Is this implemented already?
> > >
> > >
> > > Hi Jan,
> > >
> > > Do you mean the equivalent of the following function
> > >
> > > function SortIndex = sortind(varargin)
> > > [Dummy, SortIndex] = sort(varargin{:})
> > >
> > > Jos
> >
> > Of course, newer Matlab has the "titled" syntax for optional input/output arguments, and this can be applied for all functions.
> >
> > I tend not to use it because of the backward incompatible.
> >
> > Bruno
>
> I agree with you on this, Bruno. I'll wait till I have no more access to old versions before using the tilde.
>
> However, I doubt using the tilde is faster than using the dummy variable, since the memory for this variable is created within sort itself. I do not think that memory is actually being copied to the dummy variable.
>
> Jos



Though it shouldn't make any speed difference internally, is there a reason you don't avoid creating the dummy variable in the first place? I.E.,

[idx,idx] = sort(A);

instead of

[dummy,idx] = sort(A);

To me they are equally readable, but that could be only because I am used to doing this. Obviously this technique doesn't work when it is the SECOND return argument that is the dummy.

Just curious.

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 17:14:20

Message: 17 of 56

"Matt Fig"
> Though it shouldn't make any speed difference internally, is there a reason you don't avoid creating the dummy variable in the first place? I.E.,
> [idx,idx] = sort(A);
> instead of
> [dummy,idx] = sort(A);
>
> To me they are equally readable, but that could be only because I am used to doing this. Obviously this technique doesn't work when it is the SECOND return argument that is the dummy.
> Just curious.

this has been shown many a times in CSSM...
- i'll always use
     [foo,foo,foo,goo,goo]=f();
  approach
- the newly introduced ~ - in my view - is a waste of precious syntax resources...

     format debug;
     [a,b]=max([1,2]) % <- creates TWO vars with diff addresses
     [c,c]=max([1,2]) % <- both have the same address indep of their position
%{
a =
     Structure address = afcc988
     pr = 2031f3b0
b =
     Structure address = afcf2a8 % <- ~= A (or, typically named DUMMY)
     pr = 2031f750 % <- ~= A (...)
c =
     Structure address = afcaab0
     pr = 2031ee10
c =
     Structure address = afcaab0 % <- == 1.st C
     pr = 2031ee10 %< == 1.st C
%}

us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Matt Fig

Date: 6 Jan, 2010 17:53:04

Message: 18 of 56

"us " <us@neurol.unizh.ch> wrote in message
> this has been shown many a times in CSSM...
> - i'll always use
> [foo,foo,foo,goo,goo]=f();
> approach


I know you use this approach, as I do. I was asking why other folks prefer to create a dummy variable in the workspace.


> - the newly introduced ~ - in my view - is a waste of precious syntax resources...
>


I agree, except in the case I mentioned previously ;-).

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 18:24:04

Message: 19 of 56

"Matt Fig" <spamanon@yahoo.com> wrote in message <hi2im0$fbd$1@fred.mathworks.com>...
> "us " <us@neurol.unizh.ch> wrote in message
> > this has been shown many a times in CSSM...
> > - i'll always use
> > [foo,foo,foo,goo,goo]=f();
> > approach
>
>
> I know you use this approach, as I do. I was asking why other folks prefer to create a dummy variable in the workspace.

Matt and us, I for once prefer the "dummy" approach (followed by a "clear" statement). It's just more readable to me (and I force myself to use different names for different variables). Any eloquence argument to convince me otherwise?

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 18:55:22

Message: 20 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hi25nk$5lo$1@fred.mathworks.com>...

>
> Exactly. Creating [Dummy] wastes time in many of my functions.

Jan, how much do you estimate an inplace sorting (e.g., on large double array) would save time?

I'm quite happy with the performance of matlab SORT, except I wish to have a sorting routine where the comparison operator can be customized - but I guess such feature is very inefficient in Matlab due to overhead.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 6 Jan, 2010 19:50:19

Message: 21 of 56

Dear Matt!

> > [foo,foo,foo,goo,goo]=f();
> I know you use this approach, as I do. I was asking why other folks prefer to create a dummy variable in the workspace.

The dummy is created with this also, but immediately overwritten!

If I have a giantic vector, which nearly fills my memory, and I want to sort it with:
  [dummy, index] = sort(Array);
or
  [index, index] = sort(Array);
in both cases the array is created (any different opinions?)!
In the case of cell strings, the sorted array contains shared data copied - fortuantely. But if I never need the sorted array, it would be nice to have a SORTIND, which replies just the index vector. Is anybody willing to publish a MEX wrapper for a quicksort???

That [index] and [index] have the same address is not really surprising. It is the same variable. You do not need a FORMAT DEBUG for that.

I failed: Matlab 6.5 has not a sort.c, but sortcellchar.c. This exists in Matlab 7.8 also, but is not documented. Lukily this solves my question for cell strings.
The MEX function sortrowsc.c is not an alternative, because it is 6 times slower than SORT if applied to a single column (why?!).

If I show "[index, index] = sort(Array)" to a Matlab beginner, I have to explain, that I assume, that the output arguments of a function are assigned from the left to the right.
If I show "[dummy, index] = sort(Array)", I do not have to explain anything.

Kind regards, Jan

BTW. Has the OP tested my try to solve the problem faster than UNIQUE/HISTC/ISMEMBC ?

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 21:01:21

Message: 22 of 56

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hi2kg4$b73$1@fred.mathworks.com>...
> "Matt Fig" <spamanon@yahoo.com> wrote in message <hi2im0$fbd$1@fred.mathworks.com>...
> > "us " <us@neurol.unizh.ch> wrote in message
> > > this has been shown many a times in CSSM...
> > > - i'll always use
> > > [foo,foo,foo,goo,goo]=f();
> > > approach
> >
> >
> > I know you use this approach, as I do. I was asking why other folks prefer to create a dummy variable in the workspace.
>
> Matt and us, I for once prefer the "dummy" approach (followed by a "clear" statement). It's just more readable to me (and I force myself to use different names for different variables). Any eloquence argument to convince me otherwise?
>
> Bruno

bruno - NO: here we go...
- i know: EVAL(!)... and nested TRY/CATCH...
- but YOU will easily understand...
- wintel sys ic2/2*2.6ghz/2gb/winxp.sp3.32/r2009b...

% create FOO.M
function foo(varargin)
% 1) subroutine GOO returns 2 vars in diff memory locations
% 2) fill FOO WS with 2 vars
% 3) if memory overflows...
% 4) clear all WS vars
% 5) fill FOO WS with 1 var at same memory address

     nt=10^8; % <- #var to create...
if ~nargin
     n=1000000; %#ok (r2009b)
else
     n=varargin{1}; %#ok (r2009b)
end
% fill FOO WS with individual var A_xxx/B_xxx
try
for i=1:nt
     com=sprintf('[a_%d,b_%d]=goo(n);',i,i);
     eval(com);
end
catch %#ok
     w=whos('b*'); % <- only count B_xxx
     disp(sprintf('ERROR at index %5d %5d',i,numel(w)));
     clear a* b*; % <- CLEAR VARS
% fill FOO WS with individual var B_xxx
try
for i=1:nt
     com=sprintf('[b_%d,b_%d]=goo(n);',i,i);
     eval(com);
end
catch %#ok
     w=whos('b*'); % <- only count B_xxx
     disp(sprintf('ERROR at index %5d %5d',i,numel(w)));
end
end
end
function [a,b]=goo(varargin) %#ok
     a=zeros(1,varargin{1},'double');
     b=a;
     b(1)=b(1)+1; % <- allocate new memory...
end

% at the command prompt
     clear all; % <- !!!!!
     foo(100000)
%{
ERROR at index 214 213
ERROR at index 427 426
%}
     foo(1000000)
%{
ERROR at index 17 16
ERROR at index 33 32
%}

hence, [DUMMY,VAR] is taxing the WS, whilst [VAR,VAR] is NOT (as much)...
just a thought...
urs

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 6 Jan, 2010 21:07:02

Message: 23 of 56

Dear Bruno!

> Jan, how much do you estimate an inplace sorting (e.g., on large double array) would save time?

It is clear, that the creation of the sort index is the demanding part of SORT, while the copy of the input in sorted order is secondary - usually. Unfortuantely one of the 2 DIMM ports of my computer is damaged and I have to live with 512MB RAM. But saving temporarily used memory is always useful, even on a 16 GB machine. Nevertheless, timing matters for 8MB array already:

For the estimation of the speed gain:
  x = rand(1e6, 1);
  tic; y = sort(x); toc ==> 0.26 sec
  tic; [y, s] = sort(x); toc ==> 0.40 sec
  tic; y = x(s); toc ==> 0.14 sec
  
So I assume I could save 35% computing time.
I assume, that SORT sorts the values inplace and the sorting index is created simultaneously, if 2 outputs are used.
Sorting a INT16 array is much faster:
  x = uint16(x * 32000);
  tic; y = sort(x); toc ==> 0.16 sec
  tic; [y, s] = sort(x); toc ==> 0.31 sec
Creating the sorting index needs additional 0.15 sec as in the DOUBLE case above.

If the replied index vector could be an UINT32 array, this would be even nicer:
  ints = uint32(s);
  tic; y = x(ints); toc ==> 0.09 sec (instead of 0.14 sec)

> I'm quite happy with the performance of matlab SORT, except I wish to have a sorting routine where the comparison operator can be customized - but I guess such feature is very inefficient in Matlab due to overhead.

SORT is really fine, you are right! But this is not a reason to avoid improving it.
If you would create a MEX, which replies the sorting index (perhaps in a chosable type), which uses the standard comparison or optionally a user-defined operator, which can compete in speed with single-output SORT -- *I* would download it! Promised.
I assume, calling a user-defined operator through mexCallMATLAB would be a great brake.

Kind regards, Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 21:18:02

Message: 24 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hi2phr$5nj$1@fred.mathworks.com>...
> Dear Matt!
>
> > > [foo,foo,foo,goo,goo]=f();
> > I know you use this approach, as I do. I was asking why other folks prefer to create a dummy variable in the workspace.

> If I have a giantic vector, which nearly fills my memory, and I want to sort it with:
> [dummy, index] = sort(Array);
> or
> [index, index] = sort(Array);
> in both cases the array is created (any different opinions?)!

yes, see my reply (including FOO test program) to bl...
the point is
- if INDEX|1,2 is very large within the function, it will fail...
- if INDEX|1,2 is large and adds to the callers WS, it will fail a bit later...

> That [index] and [index] have the same address is not really surprising. It is the same variable. You do not need a FORMAT DEBUG for that.

no, of course, but it is nice to convince some ML agnostics by sheer command window output...

> If I show "[index, index] = sort(Array)" to a Matlab beginner, I have to explain, that I assume, that the output arguments of a function are assigned from the left to the right.
> If I show "[dummy, index] = sort(Array)", I do not have to explain anything.

well... NO mercy on this one: i know you've (probably) tried to explain to a C-novice an inscrutable pointer-construct...
they just have to learn...

SO - old CSSMers will stick with the
     [foo,foo,foo,goo,goo]=f();
syntax...

:-)
urs

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 21:25:22

Message: 25 of 56

But us, isn't the test unfair when the variable a_i is not properly cleared?

I add the "clear" command and both syntax fails at the same places (see the new foo below).

%%%%%%%

function foo(varargin)
% 1) subroutine GOO returns 2 vars in diff memory locations
% 2) fill FOO WS with 2 vars
% 3) if memory overflows...
% 4) clear all WS vars
% 5) fill FOO WS with 1 var at same memory address

nt=10^8; % <- #var to create...
if ~nargin
    n=1000000; %#ok (r2009b)
else
    n=varargin{1}; %#ok (r2009b)
end
% fill FOO WS with individual var A_xxx/B_xxx
try
    for i=1:nt
        com=sprintf('[a_%d,b_%d]=goo(n);',i,i);
        eval(com);
        % Add by Bruno
        com=sprintf('clear a_%d',i);
        eval(com);
    end
catch %#ok
    w=whos('b*'); % <- only count B_xxx
    disp(sprintf('ERROR at index %5d %5d',i,numel(w)));
    clear a* b*; % <- CLEAR VARS
    % fill FOO WS with individual var B_xxx
    try
        for i=1:nt
            com=sprintf('[b_%d,b_%d]=goo(n);',i,i);
            eval(com);
        end
    catch %#ok
        w=whos('b*'); % <- only count B_xxx
        disp(sprintf('ERROR at index %5d %5d',i,numel(w)));
    end
end
end
function [a,b]=goo(varargin) %#ok
a=zeros(1,varargin{1},'double');
b=a;
b(1)=b(1)+1; % <- allocate new memory...
end

% Command line
>> foo(1000000)
ERROR at index 176 175
ERROR at index 176 175
>> foo(10000000)
ERROR at index 16 15
ERROR at index 16 15
>>

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 6 Jan, 2010 21:29:06

Message: 26 of 56

Dear us!

> for i=1:nt
> com=sprintf('[a_%d,b_%d]=goo(n);',i,i);
> eval(com);
> end

As far as I understand, Bruno talks of:
  for i=1:nt
       com=sprintf('[a_%d,b_%d]=goo(n); clear a_%d; ', i, i, i);
       eval(com);
  end

There should still be a tiny advantage for the [a_i, a_i] approach, because the firstly assigned variable is cleared before the second is created.
Beside this argument, I *like* [dummy, b_i] more, because it tells, what it does and reflects my intentions. So it's more a psychological decision :-)

Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 21:31:21

Message: 27 of 56

"us " <us@neurol.unizh.ch> wrote in message <hi2uma$77g$1@fred.mathworks.com>...

>
> SO - old CSSMers will stick with the
> [foo,foo,foo,goo,goo]=f();
> syntax...
>

Well, new member(s) still not entirely convinced... (see my reply to you us)

Btw, do we consider Jos as considered as junior or senior member? I consider myself and Jan as senior members - well relatively to some other. ;-)

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 21:48:04

Message: 28 of 56

btw, does the behavior when calling

[foo foo foo] = bar()

is documented? What I mean is: do we have the blessing of Mathworks?

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 21:54:03

Message: 29 of 56

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hi2v42$535$1@fred.mathworks.com>...
> But us, isn't the test unfair when the variable a_i is not properly cleared?
>
> I add the "clear" command and both syntax fails at the same places (see the new foo below).

bruno - well: NO(!, sorry)...
this were just the points - and:
- let's dissect the stuff once more

1) GOO creates A and B at different mem locs (A & B=A+1)...
2) if their size is too big, GOO itself will fail (trivial)...
3) but let's keep in mind: there's a CALLER's workspace (WS) as well: FOO
4) IT (foo) is filled with a lot of different vars [A_x/B_x] created by
    GOO (again, not failing by itself in each call)
5) however, FOO (now slowly filling up) is failing earlier if BOTH vars are put into its WS

altogether, the snippet intended to simulate a function's WS which is filled by calls to many (in real life: different) subroutines returning a lot of DUMMY(ie)s...
naturally, the (keen) programmer would have to take care of this by adding a CLEAR DUMMY statement after EACH call (because it's there, in the mem and contaminating any subsequent request in a subfunction!) ... quite tedious and ...unlovely...

here, we deal with massive vol data (coming from fmri studies) - and better believe me, we simply could not work with the [a,b] approach...

just a few more pedestrian thoughts...
us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 22:03:04

Message: 30 of 56

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hi2vf9$r8f$1@fred.mathworks.com>...
> "us " <us@neurol.unizh.ch> wrote in message <hi2uma$77g$1@fred.mathworks.com>...
>
> >
> > SO - old CSSMers will stick with the
> > [foo,foo,foo,goo,goo]=f();
> > syntax...
> >
>
> Well, new member(s) still not entirely convinced... (see my reply to you us)
>
> Btw, do we consider Jos as considered as junior or senior member? I consider myself and Jan as senior members - well relatively to some other. ;-)
>
> Bruno

bruno - wording was OLD, not senior/junior with respect to # of CSSM appearances/ML age... and simply referring to the fact that someone like me (53y) is, well... biologically... very old...

i guess i'm slowly getting to old to hang around in this (lovely) NG any longer...
us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 22:20:21

Message: 31 of 56

"us " <us@neurol.unizh.ch> wrote in message <hi30pr$nku$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message
>
> bruno - well: NO(!, sorry)...
> this were just the points - and:
> - let's dissect the stuff once more
>
> 1) GOO creates A and B at different mem locs (A & B=A+1)...
> 2) if their size is too big, GOO itself will fail (trivial)...

Agree until now, and this common behavior stands regardless how GOO is called. Are you agree?

> 3) but let's keep in mind: there's a CALLER's workspace (WS) as well: FOO

Stop right there!!! Workspaces are important in one aspects:
- visibility
- and possibility the stack memory of the main thread

But data are still stored in a big common computer memory pool, regardless which workspace they belong.

> 4) IT (foo) is filled with a lot of different vars [A_x/B_x] created by
> GOO (again, not failing by itself in each call)

No. A_x will share the data in a quantum laps of time with the intern variable B (which anyway created).
 
> 5) however, FOO (now slowly filling up) is failing earlier if BOTH vars are put into its WS

No extra filling up if CLEAR is called after A_x is created and unmodified.
 
>
> altogether, the snippet intended to simulate a function's WS which is filled by calls to many (in real life: different) subroutines returning a lot of DUMMY(ie)s...
> naturally, the (keen) programmer would have to take care of this by adding a CLEAR DUMMY statement after EACH call (because it's there, in the mem and contaminating any subsequent request in a subfunction!) ... quite tedious and ...unlovely...

Beauty is in the eye of the beholder.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 6 Jan, 2010 22:27:02

Message: 32 of 56

Dear Bruno!

> > SO - old CSSMers will stick with the
> > [foo,foo,foo,goo,goo]=f();
> > syntax...
> >
>
> Well, new member(s) still not entirely convinced... (see my reply to you us)

> Btw, do we consider Jos as considered as junior or senior member? I consider myself and Jan as senior members - well relatively to some other. ;-)

I find Jos' first post in Feb 1997 ==> senior.
I'm posting since May 2004 ==> mid aged.
But google groups claims, that Bruno starts posting July 2008 ?!
Bruno: what have you done before July 2008? What have you been waiting for?

Of course, all CSSMers are welcome to use the foofoofoogoogoo.
And I really did not want to start a discussion about taste or the wise insertion of CLEAR commands.
My question was simply, if it wouldn't be nice to create a SORT function, which replies just the needed output. Or in the terms of Us:
  Old CSSMers: [foo,foo,foo,goo,goo]=f();
  Fresh CSSMers: [foo, goo]=f2();

Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 6 Jan, 2010 22:42:03

Message: 33 of 56

"Bruno Luong"
> Beauty is in the eye of the beholder.
> Bruno

well then, no matter how we turn and twist this thingy
- the previous snippet FOO simply shows the ML community...

     [a,a]=f();
     [b,b]=g();
% is advantageous when compared to
     [D,a]=f();
     [D,b]=g(); % <- G has to deal with the useless waste of memory by D...
% and certainly more elegant than
     [D,a]=f();
     clear D; % <- a line of simple code
     [D,b]=g();
     clear D; % <- yet another line of simple code

the cliché 'beauty and the eye': whilst ok in real life - often get's one into a predicament in computer programming...

again, just a thought...
us

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 6 Jan, 2010 23:45:23

Message: 34 of 56

"us " <us@neurol.unizh.ch> wrote in message <hi33jr$o4r$1@fred.mathworks.com>...
> "Bruno Luong"
> > Beauty is in the eye of the beholder.
> > Bruno
>
> well then, no matter how we turn and twist this thingy
> - the previous snippet FOO simply shows the ML community...
>
> [a,a]=f();

Some complementary:

%% The proper "beautiful" code is

% We don't need the first output.
% When the same variable is repeated on the lhs, the final value taken
% corresponds to the most right, i.e. second output of f() here, since Matlab
% affects from left to right out the output list (Attention: this convenstion
% is opposite direction of arabic reading) .
% Note: this is not documented, programmers who maintain this code
% need to check this behavior stands consistently when for any newer Matlab
% releases (twice a year) from now until year 3000.
% Also beware unexpected results when you do something like replacing
% all instant of "a" by "a(end+1)" using editor replacement feature.
[a,a]=f();

%%%%%%%%%%
%% The "ugly" code is

[D,a]=f();
clear D % not used

Bruno ;-)

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Matt Fig

Date: 7 Jan, 2010 00:22:06

Message: 35 of 56

Wow, I didn't realize what a hornets nest my simple question would kick up. ;-)

Bruno,
From the documentation:

"Functions return output values in the order in which the corresponding
output variables appear in the function definition line within the M-file."

So for M-Files, at least, there is some documentation. Also, even Loren admits to using this technique, at least prior to the ~ introduction! But for non-M-Files, you may have a point about the assumption that this behavior will persist.

I guess, to *some* degree, ([a,b], clear a) or ([b,b]) is personal preference.

Subject: Getting indexes of rows of matrix with more than n repetitions

From: us

Date: 7 Jan, 2010 00:57:03

Message: 36 of 56

"Bruno Luong"
> %% The proper "beautiful" code is
> % We don't need the first output.
> % When the same variable is repeated on the lhs, the final value taken
> % corresponds to the most right, i.e. second output of f() here, since Matlab
> % affects from left to right out the output list (Attention: this convenstion
> % is opposite direction of arabic reading) .
> % Note: this is not documented, programmers who maintain this code
> % need to check this behavior stands consistently when for any newer Matlab
> % releases (twice a year) from now until year 3000.

smile...
bruno, i guess we leave it for now in saying:
TMW
- will NEVER EVER (or better, even:) CANNOT change this order of assignment...
  why: because the whole syntax would break down
          a=foo()
          [a,b]=foo()
  now: imagine all the changes in the code that needed to be made if indeed
          [b,a]=foo()
  and, therefore,
          b=foo()
  thus forcing changes within the engines...
- hence, the intro of the tilde op was nothing but a big, useless blunder
  meant to deal with this very issue dealt with by CSSMers for decades already...
  [unfortunately, as said above, the ~ sign now occupies a precious syntax item:
   let's just hope that MLbbers are wise enough not to use this in order to
   prevent having to wake-up one day to read the horrible release note saying:
   sorry guys, the tilde now is used for something else - and really- useful]

as i said: with a smile
urs

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Matt Fig

Date: 7 Jan, 2010 04:16:03

Message: 37 of 56

"us " <us@neurol.unizh.ch> wrote in message
> - hence, the intro of the tilde op was nothing but a big, useless blunder
> meant to deal with this very issue dealt with by CSSMers for decades already...
> as i said: with a smile
> urs

Urs,

Except, as I said, when the dummy variable is the second returned output. In this case, there is no such syntax to keep the first returned arg and overwrite the second as in:

[a,a] = find(magic(5)==19) % We cannot get the row location only!

In this case, the tilde or clearing (as Bruno prefers) are the only choice. Here I would rather use the tilde!

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Doug Schwarz

Date: 7 Jan, 2010 04:49:32

Message: 38 of 56

In article <hi3bgv$bc1$1@fred.mathworks.com>,
 "us " <us@neurol.unizh.ch> wrote:

[snip]

> bruno, i guess we leave it for now in saying:
> TMW
> - will NEVER EVER (or better, even:) CANNOT change this order of
> assignment...
> why: because the whole syntax would break down
> a=foo()
> [a,b]=foo()
> now: imagine all the changes in the code that needed to be made if indeed
> [b,a]=foo()
> and, therefore,
> b=foo()
> thus forcing changes within the engines...
> - hence, the intro of the tilde op was nothing but a big, useless blunder
> meant to deal with this very issue dealt with by CSSMers for decades
> already...
> [unfortunately, as said above, the ~ sign now occupies a precious syntax
> item:
> let's just hope that MLbbers are wise enough not to use this in order to
> prevent having to wake-up one day to read the horrible release note
> saying:
> sorry guys, the tilde now is used for something else - and really- useful]
>
> as i said: with a smile
> urs


I usually use something like

  [unused,index] = sort(...)

because my intention is self-documented. Rarely do I clear unused
afterwards. If it was especially large I would.

Certainly we can always count on the output arguments matching the order
that they are listed in the function definition. The thing that is
undocumented, and therefore dangerous, is counting on the temporal order
being the same as the listed order. If TMW chooses to change MATLAB so
that in

  [x_sorted,index] = sort(...)

the second argument is assigned before the first then that will surely
break

  [index,index] = sort(...)

and I'm not willing to take the risk of having to change all my old code
-- especially the code that is in the possession of my clients! If it's
not documented behavior I don't do it. Period.

However, it's easy enough to wrap some of this up in a function:

------------------------------------------------------------
function varargout = argout(outargs,fcn,varargin)
% argout: Return selected output arguments.
%
% Syntax:
% [Y1,Y2,...] = argout(ARGS,FCN,X1,X2,...)
% will return the output arguments listed in the vector ARGS from the
% function FCN. ARGS must be a vector of integers and FCN must be a
% function handle.
%
% For example,
% indices = argout(2,@sort,x);
% can be used to return only the order index from the sort function
% which is normally the second output argument. You can even reverse
the
% order with
% [indices,x_sorted] = argout([2 1],@sort,x);
 
% written by Douglas M. Schwarz
 
m = max(outargs);
out = cell(1,m);
[out{:}] = fcn(varargin{:});
varargout = out(outargs);
----------------------------------------------------------


This will let you do

  index = argout(2,@sort,x);

or even

  [index,x_sorted] = argout([2 1],@sort,x);

if you choose. Because any unused arguments are assigned inside the
function they go away automatically. Still a one-liner, no clear
statement necessary.

--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jos (10584)

Date: 7 Jan, 2010 06:53:05

Message: 39 of 56

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message

> Btw, do we consider Jos as considered as junior or senior member? I consider myself and Jan as senior members - well relatively to some other. ;-)

For your information, I have been using ML for over 12 years now. This might make a senior for some ... and, although it is difficult to learn new tricks to an old dog, I am still learning a lot here.

Speaking of seniors, has anyone heard something from Roger Stafford lately?

Jos

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 7 Jan, 2010 07:40:23

Message: 40 of 56

"Jos (10584) " <#10584@fileexchange.com> wrote in message <hi40ch$68p$1@fred.mathworks.com>...
>
> Speaking of seniors, has anyone heard something from Roger Stafford lately?
>

Roger showed up briefly few months ago then silence again. I wish him and everybody a happy new year.

Bruno

 

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 7 Jan, 2010 08:48:05

Message: 41 of 56

Dear Sudheer!

Bump. My idea might have been lost in the discussion.

Did you try this?
http://www.mathworks.com/matlabcentral/newsreader/view_thread/269584#706578

Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 7 Jan, 2010 12:37:05

Message: 42 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hi4745$erg$1@fred.mathworks.com>...
> Dear Sudheer!
>
> Bump. My idea might have been lost in the discussion.
>
> Did you try this?
> http://www.mathworks.com/matlabcentral/newsreader/view_thread/269584#706578

I did, it is a very good method speed wise, a little bit hard to read for someone who starts to use Matlab. The two sort commands save a bit of runtime compared to SORTROWS and calls for the fact that Matlab SORT is an *stable* sorting algorithm (prevail the initial order for draw elements) which not many people know how to exploit efficiently, so thank you for having showed us an example.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 7 Jan, 2010 13:17:23

Message: 43 of 56

Dear Bruno!

> I did,

Thank you for testing. I actually thought, this is a job for Sudheer with his original data. Perhaps, I don't want to be bold, but not too shy also: If you could stop doing Sudheer's work and start programming the STORTIND for me... ?!

Perhaps if I'd try the modern tricks?
PLZZZZZ.. Urgent & Emergency & Tomorrow is due & Grandfather is ill !!!..

I feel very junior today.

> a little bit hard to read for someone who starts to use Matlab.

Sorry! I should be improved optically. What about:
  [v, ind1] = sort(x(:, 2));
  [v, ind2] = sort(x(ind1, 1));
==>
  [theRightInstanceIsTaken, theRightInstanceIsTaken] = sort(x(:, 2));
  [theRightInstanceIsTaken2, theRightInstanceIsTaken2] = ...
    sort(x(theRightInstanceIsTaken, 1));
Better. Not Matlab-poetry, but better. I feel very senior today.

Sorry for taking your time - *I* had my funny coffee break, Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 9 Jan, 2010 11:47:04

Message: 44 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hi2u1m$pg5$1@fred.mathworks.com>...
> Dear Bruno!
>
> > Jan, how much do you estimate an inplace sorting (e.g., on large double array) would save time?
>
> It is clear, that the creation of the sort index is the demanding part of SORT, while the copy of the input in sorted order is secondary - usually. Unfortuantely one of the 2 DIMM ports of my computer is damaged and I have to live with 512MB RAM. But saving temporarily used memory is always useful, even on a 16 GB machine. Nevertheless, timing matters for 8MB array already:
>
> For the estimation of the speed gain:
> x = rand(1e6, 1);
> tic; y = sort(x); toc ==> 0.26 sec
> tic; [y, s] = sort(x); toc ==> 0.40 sec
> tic; y = x(s); toc ==> 0.14 sec
>
> So I assume I could save 35% computing time.

Bad news, I implement an in-place 1D quicksort (the unstable version) and it get beaten by Matlab SORT with two outputs calling. My Matlab is 2010A prerelease; I compile the mex with MSVC 6. My guess is inplace sorting requires accessing in cascade the double data, and it slows the thing down.

So I won't go any further. Better spend the effort on something else.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 9 Jan, 2010 11:56:04

Message: 45 of 56

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hi9qbn$qmn$1@fred.mathworks.com>...

>
> Bad news, I implement an in-place 1D quicksort (the unstable version) and it get beaten by Matlab SORT with two outputs calling. My Matlab is 2010A prerelease; I compile the mex with MSVC 6. My guess is inplace sorting requires accessing in cascade the double data, and it slows the thing down.

To avoid confusion, what I call "inplace sorting" is an implementation of the algorithm that works on the indexes without touching the original data.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 9 Jan, 2010 13:11:04

Message: 46 of 56

Dear Bruno!

> Bad news, I implement an in-place 1D quicksort (the unstable version) and it get beaten by Matlab SORT with two outputs calling. My Matlab is 2010A prerelease; I compile the mex with MSVC 6. My guess is inplace sorting requires accessing in cascade the double data, and it slows the thing down.

Your bad news are good news, if I follow the 2nd part of your advice:

> So I won't go any further. Better spend the effort on something else.

I'll spend my effort on something else and the overall speedup gets higher!
Thank you very much, Bruno!

Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 9 Jan, 2010 14:43:03

Message: 47 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hi9v98$qr5$1@fred.mathworks.com>...

>
> Your bad news are good news, if I follow the 2nd part of your advice:

Wait a minute Jan, then there is a new bad news ;-) : I modified an indexing in my code, and now it's faster than Matlab (about -50% for 1e6 elements).

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 9 Jan, 2010 14:50:04

Message: 48 of 56

Arrg forget it, the test is not valid. Sorry.

>
> Wait a minute Jan, then there is a new bad news ;-) : I modified an indexing in my code, and now it's faster than Matlab (about -50% for 1e6 elements).
>
> Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 9 Jan, 2010 16:29:03

Message: 49 of 56

Dear Bruno!

I dig in source of quickersort and stabilized quicksort variants.
Especially the reduction of recursion is interesting for large arrays, e.g. see:
http://www.angelfire.com/pq/jamesbarbetti/articles/sorting/001_QuicksortIsBroken.htm
Then it seems, like it is worth to stop quicksorting and start insort, if the block is smaller than 9 (or 15 according to Knuth) elements.

But simultaneously I try to find an efficient C-algorithm to create permutations: Choose K elements from a vector without repetitions and with order. I hope I do not confuse both programming threads.

CSSMers on saturdays. Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 9 Jan, 2010 19:06:03

Message: 50 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hiaasf$mge$1@fred.mathworks.com>...
> Dear Bruno!
>
> I dig in source of quickersort and stabilized quicksort variants.
> Especially the reduction of recursion is interesting for large arrays, e.g. see:
> http://www.angelfire.com/pq/jamesbarbetti/articles/sorting/001_QuicksortIsBroken.htm
> Then it seems, like it is worth to stop quicksorting and start insort, if the block is smaller than 9 (or 15 according to Knuth) elements.

This is exactly what I have done. My implementation of the sort algorithms are adapted from the sparse subindex package on FEX. I'll post the C-mex file of the stable quick sort in the next (long). For large arrays (1e6-1e7 elements), Matlab beats my code by 2-3 time in speed.

>> A=randn(1,1e6);
>> tic; I=sortidx_mex(A); toc
Elapsed time is 0.295162 seconds.
>> tic; [trash Imatlab]=sort(A); toc
Elapsed time is 0.140457 seconds.
>> isequal(I,Imatlab)

ans =

     1

>>

>
> But simultaneously I try to find an efficient C-algorithm to create permutations: Choose K elements from a vector without repetitions and with order. I hope I do not confuse both programming threads.
>

If you want you can compare with my implementation MIN/MAX Selection on FEX.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 9 Jan, 2010 19:07:02

Message: 51 of 56

Code for you Jan:

/**************************************************************************
 * MATLAB mex function sort_idx.c
 * Calling syntax:
 * >> I = sort_idx(A)
 * returns I such that A(I) is sorted
 * Work in progress
 * Date: 09-Jan-2010
 *************************************************************************/

#include "mex.h"
#include "matrix.h"
#include "string.h"
#include "math.h"

/* Define correct type depending on platform */
#if defined(_MSC_VER) || defined(__BORLANDC__)
typedef unsigned __int32 uint32;
typedef signed __int32 int32;
typedef unsigned __int64 ulong64;
typedef signed __int64 long64;
#define INTMIN64 0x8000000000000000
#else
typedef unsigned int uint32;
typedef signed int int32;
typedef unsigned long long ulong64;
typedef signed long long long64;
#define INTMIN64 0x8000000000000000ULL
#endif

/* Different options for Pivot selection */
/* Different options for Pivot selection */
#define MIDPOINT 0
#define MEDIAN3 1
#define MEDIANMEDIANS 2

/* Pivot Strategy, use one of the above */
#define PIVOT MEDIAN3

#define INSERTSORT 0
#define SHELLSORT 1
#define QUICKSORT 2
#define HEAPSORT 3
#define INTROSORT 4

/* Method for sorting small arrays */
/* INSERTSORT | SHELLSORT */
#define BASESORT INSERTSORT

/* Method dor sorting large arrays */
/* QUICKSORT | HEAPSORT | INTROSORT */
#define MAINSORT QUICKSORT

/* Pivot Strategy, use one of the above */
#define PIVOT MEDIAN3

/* Increment for shellshort */
#define SizeIncs 3
mwIndex incs[SizeIncs] = { 7, 3, 1 };

/* A limit below the recusion stops, we which we switch to InsertionSort */
/* This must be smaller than */
#define SWITCH_ALGO_THRESHOLD 12

/* TYPE */
/* Type of elements to be sorted */
#define ARRAYTYPE double

/* Global variables, used to avoid stacking them during recusive call since
   they do not change */
mwIndex *pos;
ARRAYTYPE *list;

/*************************************************************************/
/* Insertion sort, this is stable */
void InsertionSort(mwIndex left, mwIndex right)
{
    mwIndex pival;
    mwIndex *pi, *pj, *pleft, *pright;
    ARRAYTYPE Value;
    
    if (right > left) {
        
        pleft = pos+left;
        pright = pos+right;
        for (pi=pleft+1; pi<=pright; pi++) {
            Value = list[pival=(*pi)];
            pj = pi-1;
            while ((pj>=pleft) && (list[*pj]>Value)) { /* Comparison */
                *(pj+1) = *pj; /* Permute */
                pj--;
            }
            *(pj+1) = pival; /* Insert */
        }
    }
    
    return;
} /* InsertionSort */

/*************************************************************************/

/* Shell sort */
void ShellSort(mwIndex left, mwIndex right)
{
    mwIndex pival, pjval, h, k;
    mwIndex *pi, *pj, *ph;
    mwIndex *pleft, *pright;
    ARRAYTYPE Value;
    
    if (right > left) {
        
        pleft = pos+left;
        pright = pos+right;
        
        /* Loop on gap size */
        for ( k = 0; k < SizeIncs; k++) {
            h = incs[k];
            ph = pos + h;
            /* Insertion loop */
            for (pi=pleft+h; pi<=pright; pi++)
            {
                Value = list[pival=*pi];
                pj = pi;
                while ((pj>=ph) && (list[pjval=*(pj-h)]>Value)) { /* Comparison */
                    *pj = pjval; /* Permute */
                    pj -= h;
                }
                *pj = pival; /* Insert */
            }
        }
    }
    
    return;
} /* ShellSort */

/*************************************************************************/
/*Find the index of the Median of the elements
of array that occur at every "shift" positions.*/
mwIndex findMedianIndex(mwIndex left, mwIndex right, mwIndex shift)
{
    mwIndex tmp, groups, n;
    ARRAYTYPE minValue;
    mwIndex *pi, *pj, *pk, *pright, *pminIndex;
    
    groups = (right-left)/shift + 1;
    pk = pos + (n = left + (groups/2)*shift);
    pright = pos + right;
    for (pi=pos+left; pi<=pk; pi+= shift)
    {
        pminIndex = pi;
        minValue = list[*pminIndex];
        
        for (pj=pi; pj<=pright; pj+=shift)
            if (list[*pj]<minValue) /* Comparison */
                minValue = list[*(pminIndex=pj)];
        /* Swap pos[i] with pos[minIndex] */
        tmp = *pi;
        *pi = *pminIndex;
        *pminIndex = tmp;
    }
    
    return n;
    
} /* findMedianIndex */

/*Computes the median of each group of 5 elements and stores
  it as the first element of the group (left). Recursively does this
  till there is only one group and hence only one Median */
mwIndex findMedianOfMedians(mwIndex left, mwIndex right)
{
    mwIndex i, shift, step, tmp;
    mwIndex endIndex, medianIndex;
           
    if (left==right) return left;
   
    shift = 1;
    while (shift <= (right-left))
    {
        step=shift*5;
        for (i=left; i<=right; i+=step)
        {
            if ((endIndex=i+step-1)>=right)
                endIndex=right;
            medianIndex = findMedianIndex(i, endIndex, shift);
            /* Swap pos[i] with pos[medianIndex] */
            tmp = pos[i];
            pos[i] = pos[medianIndex];
            pos[medianIndex] = tmp;
        }
        shift = step;
    }
    return left;
} /* findMedianOfMedians */

/*************************************************************************/
/*Computes the median of three points (left,right,and mid) */
mwIndex findMedianThree(mwIndex left, mwIndex right)
{
    ARRAYTYPE vleft, vright, vmid;
    mwIndex mid;
    
    if (left==right) return left;
    
    vleft = list[pos[left]];
    vright = list[pos[right]];
    vmid = list[mid = (left+right+1)/2];
    if (vleft<vright)
    {
        if (vmid>vright)
            return right;
        else if (vmid<vleft)
            return left;
        else
            return mid;
        
    } else { /* (vleft>=vright) */
        
        if (vmid>vleft)
            return left;
        else if (vmid<vright)
            return right;
        else
            return mid;
        
    }
} /* findMedianThree */

/*************************************************************************/
/* Partitioning the list around pivot pivotValue := l[pivotIndex];
 * After runing, at exit we obtain:
   l[left]...l[index-1] < pivotValue <= l[index] ... l[right]
   where l[i] := list[pos[i]] for all i */
mwIndex PartitionV2(mwIndex left, mwIndex right, mwIndex pivotIndex) {
    
    ARRAYTYPE pivotValue, li;
    mwIndex *pindex, *pright, *pleft;
    mwIndex *pfirst, *plast;
    mwIndex tmp, ip, il, ir;
    
    plast=pos+right;
    pindex=pos+pivotIndex;
    pivotValue = list[tmp = *pindex];
    /* Swap pos[pivotIndex] with pos[right] */
    *pindex = *plast;
    ip = *plast = tmp;
    
    pfirst = pleft = pos+left;
    pright = plast-1;
    for (;;) {
        
        li = list[il = *pleft];
        while ((li<pivotValue) ||
               (li==pivotValue && il<ip)) /* <- test for stable quicksort */
            li = list[il = *(++pleft)];
        
        li = list[ir = *(pright)];
        while ((pright>pleft) &&
               ((li>pivotValue) ||
                (li==pivotValue && ir>ip))) /* <- test for stable quicksort */
            li = list[ir = *(--pright)];
        
        if (pleft<pright) {
            *pleft = ir;
            *pright = il;
            ++pleft, --pright;
        }
        else {
            *pleft = *plast;
            *plast = il;
            return (mwIndex)(pleft-pos); /* Pointer arithmetic */
        }
    }
} /* PartitionV2 */

int CheckPartition(mwIndex left, mwIndex right, mwIndex pivotIndex)
{
    mwIndex i;
    ARRAYTYPE pivotValue;
    
    pivotValue = list[pos[pivotIndex]];
    for (i=left;i<=pivotIndex-1;i++)
        if (list[pos[i]]>pivotValue)
            return 0;
    for (i=right;i>=pivotIndex+1;i--)
        if (list[pos[i]]<pivotValue)
            return 0;
    return 1;
}

/*************************************************************************/
void DownHeap(mwIndex i, mwIndex n, mwIndex lo) {
    ARRAYTYPE d;
    mwIndex child, posd, nhalf;
    mwIndex *plo, *pchild;
    
    plo = pos + lo;
    posd = *(plo+i-1);
    d = list[posd];
    nhalf = n/2;
    while (i<=nhalf) {
        child = 2*i;
        pchild = plo + child;
        if ((child < n) &&
            (list[*(pchild-1)] < list[*pchild]))
        {
            child++;
            pchild++;
        }
        if (d >= list[*(pchild-1)]) break;
        *(plo+i-1) = *(pchild-1);
        i = child;
    }
    *(plo+i-1) = posd;
    
    return;
} /* DownHeap */

/*http://users.encs.concordia.ca/~chvatal/notes/hsort.html*/
void HeapSort(mwIndex lo, mwIndex hi) { /* (lo,hi) included */
    mwIndex n, i, tmp;
    mwIndex *plo;
    
    hi++;
    n = hi-lo;
            
    for (i=n/2; i>=1; i--)
        DownHeap(i, n, lo);

    plo = pos + lo;
    for (i=n; i>1; i--) {
        
        tmp = *(plo);
        *(plo) = *(plo+i-1);
        *(plo+i-1) = tmp;
        
        DownHeap(1, i-1, lo);
    }
    return;
    
} /* HeapSort */

/*************************************************************************/

void introsort_loop(mwIndex lo, mwIndex hi, int depth_limit)
 {
    mwIndex pivotIndex;
    while (hi+1 >= lo + SWITCH_ALGO_THRESHOLD) {
        if (depth_limit == 0) {
            HeapSort(lo, hi);
            return;
        }
        depth_limit--;
        
#if (PIVOT==MEDIANMEDIANS)
        pivotIndex = findMedianOfMedians(lo, hi);
#elif (PIVOT==MEDIAN3)
        pivotIndex = findMedianThree(lo, hi);
#else /* MIDPOINT */
        pivotIndex = (left+right+1)/2;
#endif
        pivotIndex = PartitionV2(lo, hi, pivotIndex);
        introsort_loop(pivotIndex+1, hi, depth_limit);
        hi = pivotIndex-1;
    }
    return;
} /* introsort_loop */

int floor_lg(mwIndex a) {
    return (int)(floor(log((double)a)/log(2.0)));
} /* floor_lg */


void IntroSort(mwIndex left, mwIndex right)
{
    int depthlimit;
    depthlimit=2*floor_lg(right-left+1);
    
    introsort_loop(left, right, depthlimit);
#if (BASESORT==INSERTSORT)
    InsertionSort(left, right);
#else /* #if (BASESORT==SHELLSORT)*/
    ShellSort(left, right);
#endif
    
    return;
} /* IntroSort */

/*************************************************************************/
/* Recursive engine quicksort */
void quicksort(mwIndex left, mwIndex right) {
    
    mwIndex pivotIndex;

    /* Switch to Insertion sort for small size array */
    if (right+1 < left + SWITCH_ALGO_THRESHOLD)
    {
#if (BASESORT==INSERTSORT)
        InsertionSort(left, right);
#else /* #if (BASESORT==SHELLSORT)*/
        ShellSort(left, right);
#endif
    }
    else
    {
#if (PIVOT==MEDIANMEDIANS)
        pivotIndex = findMedianOfMedians(left, right);
#elif (PIVOT==MEDIAN3)
        pivotIndex = findMedianThree(left, right);
#else /* MIDPOINT */
        pivotIndex = (left+right+1)/2;
#endif
        
        pivotIndex = PartitionV2(left, right, pivotIndex);
        quicksort(left, pivotIndex-1);
        quicksort(pivotIndex+1, right);
    }
    
    return;
} /* quicksort */

/* Allocate and Initialize the table of position before sorting */
void AllocatePos(mwIndex l)
{
    mwIndex n;
    
    /* clean programming */
    pos=NULL;
   
    if (l>0) {
        pos = mxMalloc(sizeof(mwIndex)*l);
        if (pos==NULL)
            mexErrMsgTxt("Out of memory (pos).");
        /* Initialize the array of position (zero-based index) */
        for (n=0; n<l; n++) pos[n]=n;
    }
    
    return;
} /* AllocatePos */

/* Free the position */
void FreePos(void)
{
    /* Free the array of position */
    if (pos) mxFree(pos);
    pos = NULL; /* clean programming */
    
    /* Free the array of values int64 */
    /* if (list) mxFree(list); */
    list = NULL; /* clean programming */
    
    return;
} /* FreePos */

         
/* Return the platform dependent maximum size of for mwIndex */
mxClassID IdxClass(void) {
    switch (sizeof(mwIndex)) {
        case 4:
            return mxINT32_CLASS; /* 2^31-1 */
        case 8:
            return mxINT64_CLASS; /* 2^48-1 */
    }
} /* MaxSize */

/* MainSortDbl, sorting interface for double */
void MainSortDbl(ARRAYTYPE *list, mwIndex l) {

    mwIndex n;
    
    /* Work for non-empty array */
    if (l>1 && list!=NULL && pos!=NULL) {
        /* Call the recursive engine */
#if (MAINSORT==QUICKSORT)
        quicksort(0, l-1);
#elif (MAINSORT==INTROSORT)
        IntroSort(0, l-1);
#elif (MAINSORT==HEAPSORT)
        HeapSort(0, l-1);
#else
#error "Non valid MAINSORT"
#endif
    } /* if (l>0) */
    
    /* Convert zero-base indexing to Matlab one-base */
    for (n=0; n<l; n++) (pos[n])++;
   
    return;

} /* MainSortDbl */

/* Free a pointer */
void* MyFree(void* Ptr)
{
    if (Ptr)
        mxFree(Ptr);
    return NULL;
} /* Free */


/* Gateway routine sort_idx */
void mexFunction(int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[]) {
    
    const mxArray *A;
    mwIndex m, n, l;
    mxClassID DataClass;
    mxArray* I;
    mwSize dimI[2];
    
    if (nrhs!=1)
        mexErrMsgTxt("One input argument is required.");
    
    /* Check data type of input argument */
    
    A = prhs[0];
    if (mxIsSparse(A))
        mexErrMsgTxt("First input argument must be a full matrix.");
    
    DataClass = mxGetClassID(A);
    if ((DataClass != mxDOUBLE_CLASS))
        mexErrMsgTxt("A must be of class DOUBLE.");
    
    if (mxGetNumberOfDimensions(A) != 2)
        mexErrMsgTxt("First input argument must be a vector.");
    
    m = (mwIndex)mxGetM(A);
    n = (mwIndex)mxGetN(A);
    
    if ((m > 1) && (n > 1))
        mexErrMsgTxt("First input argument must be a vector.");
    
    /* Get the total number elements of A */
    l = m*n;
    
    /* Allocate the array of index and get the data pointer */
    AllocatePos(l);
    /* Pointers on data */
    list = mxGetPr(A);
    
    /* Sort the indexes, the result is put on pos[] */
    MainSortDbl(list, l);
    
    /* Return the index array */
    dimI[0] = 0; dimI[1] = 0;
    I = mxCreateNumericArray(2, dimI, IdxClass(), mxREAL);
    mxSetPr(I, pos);
    mxSetM(I, m);
    mxSetN(I, n);
    plhs[0] = I;
           
    return;
    
} /* sort_idx */

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 9 Jan, 2010 19:28:04

Message: 52 of 56


> >
> > But simultaneously I try to find an efficient C-algorithm to create permutations: Choose K elements from a vector without repetitions and with order. I hope I do not confuse both programming threads.
> >
>
> If you want you can compare with my implementation MIN/MAX Selection on FEX.
>

Sorry, I missread your post. Both problems have no common relation whatsoever. So please ignore.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 9 Jan, 2010 22:59:04

Message: 53 of 56

Dear Bruno!

Nice piece of code.
While I can often read some Matlab Haikus, which gain the efficiency from the compactness of Matlab, this sort squeezes the dull simplicity of C.
Thanks! Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 10 Jan, 2010 12:30:19

Message: 54 of 56

"Jan Simon" <matlab.THIS_YEAR@nMINUSsimon.de> wrote in message <hib1no$q8a$1@fred.mathworks.com>...
> Dear Bruno!
>
> Nice piece of code.
> While I can often read some Matlab Haikus, which gain the efficiency from the compactness of Matlab, this sort squeezes the dull simplicity of C.
> Thanks! Jan

Still I'm intrigued: why my code 2-3 times slower than Matlab. Is it due to double indexing because of the reference sorting as I suspected? Or Matlab library uses faster compiler (my code compiled with LCC is much slower - a factor 2 with respect to MSVC)? Algorithm optimization? Something else?

I have cleaned up the code a little bit to make it slightly faster, but it get nowhere close to Matlab sorting speed.

Bruno

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Jan Simon

Date: 10 Jan, 2010 18:29:03

Message: 55 of 56

Dear Bruno!

> I have cleaned up the code a little bit to make it slightly faster, but it get nowhere close to Matlab sorting speed.

After I've read your code, I found this:
  http://www.netlib.org/napack/sort.f
I cannot find the correct terms to describe my reaction - so I stay at :-)

LCC3.8 from the net creates faster code. The MEXOPTS.BAT needs the surprising modification, that to RSP-file indicator like "@" can be used with this LCC.

Let's ask Loren for the source of SORT - I do not know, if it is her code, but I assume, she knows the author at least.

Kind regards, Jan

Subject: Getting indexes of rows of matrix with more than n repetitions

From: Bruno Luong

Date: 10 Jan, 2010 22:06:03

Message: 56 of 56

There are few bugs in my previous indexing sorting code. Here is a new one. The default quicksort is out: none of the pivoting strategy I tested is completely satisfied.

Bruno

/**************************************************************************
 * MATLAB mex function sort_idx.c
 * Calling syntax:
 * >> I = sort_idx(A)
 * returns I such that A(I) is sorted
 *
 * Work in still in progress...
 *
 * Compile on 32-bit platform
 * mex -O -v sort_idx.c
 * On 64-bit platform
 * mex -v -O -largeArrayDims sort_idx.c
 *
 * Author Bruno Luong <brunoluong@?????.com>
 * Date: 10-Jan-2010
 *************************************************************************/

#include "mex.h"
#include "matrix.h"
#include "string.h"
#include "math.h"

/* Define correct type depending on platform */
#if defined(_MSC_VER) || defined(__BORLANDC__)
typedef unsigned __int32 uint32;
typedef signed __int32 int32;
typedef unsigned __int64 ulong64;
typedef signed __int64 long64;
#define INTMIN64 0x8000000000000000
#else
typedef unsigned int uint32;
typedef signed int int32;
typedef unsigned long long ulong64;
typedef signed long long long64;
#define INTMIN64 0x8000000000000000ULL
#endif

/* Different options for Pivot selection */
/* Different options for Pivot selection */
#define MIDPOINT 0
#define MEDIAN3 1
#define MEDIANMEDIANS 2

/* Pivot Strategy, use one of the above */
#define PIVOT MEDIAN3

#define INSERTSORT 0
#define SHELLSORT 1
#define QUICKSORT 2
#define HEAPSORT 3
#define INTROSORT 4

/* Method for sorting small arrays */
/* INSERTSORT | SHELLSORT */
#define BASESORT INSERTSORT

/* Method dor sorting large arrays */
/* QUICKSORT | HEAPSORT | INTROSORT */
#define MAINSORT INTROSORT

/* Increment for shellshort */
#define SizeIncs 3
mwIndex incs[SizeIncs] = { 7, 3, 1 };

/* A limit below the recusion stops, we which we switch to InsertionSort */
/* This must be smaller than */
#define SWITCH_ALGO_THRESHOLD 12

/* TYPE */
/* Type of elements to be sorted */
#define ARRAYTYPE double

/* Global variables, used to avoid stacking them during recusive call since
   they do not change */
mwIndex *pos;
ARRAYTYPE *list;

/*************************************************************************/
/* Insertion sort, this is stable */
void InsertionSort(mwIndex left, mwIndex right)
{
    mwIndex pival;
    mwIndex *pi, *pj, *pleft, *pright;
    ARRAYTYPE Value;

    pleft = pos+left;
    pright = pos+right;
    for (pi=pleft; pi<pright; pi++) {
        Value = list[pival=(*(pi+1))];
        pj = pi;
        
        while ((pj>=pleft) && (list[*pj]>Value)) { /* Comparison */
            *(pj+1) = *pj; /* Permute */
            pj--;
        }
        *(pj+1) = pival; /* Insert */
    } /* for */
    
    return;
} /* InsertionSort */

/*************************************************************************/

/* Shell sort */
void ShellSort(mwIndex left, mwIndex right)
{
    mwIndex pival, pjval, h, k;
    mwIndex *pi, *pj, *ph;
    mwIndex *pleft, *pright;
    ARRAYTYPE Value;
    
    if (right > left) {
        
        pleft = pos+left;
        pright = pos+right;
        
        /* Loop on gap size */
        for ( k = 0; k < SizeIncs; k++) {
            h = incs[k];
            ph = pos + h;
            /* Insertion loop */
            for (pi=pleft+h; pi<=pright; pi++)
            {
                Value = list[pival=*pi];
                pj = pi;
                while ((pj>=ph) && (list[pjval=*(pj-h)]>Value)) { /* Comparison */
                    *pj = pjval; /* Permute */
                    pj -= h;
                }
                *pj = pival; /* Insert */
            }
        }
    }
    
    return;
} /* ShellSort */

/*************************************************************************/
/*Find the index of the Median of the elements
of array that occur at every "shift" positions.*/
mwIndex findMedianIndex(mwIndex left, mwIndex right, mwIndex shift)
{
    mwIndex tmp, groups, n;
    ARRAYTYPE minValue;
    mwIndex *pi, *pj, *pk, *pright, *pminIndex;
    
    groups = (right-left)/shift + 1;
    pk = pos + (n = left + (groups/2)*shift);
    pright = pos + right;
    for (pi=pos+left; pi<=pk; pi+= shift)
    {
        pminIndex = pi;
        minValue = list[*pminIndex];
        
        for (pj=pi; pj<=pright; pj+=shift)
            if (list[*pj]<minValue) /* Comparison */
                minValue = list[*(pminIndex=pj)];
        /* Swap pos[i] with pos[minIndex] */
        tmp = *pi;
        *pi = *pminIndex;
        *pminIndex = tmp;
    }
    
    return n;
    
} /* findMedianIndex */

/*Computes the median of each group of 5 elements and stores
  it as the first element of the group (left). Recursively does this
  till there is only one group and hence only one Median */
mwIndex findMedianOfMedians(mwIndex left, mwIndex right)
{
    mwIndex i, shift, step, tmp;
    mwIndex endIndex, medianIndex;
           
    if (left==right) return left;
   
    shift = 1;
    while (shift <= (right-left))
    {
        step=shift*5;
        for (i=left; i<=right; i+=step)
        {
            if ((endIndex=i+step-1)>=right)
                endIndex=right;
            medianIndex = findMedianIndex(i, endIndex, shift);
            /* Swap pos[i] with pos[medianIndex] */
            tmp = pos[i];
            pos[i] = pos[medianIndex];
            pos[medianIndex] = tmp;
        }
        shift = step;
    }
    return left;
} /* findMedianOfMedians */

/*************************************************************************/
/*Computes the median of three points (left,right,and mid) */
mwIndex findMedianThree(mwIndex left, mwIndex right)
{
    ARRAYTYPE vleft, vright, vmid;
    mwIndex mid;
    
    if (left==right) return left;
    
    vleft = list[pos[left]];
    vright = list[pos[right]];
    vmid = list[pos[mid = (left+right+1)/2]];
    
    if (vleft<vright)
    {
        if (vmid>vright)
            return right;
        else if (vmid<vleft)
            return left;
        else
            return mid;
        
    } else { /* (vleft>=vright) */
        
        if (vmid>vleft)
            return left;
        else if (vmid<vright)
            return right;
        else
            return mid;
        
    }
} /* findMedianThree */

/*************************************************************************/
/* Partitioning the list around pivot pivotValue := l[pivotIndex];
 * After runing, at exit we obtain:
   l[left]...l[index-1] < pivotValue <= l[index] ... l[right]
   where l[i] := list[pos[i]] for all i */
mwIndex PartitionV2(mwIndex left, mwIndex right, mwIndex pivotIndex) {
    
    ARRAYTYPE pivotValue, li;
    mwIndex *pindex, *pright, *pleft;
    mwIndex *plast;
    mwIndex ip, il, ir;
    
    plast=pos+right;
    pindex=pos+pivotIndex;
    pivotValue = list[ip = *pindex];
    /* Swap pos[pivotIndex] with pos[right] */
    *pindex = *plast;
    *plast = ip;
    
    pleft = pos+left;
    pright = plast-1;
    while (1) {
        
        li = list[il = *pleft];
        while ((li<pivotValue) ||
               (li==pivotValue && il<ip)) /* <- test for stable sort */
            li = list[il = *(++pleft)];
        
        li = list[ir = *(pright)];
        while ((pright>pleft) &&
               ((li>pivotValue) ||
                (li==pivotValue && ir>ip))) /* <- test for stable sort */
            li = list[ir = *(--pright)];
        
        if (pleft<pright) {
            *(pleft++) = ir;
            *(pright--) = il;
        }
        else {
            /* swap left with the pivot */
            *pleft = *plast;
            *plast = il;
            return (mwIndex)(pleft-pos); /* Pointer arithmetic */
        }
    }
} /* PartitionV2 */

/*************************************************************************/
void DownHeap(mwIndex i, mwIndex n, mwIndex lo) {
    ARRAYTYPE d;
    mwIndex child, posd, nhalf;
    mwIndex *plo, *pchild;
    
    plo = pos + lo;
    posd = *(plo+i-1);
    d = list[posd];
    nhalf = n/2;
    while (i<=nhalf) {
        child = 2*i;
        pchild = plo + child;
        if ((child < n) &&
            (list[*(pchild-1)] < list[*pchild]))
        {
            child++;
            pchild++;
        }
        if (d >= list[*(pchild-1)]) break;
        *(plo+i-1) = *(pchild-1);
        i = child;
    }
    *(plo+i-1) = posd;
    
    return;
} /* DownHeap */

/*http://users.encs.concordia.ca/~chvatal/notes/hsort.html*/
void HeapSort(mwIndex lo, mwIndex hi) { /* (lo,hi) included */
    mwIndex n, i, tmp;
    mwIndex *plo;
    
    hi++;
    n = hi-lo;
            
    for (i=n/2; i>=1; i--)
        DownHeap(i, n, lo);

    plo = pos + lo;
    for (i=n; i>1; i--) {
        
        tmp = *(plo);
        *(plo) = *(plo+i-1);
        *(plo+i-1) = tmp;
        
        DownHeap(1, i-1, lo);
    }
    return;
    
} /* HeapSort */

/*************************************************************************/

void introsort_loop(mwIndex lo, mwIndex hi, int depth_limit)
 {
    mwIndex pivotIndex;
    while (hi+1 >= lo + SWITCH_ALGO_THRESHOLD) {
        if (depth_limit == 0) {
            HeapSort(lo, hi);
            return;
        }
        depth_limit--;
        
#if (PIVOT==MEDIANMEDIANS)
        pivotIndex = findMedianOfMedians(lo, hi);
#elif (PIVOT==MEDIAN3)
        pivotIndex = findMedianThree(lo, hi);
#else /* MIDPOINT */
        pivotIndex = (lo+hi+1)/2;
#endif
        pivotIndex = PartitionV2(lo, hi, pivotIndex);
        introsort_loop(pivotIndex+1, hi, depth_limit);
        hi = pivotIndex-1;
    }
    return;
} /* introsort_loop */

int floor_lg(mwIndex a) {
    return (int)(floor(log((double)a)/log(2.0)));
} /* floor_lg */


void IntroSort(mwIndex left, mwIndex right)
{
    int depthlimit;
    depthlimit=2*floor_lg(right-left+1);
    
    introsort_loop(left, right, depthlimit);
#if (BASESORT==INSERTSORT)
    InsertionSort(left, right);
#else /* #if (BASESORT==SHELLSORT)*/
    ShellSort(left, right);
#endif
    
    return;
} /* IntroSort */

/*************************************************************************/
/* Recursive engine quicksort */
void quicksort(mwIndex left, mwIndex right) {
    
    mwIndex pivotIndex;

    /* Switch to Insertion sort for small size array */
    if (right+1 < left + SWITCH_ALGO_THRESHOLD)
    {
#if (BASESORT==INSERTSORT)
        InsertionSort(left, right);
#else /* #if (BASESORT==SHELLSORT)*/
        ShellSort(left, right);
#endif
    }
    else
    {
#if (PIVOT==MEDIANMEDIANS)
        pivotIndex = findMedianOfMedians(left, right);
#elif (PIVOT==MEDIAN3)
        pivotIndex = findMedianThree(left, right);
#else /* MIDPOINT */
        pivotIndex = (left+right+1)/2;
#endif
        
        pivotIndex = PartitionV2(left, right, pivotIndex);
        quicksort(left, pivotIndex-1);
        quicksort(pivotIndex+1, right);
    }
    
    return;
} /* quicksort */

/* Allocate and Initialize the table of position before sorting */
void AllocatePos(mwIndex l)
{
    mwIndex n;
    
    /* clean programming */
    pos=NULL;
   
    if (l>0) {
        pos = mxMalloc(sizeof(mwIndex)*l);
        if (pos==NULL)
            mexErrMsgTxt("Out of memory (pos).");
        /* Initialize the array of position (one-based Matlab index) */
        pos--;
        for (n=1; n<=l; n++) pos[n]=n;
        pos++;
    }
    
    return;
} /* AllocatePos */

         
/* Return the platform dependent the class corresponding to mwIndex */
mxClassID IdxClass(void) {
    switch (sizeof(mwIndex)) {
        case 4:
            return mxINT32_CLASS; /* 2^31-1 */
        case 8:
            return mxINT64_CLASS; /* 2^48-1 */
    }
} /* MaxSize */

/* MainSortDbl, sorting interface for double */
void MainSortDbl(mwIndex l) {
    
    /* Allocate the array of index and get the data pointer */
    AllocatePos(l);
    
    /* We reduce the pointer by 1 temporary
       to cope with Matlab one-base indexing */
    list--;
    /* Work for non-empty array */
    if (l>1 && list!=NULL && pos!=NULL) {
        /* Call the recursive engine */
#if (MAINSORT==QUICKSORT)
        quicksort(0, l-1);
#elif (MAINSORT==INTROSORT)
        IntroSort(0, l-1);
#elif (MAINSORT==HEAPSORT)
        HeapSort(0, l-1);
#else
#error "Non valid MAINSORT"
#endif
    } /* if (l>0) */
    list++;
   
    return;

} /* MainSortDbl */

/* Gateway routine sort_idx */
void mexFunction(int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[]) {
    
    const mxArray *A;
    mwIndex m, n, l;
    mxClassID DataClass;
    mxArray* I;
    mwSize dimI[2];
    
    if (nrhs!=1)
        mexErrMsgTxt("One input argument is required.");
    
    /* Check data type of input argument */
    
    A = prhs[0];
    if (mxIsSparse(A))
        mexErrMsgTxt("First input argument must be a full matrix.");
    
    DataClass = mxGetClassID(A);
    if ((DataClass != mxDOUBLE_CLASS))
        mexErrMsgTxt("A must be of class DOUBLE.");
    
    if (mxGetNumberOfDimensions(A) != 2)
        mexErrMsgTxt("First input argument must be a vector.");
    
    m = (mwIndex)mxGetM(A);
    n = (mwIndex)mxGetN(A);
    
    if ((m > 1) && (n > 1))
        mexErrMsgTxt("First input argument must be a vector.");
    
    /* Get the total number elements of A */
    l = m*n;

    /* Pointers on data */
    list = mxGetPr(A);
    
    /* Sort the indexes, the result is put on pos[] */
    MainSortDbl(l);
    
    /* Return the index array */
    dimI[0] = 0; dimI[1] = 0;
    I = mxCreateNumericArray(2, dimI, IdxClass(), mxREAL);
    mxSetData(I, pos);
    mxSetM(I, m);
    mxSetN(I, n);
    plhs[0] = I;
           
    return;
    
} /* sort_idx */

Tags for this Thread

What are tags?

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

Contact us