File Exchange

image thumbnail


version (19.3 KB) by Jan
Random permutation of array elements, C-Mex: much faster than RANDPERM


Updated 06 Mar 2011

View Version History

View License

Shuffle - Random permutation of array elements

This function is equivalent to X(RANDPERM(LENGTH(X)), but 50% to 85% faster. It uses D.E. Knuth's shuffle algorithm (also called Fisher-Yates) and the cute KISS random number generator (G. Marsaglia). While RANDPERM needs 2*LENGTH(X)*8 bytes as temporary memory, SHUFFLE needs just a fixed small number of bytes.

1. Inplace shuffling: Y = Shuffle(X, Dim)
X: DOUBLE, SINGLE, CHAR, LOGICAL, (U)INT64/32/16/8 array.
Dim: Dimension to operate on. Optional, default: 1st non-singleton dimension.
Y: Array of same type and size as X with shuffled elements.

2. Create a shuffle index: Index = Shuffle(N, 'index', NOut)
This is equivalent to Matlab's RANDPERM, but much faster, if N is large and NOut is small.
N: Integer number.
NOut: The number of output elements. Optional, default: N.
Index: [1:NOut] elements of shuffled [1:N] vector in the smallest possible integer type.

3. Derangement index:
Index = Shuffle(N, 'derange', NOut)
Equivalent to the index method, but all Index[i] ~= i. A rejection method is used: Create an index vector until a derangement is gained.

R = Shuffle(1:8) % [8, 1, 2, 6, 4, 3, 5, 7]
R = Shuffle('abcdefg') % 'efbadcg'
R = Shuffle([1:4; 5:8], 2) % [3, 2, 1, 4; 6, 8, 7, 5]
I = Shuffle(8, 'index'); % UINT8([1, 5, 7, 6, 2, 3, 4, 8])
Choose 10 different rows from a 1000 x 100 matrix:
X = rand(1000, 100); Y = X(Shuffle(1000, 'index', 10), :);
Operate on cells or complex arrays:
C = {9, 's', 1:5}; SC = C(Shuffle(numel(C), 'index'));
M = rand(3) + i * rand(3); SM = M(:, Shuffle(size(C, 2), 'index'))

NOTES: There are several other shuffle functions in the FEX. Some use Knuth's method also, some call RANDPERM. This implementation is faster due to calling a compiled MEX file and it has a smaller memory footprint. The KISS random numbers are much better than the RAND() of the C-standard libs.

Run the unit-test TestShuffle to test validity and speed (see screenshot).
Tested: Matlab 6.5, 7.7, 7.8, 32bit, WinXP,
Compiler: LCC 2.4/3.8, BCC 5.5, Open Watcom 1.8, MSVC 2008.
Compatibility to 64 bit, Linux and Mac is assumed.
Pre-compiled Mex:

Cite As

Jan (2021). Shuffle (, MATLAB Central File Exchange. Retrieved .

Comments and Ratings (39)


@Mike Angstadt: Of course you can use a tool for a job, for which it is not efficient, e.g. because it is designed to solve another problem efficiently. Then your down voting means, that you blame the tool for not solving a problem, for which it is not made.
Calling Y = Shuffle(rand(6000, 87000)) is not efficient. Please read the documentation to find the better solution:
Y = X(Shuffle(6000, 'Index'), :), which has a comparable speed as the RANPERM approach. But even here Shuffle is not faster than current Matlab versions. For other jobs Shuffle is 4 times faster than RANDPERM and it is recommended to use it only for these cases.


@Cai Chin: As explained in the documentation you can set a seed:
Shuffle(randi([0, 2^32-1], 1, 4), 'seed').

Mike Angstadt

On R2015b for a large matrix (6000x87000) it takes about 3 times longer than just doing:
Xs = X(randperm(N),:);

Cai Chin

Is it possible to set a seed so the function acts constantly across several arrays?

James Tursa

Jan, are there plans to update this code for R2018a API? Looks like it would be fairly easy to add 16-byte shuffle routines and some minor up-front changes to accommodate the new interleaved complex data memory layout.

Duarte Pereira

Stephen Becker

This looks like very high quality code, and I got it to compile nicely on my computer, and running the unit test, it showed a great speedup over Matlab's "randperm", especially when I use the "index" mode and only want k of n entries (which is my typical use case). For example, it gave

Choose 1000 from [1 x 1000000], 10 loops:
RANDPERM(N), [1:nOut]: 0.678 sec
SHUFFLE(N, index, nOut)) Mex: 0.007 sec ==> 1.1% of RANDPERM

However, looking at the code for the unit test, it was calling randperm(n) and then sampling the output. It turns out Matlab updated "randperm" in R2011b so that it now supports a "randperm(n,k)" calling sequence. If I use that calling sequence in the unit test, the results are not so favorable any more:

Choose nOut (=1000) from [1 x 1000000], 14000 loops:
RANDPERM(N), [1:nOut]: 0.831 sec
SHUFFLE(N, 'index', nOut)) Mex: 13.433 sec ==> 1617.3% of RANDPERM

In other cases, the provided code certain does well, so it's competitive with Matlab's implementation, but in short, it seems this code was very useful because Matlab's old code was so bad, and now Mathworks has caught up and given a reasonable implementation.



@Stephen: The statement about X(RANDPERM(LENGTH(X)) concerns vectors only. For arrays this function Shuffle() operates on a specific dimension and there is no comparable one-line command containing RANDPERM. So for arrays Shuffle() does *not* perform X(randperm(numel(X))) .

Stephen Cobeldick

The description is incorrect, because |length| does not do what the author seems to think it does. For arrays |length| is not |numel|!. The description states that it does this (using |length|):

>> X = randi(9,3)
X =
9 4 3
6 3 7
4 6 1
>> X(randperm(length(X)))
ans =
4 6 9

Whereas the function actually returns _all_ elements, as per |numel|:

>> X(randperm(numel(X)))
ans =

1 7 6 3 4 6 9 4 3


Willem-Jan de Goeij

Dear Jan,

I've tested calling your function from inside a parfor loop. The results show that repetitions occur.

>> matlabpool local
Starting matlabpool using the 'local' configuration ... connected to 3 labs.
>> shuffleparfor
output =

8 1 3 7 5 2 6 4
2 3 8 7 1 5 6 4
8 1 3 7 5 2 6 4
2 3 8 7 1 5 6 4
8 1 3 7 5 2 6 4
2 3 8 7 1 5 6 4
4 8 7 3 6 2 5 1
4 8 7 3 6 2 5 1
4 8 7 3 6 2 5 1

output = zeros(9,8);
Shuffle(1234567890, 'seed');

parfor ii=1:9
output(ii, :) = Shuffle(8, 'index');

Willem-Jan de Goeij

Hi Jan,

In reply to your first comment from 31/07:
Setting the seed to randi([0, 2^32-1], 1, 4) in every parfor loop would be advisible, I guess?
On my computer 2^32-1+d becomes zero when casted to uint32_T, with d >= 0.999999762.

I'm setting seeds because I also want to be able to reproduce the results.

Willem-Jan de Goeij

I use parfor loops on my 64 bit computer with Matlab 64 bit.

I see there are only 32 bit mex compilations on your website.

Does anyone have a 64 bit mex compilation of this function?


@Willem-Jan: Sorry, I'm afraid that my advice was wrong. I do not have the Parallel Computing Toolbox, such that such that I cannot test it. Please check, if using the *same* seed in all PARFOR branchs lead to the *same* permutations.


@Willem-Jan: The problem for multiple seeding is to get independent seeds. Choosing a seed based on the current time is fair on a single thread, but likely to fail in a PARFOR loop. But if the seeds are guaranteed to be very different, the random numbers are independent.
I suggest to use 4*32 bit values with high entropy to seed Shuffle for each PARFOR thread. A cheap source for such seeds is (n is number of PARFOR threads) "S=rand(4, n)*2^32". Then seed the iteration with S(:, n).

Willem-Jan de Goeij

Hi, can this version be used for parallel computing (parfor loops)?

I've read that by setting the seed multiple times, let's say once in every parfor loop iteration, it is not guaranteed that the draws are independent (for example rand function).

That's why I'm using the mrg32k3a stream which supports substreams which are guaranteed to be independent.

Would it be difficult to create a version that supports substreams?


@Hessam: Use the index-method to shuffle the rows and keep the columns:
X = rand(3, 4800);
S = X(:, Shuffle(4800, 'index'));


I have the a data Matrix in the format:
the first row is T, the Second M and the 3rd one D.
[2.91000E+02 2.91000E+02 2.91000E+02;
7.47867E-43 7.47867E-43 7.47867E-43;
4.77566E-82 5.30311E-83 0.00000E+00]

It's just a part of my 3*4800 matrix.
Each column should be the same after shuiffelling. I mean just the place of the columns should be changed not the rows

Your help is mostly appreciated


Does anybody need a faster multi-threaded version? It is not worth to improve the program, if nobody needs it.


@Daniel: This and another bug will be fixed in the next update: Shuffle(>2^32, 'index') relies bad values with a tiny chance.

Daniel Shub

I think there is a bug with the seeding. If I start up MATLAB fresh

a = Shuffle(1:100);
Shuffle([], 'seed');
b = Shuffle(1:100);
all(a == b)

does not yield true.

Looking at the c code on line 174 kc is set equal to 7654321. I think this is part of the default seed when Shuffle first launches. On line 1062 kc is set equal to 698769069, which is part of the seed when the seed is reset.

Bruno Luong

Useful addition to FEX

James Tursa

My apologies. I knew about the mixed integer arithmetic bug with lcc 2.4 and had accounted for it in the code I sent you but neglected to point this out (sorry about that). You managed to discover this on your own. Good job!


@James: Now KISS works with LCC 2.4 also, after casting the 32 bit integers to 64 bit explicitely to avoid mixed integer arithmetics. Your code was very helpful to find this solution! Thanks again!

I've added a compiler flag to choose fast&tiny bias or slow&unbiased random integers.

James Tursa

P.S. Did you notice I changed your 698769069UL constant to 698769069ULL? That makes a difference.


@James: Thank you! I've tried KISS with LCC 2.4 and signed long long also, but the replies differ from other compilers for some seeds (seems to depend on the highest bit). Marsaglia has published a KISS generator with pure 32 bit processing also, but I decided not to adjust the C-source to an obviously buggy compiler. I assume, TMW has the power to include an improved version of LCC :-)

James Tursa

As you have discovered, the lcc C compiler that ships with MATLAB has bugs for 64-bit integer arithmetic, particularly the unsigned long long type. Fortunately in some cases you can work around this by using signed 64-bit types. Basically you use unspecified behavior and rely on 2's complement representation for the signed integer types and you also rely on signed integer overflow doing modulo arithmetic. This is the case for the PC, so the technique works on that platform (but is not guaranteed to be portable). I sent you a modified version of your Shuffle.c source code that has modified KISS() and KISS_d() routines that *seem* to work for lcc. Basically, instead of doing the unsigned 64-bit arithmetic I do it in 32-bit pieces as part of 64-bit signed words, and then use a union to get the result back to unsigned 64-bit type. HTH.

Oliver Woodford

Nice and fast. I welcome the ability to specify the output length.

Sebastien PARIS


@Derek: I cannot see any unexpected memory occupation by SHUFFLE.MEX. Example: x = rand(1, 1e7); for i=1:100; x = Shuffle(x); end. I assume, it is a problem in your ShuffleMexPerm32. Please send me this function by email and let us discuss either by email or in the NG.
You can create int32(1:n) efficiently with int32(1):int32(n).

SHUFFLE will be updated soon with new features.

Derek O'Connor

Your Shuffle(Mex) is faster of course, but the memory overhead still exists. Here is the memory profile using Windows Task Manager:

>clear all Tot Mem : 3.5GB
>n=10^9;tic;p = ShuffleMexPerm32(n);toc Tot Mem : 10.8GB
Elapsed time is 109.555864 seconds. Tot Mem : 7.2GB
>clear all Tot Mem : 3.5GB

Apart from no int32 arithmetic, Matlab's memory manager seems to be using more than expected. I don't see any solution to this in the short term. R2020a, perhaps?

Derek O'Connor


Currently SHUFFLE operates on the real part of an array only. If several users need to shuffle the imaginary part also, I could implement this also. So please mail me on demand - thanks. Jan

@Derek: While your M-code ShuffleF3264 suffers from using 32bit variables, Shuffle.c does not as far as I can see. So I'd simply suggest to use the MEX function. The same for the memory overhead.

@Derec and Rob: I agree that returning N values would be useful. But internally the Knuth-shuffle has to process the full array even for this case and it is not faster than creating the whole array at first and cutting it afterwards! SHUFFLE is designed to operate on large arrays. To get e.g. 100 randomly out of 1e7, a completely different algorithm should be used - in a different function!
It mightbe more important to implement a "derangement", which forces each element to have a different location after sorting.

I've found Ragaar Ashnod' Fisher-Yates-Shuffle as M-code today:

Derek O'Connor

There is no real speed difference between i = 2:n and Knuth's i = n:-1:2. I call this ShuffleF because it uses r = floor(rand*i)+1. I call your version ShuffleC because it uses r = ceil(rand*i).

Memory "Overhead"

You are right about p =int32(1:n) creating a temporary DOUBLE. However, I noticed that p = zeros(n,1,'int32') creates no temporary double, but now I have to initialize p=I with for i = 1:n, p(i)=i, end; This is a bit slow but allowed me to create an int32(p) of length n = 3.5x10^9 (45 secs) of size 14GB, without disk swapping. Wonderful!

I was about to ShuffleF this large p vector but caution intruded and I tested it with n = 10^8, with p using just 0.4GB. It took 244 secs, about 10 times what I expected. Here's the reason: int32 'arithmetic' is slow, as this function shows.

> function p = ShuffleF3264(p,bits);
> % Shuffle a vector p using Knuth's Algorithm P,
> % Section 3.4.2, TAOCP, Vol 2, 2nd Ed.
> % This shows the difference between Matlab's
> % normal 64-bit double integers and 32-bit int32 integers.
> %
> % Derek O'Connor, 30 Mar 2010
> %
> n=length(p);
> if bits == 32 % This type propagates below and
> n = int32(n); % causes a big slowdown
> end;
> for i = n:-1:2
> r = floor(rand*i)+1; % random integer between 1 and i
> t = p(r);
> p(r) = p(i); % Swap(p(r),p(i))
> p(i) = t;
> end;
> return; % ShuffleF3264

>> n = 10^8; p = 1:n; % 'normal' Matlab integers (doubles)
>> tic; p = ShuffleF3264(p,64);toc
>> Elapsed time is 30.057698 seconds.
>> tic; p = ShuffleF3264(p,32);toc
>> Elapsed time is 243.815251 seconds.

So, I can generate huge permutations, but I can't do anything with them, except wait for a 1000GHz, liquid-helium-cooled PC.

As Henry Ford said : You can have any type you want, as long as it's a double.

Derek O'Connor

Rob Campbell

The Mex-based shuffle is much faster than randperm although the m-based shuffle is slower in some cases. I can see myself using the Mex-based shuffle in some of my work.

Author was very helpful in getting the Mex code to compile on my Linux box.

I agree with Oliver, it would be useful to have the ability to return only N values.

Oliver Woodford

A parameter allowing only N values to be returned would be good, as this would speed up many sampling-based methods, e.g. RANSAC, which require only a subset of the data each iteration. Also useful would be to work on matrices and return the rows shuffled, and even potentially to name the dimension to shuffle along.

Derek O'Connor

Dear Jan,

You are right, the (Knuth) loop should be << for i = n:-1:2 >>. I was trying to follow Knuth exactly and have got it wrong, twice. I'm not sure why Knuth uses a backward loop, but I seem to remember (1970s) that decrementing and testing a loop counter was faster than incrementing it. I don't think it makes any difference, but I'll check.

On memory overhead, my table was mangled and is hard to read, so let me give one example. For n = 5x10^8, p = int32(1:n) occupies 4x5x10^8 = 2GB. When I ran << p = int32(randperm(n))>> I used Windows Task Manager (WTM) to monitor cpu and memory use and it showed memory use climbing from 2.5GB up to 13GB. Memory overhead = 13-2-2.5 = 8.5GB. When I ran << p = ShufflePermF32(n) >> WTM shows memory use going from 2.5GB to 7.3GB Memory overhead = 7.3-2-2.5 = 2.5GB.

What this means is that with 16GB RAM the largest << p = int32(randperm(n))>> is 2GB or n = 5x10^8, while the largest << p = ShufflePermF32(n) >> is 4GB or n = 10^9. In this case WTM showed memory use rising from 2.5 to 12 for about 10secs, and dropping to 5.5GB. Where or why is this memory overhead used?

Derek O'Connor


Dear Derek: Thanks for the comments! How do you measure "memory overhead"? ShufflePerm32F creates a temporary [1 x n] object of type DOUBLE in: p = int32(1:n). Does this influence the "memory overhead"? I assume that the FOR loop does *not* create another 1:n vector explicitely.

Do you find statistical difference between looping from n=2:n or n:-1:2 (n==1 can be omitted)?

Derek O'Connor


The loop % for i = 1:n % should be % for i = n:-1:1 %

Derek O'Connor

Derek O'Connor

This is a great idea for generating random permutations because, as you say, "While RANDPERM needs 2*LENGTH(X)*8 bytes as temporary memory, SHUFFLE needs just a fixed small number of bytes."

Here are some timing tests for my machine (Trp = time for randperm, Trs = time for Shuffle)

times = TestRandPShufP32('mex')
% n Trp Trs Trp/Trs
% 1e+005 1.68e-002 1.48e-003 1.13e+001
% 1e+006 2.06e-001 1.86e-002 1.10e+001
% 1e+007 2.40e+000 5.01e-001 4.78e+000
% 1e+008 2.71e+001 6.44e+000 4.20e+000
% times = TestRandPShufP32('m')
% n Trp Trs Trp/Trs
% 1e+005 1.64e-002 1.48e-002 1.11e+000
% 1e+006 2.06e-001 1.58e-001 1.30e+000
% 1e+007 2.40e+000 2.40e+000 1.00e+000
% 1e+008 2.73e+001 2.82e+001 9.68e-001
% MATLAB Version (R2008a) 64-bit on a Dell Precision
% 690 2 x Quadcore Xeon E5345 2.33GHz, Windows7 64-bit, MSVC 2008.

The mex version of your Shuffle is much faster than randperm. The non-mexed version is about the same as randperm.

Memory use is lower for Shuttle but there is still a lot of memory 'overhead'.

% p = int32(randperm(n); or p = ShufflePerm32F(n);

% n 10^8 5x10^8 10^9
% size(p) GB 0.4 2.0 4.0
% randperm GB 4.5 13.0 >16
% ShufflePerm32F GB 3.0 7.3 12-->5.5

% ----------------------------------------------------------
% function p = ShufflePerm32F(n)
% % Generate a random permutation using
% % Knuth's Algorithm P, Section 3.4.2
% % TAOCP, Vol 2, 2nd Ed., 1981.
% % Derek O'Connor, 28 Mar 2010
% p = int32(1:n); % n <= 2^32-1 = 2,147,483,647 = 2G
% % 4Bytes x 2G = 8GB
% for i = 1:n
% r = floor(rand*i)+1; % random integer between 1 and i
% t = p(r);
% p(r) = p(i); % Swap(p(r),p(i))
% p(i) = t;
% end
% return; % ShufflePerm32F
% ---------------------------------------------------------

Each test started from a base of about 2GB with a 'clear all' between tests.

Any ideas why there is so much memory overhead?

Derek O'Connor.

MATLAB Release Compatibility
Created with R2009a
Compatible with any release
Platform Compatibility
Windows macOS Linux

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!