Random permutation of array elements, CMex: much faster than RANDPERM
Shuffle  Random permutation of array elements
This function is equivalent to X(RANDPERM(LENGTH(X)), but 50% to 85% faster. It uses D.E. Knuth's shuffle algorithm (also called FisherYates) and the cute KISS random number generator (G. Marsaglia). While RANDPERM needs 2*LENGTH(X)*8 bytes as temporary memory, SHUFFLE needs just a fixed small number of bytes.
1. Inplace shuffling: Y = Shuffle(X, Dim)
INPUT:
X: DOUBLE, SINGLE, CHAR, LOGICAL, (U)INT64/32/16/8 array.
Dim: Dimension to operate on. Optional, default: 1st nonsingleton dimension.
OUTPUT:
Y: Array of same type and size as X with shuffled elements.
2. Create a shuffle index: Index = Shuffle(N, 'index', NOut)
This is equivalent to Matlab's RANDPERM, but much faster, if N is large and NOut is small.
INPUT:
N: Integer number.
NOut: The number of output elements. Optional, default: N.
OUTPUT:
Index: [1:NOut] elements of shuffled [1:N] vector in the smallest possible integer type.
3. Derangement index:
Index = Shuffle(N, 'derange', NOut)
Equivalent to the index method, but all Index[i] ~= i. A rejection method is used: Create an index vector until a derangement is gained.
EXAMPLES:
R = Shuffle(1:8) % [8, 1, 2, 6, 4, 3, 5, 7]
R = Shuffle('abcdefg') % 'efbadcg'
R = Shuffle([1:4; 5:8], 2) % [3, 2, 1, 4; 6, 8, 7, 5]
I = Shuffle(8, 'index'); % UINT8([1, 5, 7, 6, 2, 3, 4, 8])
Choose 10 different rows from a 1000 x 100 matrix:
X = rand(1000, 100); Y = X(Shuffle(1000, 'index', 10), :);
Operate on cells or complex arrays:
C = {9, 's', 1:5}; SC = C(Shuffle(numel(C), 'index'));
M = rand(3) + i * rand(3); SM = M(:, Shuffle(size(C, 2), 'index'))
NOTES: There are several other shuffle functions in the FEX. Some use Knuth's method also, some call RANDPERM. This implementation is faster due to calling a compiled MEX file and it has a smaller memory footprint. The KISS random numbers are much better than the RAND() of the Cstandard libs.
Run the unittest TestShuffle to test validity and speed (see screenshot).
Tested: Matlab 6.5, 7.7, 7.8, 32bit, WinXP,
Compiler: LCC 2.4/3.8, BCC 5.5, Open Watcom 1.8, MSVC 2008.
Compatibility to 64 bit, Linux and Mac is assumed.
Precompiled Mex: http://www.nsimon.de/mex
1.4  Some percent faster, bugfix for 64bit index vectors, derangements  Thanks to Derek O'Connor. 

1.3  Bugfix for [1 x 4] seeds.


1.2  Can be compiled with LCC 2.4 (shipped with Matlab) now. Thanks to James Tursa for the helpful ideas! 

1.1  Create index vector with optionally reduced length, dimension to operate on can be specified. *New interface* ! 
Jan Simon (view profile)
@Stephen: The statement about X(RANDPERM(LENGTH(X)) concerns vectors only. For arrays this function Shuffle() operates on a specific dimension and there is no comparable oneline command containing RANDPERM. So for arrays Shuffle() does *not* perform X(randperm(numel(X))) .
Stephen Cobeldick (view profile)
The description is incorrect, because length does not do what the author seems to think it does. For arrays length is not numel!. The description states that it does this (using length):
>> X = randi(9,3)
X =
9 4 3
6 3 7
4 6 1
>> X(randperm(length(X)))
ans =
4 6 9
Whereas the function actually returns _all_ elements, as per numel:
>> X(randperm(numel(X)))
ans =
1 7 6 3 4 6 9 4 3
Ephedyn (view profile)
WillemJan de Goeij (view profile)
Dear Jan,
I've tested calling your function from inside a parfor loop. The results show that repetitions occur.
>> matlabpool local
Starting matlabpool using the 'local' configuration ... connected to 3 labs.
>> shuffleparfor
output =
8 1 3 7 5 2 6 4
2 3 8 7 1 5 6 4
8 1 3 7 5 2 6 4
2 3 8 7 1 5 6 4
8 1 3 7 5 2 6 4
2 3 8 7 1 5 6 4
4 8 7 3 6 2 5 1
4 8 7 3 6 2 5 1
4 8 7 3 6 2 5 1
output = zeros(9,8);
Shuffle(1234567890, 'seed');
parfor ii=1:9
output(ii, :) = Shuffle(8, 'index');
end
WillemJan de Goeij (view profile)
Hi Jan,
In reply to your first comment from 31/07:
Setting the seed to randi([0, 2^321], 1, 4) in every parfor loop would be advisible, I guess?
On my computer 2^321+d becomes zero when casted to uint32_T, with d >= 0.999999762.
I'm setting seeds because I also want to be able to reproduce the results.
WillemJan de Goeij (view profile)
I use parfor loops on my 64 bit computer with Matlab 64 bit.
I see there are only 32 bit mex compilations on your website.
Does anyone have a 64 bit mex compilation of this function?
Jan Simon (view profile)
@WillemJan: Sorry, I'm afraid that my advice was wrong. I do not have the Parallel Computing Toolbox, such that such that I cannot test it. Please check, if using the *same* seed in all PARFOR branchs lead to the *same* permutations.
Jan Simon (view profile)
@WillemJan: The problem for multiple seeding is to get independent seeds. Choosing a seed based on the current time is fair on a single thread, but likely to fail in a PARFOR loop. But if the seeds are guaranteed to be very different, the random numbers are independent.
I suggest to use 4*32 bit values with high entropy to seed Shuffle for each PARFOR thread. A cheap source for such seeds is (n is number of PARFOR threads) "S=rand(4, n)*2^32". Then seed the n.th iteration with S(:, n).
WillemJan de Goeij (view profile)
Hi, can this version be used for parallel computing (parfor loops)?
I've read that by setting the seed multiple times, let's say once in every parfor loop iteration, it is not guaranteed that the draws are independent (for example rand function).
That's why I'm using the mrg32k3a stream which supports substreams which are guaranteed to be independent.
Would it be difficult to create a version that supports substreams?
Jan Simon (view profile)
@Hessam: Use the indexmethod to shuffle the rows and keep the columns:
X = rand(3, 4800);
S = X(:, Shuffle(4800, 'index'));
Hessam (view profile)
I have the a data Matrix in the format:
the first row is T, the Second M and the 3rd one D.
[2.91000E+02 2.91000E+02 2.91000E+02;
7.47867E43 7.47867E43 7.47867E43;
4.77566E82 5.30311E83 0.00000E+00]
It's just a part of my 3*4800 matrix.
Each column should be the same after shuiffelling. I mean just the place of the columns should be changed not the rows
Your help is mostly appreciated
Jan Simon (view profile)
Does anybody need a faster multithreaded version? It is not worth to improve the program, if nobody needs it.
Jan Simon (view profile)
@Daniel: This and another bug will be fixed in the next update: Shuffle(>2^32, 'index') relies bad values with a tiny chance.
Daniel Shub (view profile)
I think there is a bug with the seeding. If I start up MATLAB fresh
a = Shuffle(1:100);
Shuffle([], 'seed');
b = Shuffle(1:100);
all(a == b)
does not yield true.
Looking at the c code on line 174 kc is set equal to 7654321. I think this is part of the default seed when Shuffle first launches. On line 1062 kc is set equal to 698769069, which is part of the seed when the seed is reset.
Bruno Luong (view profile)
Useful addition to FEX
James Tursa (view profile)
My apologies. I knew about the mixed integer arithmetic bug with lcc 2.4 and had accounted for it in the code I sent you but neglected to point this out (sorry about that). You managed to discover this on your own. Good job!
Jan Simon (view profile)
@James: Now KISS works with LCC 2.4 also, after casting the 32 bit integers to 64 bit explicitely to avoid mixed integer arithmetics. Your code was very helpful to find this solution! Thanks again!
I've added a compiler flag to choose fast&tiny bias or slow&unbiased random integers.
James Tursa (view profile)
P.S. Did you notice I changed your 698769069UL constant to 698769069ULL? That makes a difference.
Jan Simon (view profile)
@James: Thank you! I've tried KISS with LCC 2.4 and signed long long also, but the replies differ from other compilers for some seeds (seems to depend on the highest bit). Marsaglia has published a KISS generator with pure 32 bit processing also, but I decided not to adjust the Csource to an obviously buggy compiler. I assume, TMW has the power to include an improved version of LCC :)
James Tursa (view profile)
As you have discovered, the lcc C compiler that ships with MATLAB has bugs for 64bit integer arithmetic, particularly the unsigned long long type. Fortunately in some cases you can work around this by using signed 64bit types. Basically you use unspecified behavior and rely on 2's complement representation for the signed integer types and you also rely on signed integer overflow doing modulo arithmetic. This is the case for the PC, so the technique works on that platform (but is not guaranteed to be portable). I sent you a modified version of your Shuffle.c source code that has modified KISS() and KISS_d() routines that *seem* to work for lcc. Basically, instead of doing the unsigned 64bit arithmetic I do it in 32bit pieces as part of 64bit signed words, and then use a union to get the result back to unsigned 64bit type. HTH.
Oliver Woodford (view profile)
Nice and fast. I welcome the ability to specify the output length.
Sebastien PARIS (view profile)
Jan Simon (view profile)
@Derek: I cannot see any unexpected memory occupation by SHUFFLE.MEX. Example: x = rand(1, 1e7); for i=1:100; x = Shuffle(x); end. I assume, it is a problem in your ShuffleMexPerm32. Please send me this function by email and let us discuss either by email or in the NG.
You can create int32(1:n) efficiently with int32(1):int32(n).
SHUFFLE will be updated soon with new features.
Derek O'Connor (view profile)
Jan,
Your Shuffle(Mex) is faster of course, but the memory overhead still exists. Here is the memory profile using Windows Task Manager:
>clear all Tot Mem : 3.5GB
>n=10^9;tic;p = ShuffleMexPerm32(n);toc Tot Mem : 10.8GB
Elapsed time is 109.555864 seconds. Tot Mem : 7.2GB
>clear all Tot Mem : 3.5GB
Apart from no int32 arithmetic, Matlab's memory manager seems to be using more than expected. I don't see any solution to this in the short term. R2020a, perhaps?
Derek O'Connor
Jan Simon (view profile)
Currently SHUFFLE operates on the real part of an array only. If several users need to shuffle the imaginary part also, I could implement this also. So please mail me on demand  thanks. Jan
@Derek: While your Mcode ShuffleF3264 suffers from using 32bit variables, Shuffle.c does not as far as I can see. So I'd simply suggest to use the MEX function. The same for the memory overhead.
@Derec and Rob: I agree that returning N values would be useful. But internally the Knuthshuffle has to process the full array even for this case and it is not faster than creating the whole array at first and cutting it afterwards! SHUFFLE is designed to operate on large arrays. To get e.g. 100 randomly out of 1e7, a completely different algorithm should be used  in a different function!
It mightbe more important to implement a "derangement", which forces each element to have a different location after sorting.
I've found Ragaar Ashnod' FisherYatesShuffle as Mcode today:
http://www.mathworks.com/matlabcentral/fileexchange/25564
and Jos' RANDPERMFULL:
http://www.mathworks.com/matlabcentral/fileexchange/23257
Derek O'Connor (view profile)
Jan,
There is no real speed difference between i = 2:n and Knuth's i = n:1:2. I call this ShuffleF because it uses r = floor(rand*i)+1. I call your version ShuffleC because it uses r = ceil(rand*i).
Memory "Overhead"
You are right about p =int32(1:n) creating a temporary DOUBLE. However, I noticed that p = zeros(n,1,'int32') creates no temporary double, but now I have to initialize p=I with for i = 1:n, p(i)=i, end; This is a bit slow but allowed me to create an int32(p) of length n = 3.5x10^9 (45 secs) of size 14GB, without disk swapping. Wonderful!
I was about to ShuffleF this large p vector but caution intruded and I tested it with n = 10^8, with p using just 0.4GB. It took 244 secs, about 10 times what I expected. Here's the reason: int32 'arithmetic' is slow, as this function shows.
> function p = ShuffleF3264(p,bits);
> % Shuffle a vector p using Knuth's Algorithm P,
> % Section 3.4.2, TAOCP, Vol 2, 2nd Ed.
> % This shows the difference between Matlab's
> % normal 64bit double integers and 32bit int32 integers.
> %
> % Derek O'Connor, 30 Mar 2010
> %
> n=length(p);
> if bits == 32 % This type propagates below and
> n = int32(n); % causes a big slowdown
> end;
>
> for i = n:1:2
> r = floor(rand*i)+1; % random integer between 1 and i
> t = p(r);
> p(r) = p(i); % Swap(p(r),p(i))
> p(i) = t;
> end;
> return; % ShuffleF3264
>> n = 10^8; p = 1:n; % 'normal' Matlab integers (doubles)
>> tic; p = ShuffleF3264(p,64);toc
>> Elapsed time is 30.057698 seconds.
>> tic; p = ShuffleF3264(p,32);toc
>> Elapsed time is 243.815251 seconds.
So, I can generate huge permutations, but I can't do anything with them, except wait for a 1000GHz, liquidheliumcooled PC.
As Henry Ford said : You can have any type you want, as long as it's a double.
Derek O'Connor
Rob Campbell (view profile)
The Mexbased shuffle is much faster than randperm although the mbased shuffle is slower in some cases. I can see myself using the Mexbased shuffle in some of my work.
Author was very helpful in getting the Mex code to compile on my Linux box.
I agree with Oliver, it would be useful to have the ability to return only N values.
Oliver Woodford (view profile)
A parameter allowing only N values to be returned would be good, as this would speed up many samplingbased methods, e.g. RANSAC, which require only a subset of the data each iteration. Also useful would be to work on matrices and return the rows shuffled, and even potentially to name the dimension to shuffle along.
Derek O'Connor (view profile)
Dear Jan,
You are right, the (Knuth) loop should be << for i = n:1:2 >>. I was trying to follow Knuth exactly and have got it wrong, twice. I'm not sure why Knuth uses a backward loop, but I seem to remember (1970s) that decrementing and testing a loop counter was faster than incrementing it. I don't think it makes any difference, but I'll check.
On memory overhead, my table was mangled and is hard to read, so let me give one example. For n = 5x10^8, p = int32(1:n) occupies 4x5x10^8 = 2GB. When I ran << p = int32(randperm(n))>> I used Windows Task Manager (WTM) to monitor cpu and memory use and it showed memory use climbing from 2.5GB up to 13GB. Memory overhead = 1322.5 = 8.5GB. When I ran << p = ShufflePermF32(n) >> WTM shows memory use going from 2.5GB to 7.3GB Memory overhead = 7.322.5 = 2.5GB.
What this means is that with 16GB RAM the largest << p = int32(randperm(n))>> is 2GB or n = 5x10^8, while the largest << p = ShufflePermF32(n) >> is 4GB or n = 10^9. In this case WTM showed memory use rising from 2.5 to 12 for about 10secs, and dropping to 5.5GB. Where or why is this memory overhead used?
Derek O'Connor
Jan Simon (view profile)
Dear Derek: Thanks for the comments! How do you measure "memory overhead"? ShufflePerm32F creates a temporary [1 x n] object of type DOUBLE in: p = int32(1:n). Does this influence the "memory overhead"? I assume that the FOR loop does *not* create another 1:n vector explicitely.
Do you find statistical difference between looping from n=2:n or n:1:2 (n==1 can be omitted)?
Derek O'Connor (view profile)
CORRECTION
The loop % for i = 1:n % should be % for i = n:1:1 %
Derek O'Connor
Derek O'Connor (view profile)
This is a great idea for generating random permutations because, as you say, "While RANDPERM needs 2*LENGTH(X)*8 bytes as temporary memory, SHUFFLE needs just a fixed small number of bytes."
Here are some timing tests for my machine (Trp = time for randperm, Trs = time for Shuffle)
times = TestRandPShufP32('mex')
% n Trp Trs Trp/Trs
% 1e+005 1.68e002 1.48e003 1.13e+001
% 1e+006 2.06e001 1.86e002 1.10e+001
% 1e+007 2.40e+000 5.01e001 4.78e+000
% 1e+008 2.71e+001 6.44e+000 4.20e+000
%
% times = TestRandPShufP32('m')
% n Trp Trs Trp/Trs
% 1e+005 1.64e002 1.48e002 1.11e+000
% 1e+006 2.06e001 1.58e001 1.30e+000
% 1e+007 2.40e+000 2.40e+000 1.00e+000
% 1e+008 2.73e+001 2.82e+001 9.68e001
%
% MATLAB Version 7.6.0.2109 (R2008a) 64bit on a Dell Precision
% 690 2 x Quadcore Xeon E5345 2.33GHz, Windows7 64bit, MSVC 2008.
The mex version of your Shuffle is much faster than randperm. The nonmexed version is about the same as randperm.
Memory use is lower for Shuttle but there is still a lot of memory 'overhead'.
% p = int32(randperm(n); or p = ShufflePerm32F(n);
% n 10^8 5x10^8 10^9
%
% size(p) GB 0.4 2.0 4.0
% randperm GB 4.5 13.0 >16
% ShufflePerm32F GB 3.0 7.3 12>5.5
% 
% function p = ShufflePerm32F(n)
% % Generate a random permutation using
% % Knuth's Algorithm P, Section 3.4.2
% % TAOCP, Vol 2, 2nd Ed., 1981.
% % Derek O'Connor, 28 Mar 2010
%
% p = int32(1:n); % n <= 2^321 = 2,147,483,647 = 2G
% % 4Bytes x 2G = 8GB
% for i = 1:n
% r = floor(rand*i)+1; % random integer between 1 and i
% t = p(r);
% p(r) = p(i); % Swap(p(r),p(i))
% p(i) = t;
% end
%
% return; % ShufflePerm32F
% 
Each test started from a base of about 2GB with a 'clear all' between tests.
Any ideas why there is so much memory overhead?
Derek O'Connor.