Generating Random Numbers on a GPU

This example shows how to switch between the different random number generators that are supported on the GPU.

Random numbers form a key part of many simulation or estimation algorithms. Typically, these numbers are generated using the functions rand, randi, and randn. Parallel Computing Toolbox™ provides three corresponding functions for generating random numbers directly on a GPU: rand, randi, and randn. These functions can use one of several different number generation algorithms.

d = gpuDevice;
fprintf("This example is run on a " + d.Name + " GPU.")

This example is run on a GeForce GTX 1080 GPU.

Discovering the GPU random number generators

The function parallel.gpu.RandStream.list provides a short description of the available generators.

parallel.gpu.RandStream.list

 
The following random number generator algorithms are available:
 
MRG32K3A:         Combined multiple recursive generator (supports parallel streams)
Philox4x32_10:    Philox 4x32 generator with 10 rounds (supports parallel streams)
Threefry4x64_20:  Threefry 4x64 generator with 20 rounds (supports parallel streams)

Each of these generators has been designed with parallel use in mind, providing multiple independent streams of random numbers. However, they each have some advantages and disadvantages:

CombRecursive (also known as MRG32k3a): This generator was introduced in 1999 and has been widely tested and used.
Philox (also known as Philox4x32_10): New generator introduced in 2011, specifically designed for high performance in highly parallel systems such as GPUs.
Threefry (also known as Threefry4x64_20): New generator introduced in 2011 based on the existing cryptographic ThreeFish algorithm, which is widely tested and used. This generator was designed to give good performance in highly parallel systems such as GPUs. This is the default generator for GPU calculations.

The three generators available on the GPU are also available for use on the CPU in MATLAB®. The MATLAB generators have the same name and produce identical results given the same initial state. This is useful when you want to produce the same sets of random numbers on both the GPU and the CPU. For more information, see Random Number Streams on a GPU.

All of these generators pass the standard TestU01 test suite [1].

Changing the default random number generator

The function gpurng can store and reset the generator state for the GPU. You can also use gpurng to switch between the different generators that are provided. Before changing the generator, store the existing state so that it can be restored at the end of these tests.

oldState = gpurng;

gpurng(0, "Philox4x32-10");
disp(gpurng)

     Type: 'philox'
     Seed: 0
    State: [7×1 uint32]

Generating uniformly distributed random numbers

Uniformly distributed random numbers are generated on the GPU using either rand, or randi. In performance terms, these two functions behave very similarly and only rand is measured here. gputimeit is used to measure the performance to ensure accurate timing results, automatically calling the function many times and correctly dealing with synchronization and other timing issues.

To compare the performance of the different generators, use rand to generate a large number of random numbers on the GPU using each generator. In the following code, rand generates $1 0^{7}$ random numbers and is called 100 times for each generator. Each run is timed using gputimeit. Generating large samples of random numbers can take several minutes. The results indicate a performance comparison between the three random number generators available on the GPU.

generators = ["Philox","Threefry","CombRecursive"];
gputimesU = nan(100,3);
for g=1:numel(generators)
    % Set the generator
    gpurng(0, generators{g});
    % Perform calculation 100 times, timing the generator 
    for rep=1:100
        gputimesU(rep,g) = gputimeit(@() rand(10000,1000,"gpuArray"));
    end
end

% Plot the results
figure
hold on
histogram(gputimesU(:,1),"BinWidth",1e-4);
histogram(gputimesU(:,2),"BinWidth",1e-4);
histogram(gputimesU(:,3),"BinWidth",1e-4)

legend(generators)
xlabel("Time to generate 10^7 random numbers (sec)")
ylabel("Frequency")
title("Generating samples in U(0,1) using " + d.Name)
hold off

The newer generators Threefry and Philox have similar perfomance. Both are faster than CombRecursive.

Generating normally distributed random numbers

Many simulations rely on perturbations sampled from a normal distribution. Similar to the uniform test, use randn to compare the performance of the three generators when generating normally distributed random numbers. Generating large samples of random numbers can take several minutes.

generators = ["Philox","Threefry","CombRecursive"];
gputimesN = nan(100,3);
for g=1:numel(generators)
    % Set the generator
    gpurng(0, generators{g});
    % Perform calculation 100 times, timing the generator 
    for rep=1:100
        gputimesN(rep,g) = gputimeit(@() randn(10000,1000,"gpuArray"));
    end
end

% Plot the results
figure
hold on
histogram(gputimesN(:,1),"BinWidth",1e-4);
histogram(gputimesN(:,2),"BinWidth",1e-4)
histogram(gputimesN(:,3),"BinWidth",1e-4)
legend(generators)
xlabel("Time to generate 10^7 random numbers (sec)")
ylabel("Frequency")
title("Generating samples in N(0,1) using " + d.Name)
hold off

Once again, the results indicate that the Threefry and Philox generators perform similarly and are both notably faster than CombRecursive. The extra work required to produce normally distributed values reduces the rate at which values are produced by each of the generators.

Before finishing, restore the original generator state.

gpurng(oldState);

Conclusion

In this example, the three GPU random number generators are compared. The exact results vary depending on your GPU and computing platform. Each generator provides some advantages (+) and has some caveats (-).

Threefry

(+) Fast
(+) Based on well-known and well-tested Threefish algorithm
(-) Relatively new in real-world usage

Philox

(+) Fast
(-) Relatively new in real-world usage

CombRecursive

(+) Long track record in real-world usage
(-) Slowest

References

[1] L'Ecuyer, P., and R. Simard. "TestU01: A C library for empirical testing of random number generators." ACM Transactions on Mathematical Software. Vol. 33, No. 4, 2007, article 22.