MATLAB Answers

0

How to efficiently generate a random integer within a range from an arbitrary probability distribution

Asked by Paolo Binetti on 7 Jan 2017
Latest activity Commented on by the cyclist
on 8 Jan 2017
I need to generate a random integer within a range from an arbitrary probability distribution, within a loop of 100000 iterations. My implementation works, but I am not sure it is mathematically clean, and it takes forever:
pdf = [ 0.9 0.3 0.003 0.1 0.07 0.0005 0.003 0.15 0.009 0.08 ]; % discrete prob distrib function
cdf = cumsum(pdf); % cumulative distribution function
cdf = cdf / max(cdf); cdf(1) = 0; % normalization
index = ceil(interp1(cdf, [1:numel(pdf)], rand(1)))
Notice that the pdf above is just an example: my actual case is a vector of about 500 numbers.
Here is a different solution, which seems mathematically cleaner, but does not work for my overall problem, and is just as slow as above:
pdf = [ 0.9 0.3 0.003 0.1 0.07 0.0005 0.003 0.15 0.009 0.08 ]; % discrete prob distrib function
cdf = cumsum(pdf); % cumulative distribution function
cdf = cdf - min(cdf); cdf = cdf / max(cdf); % normalization
index = round(interp1(cdf, [1:numel(pdf)], rand(1)))
Is there a more efficient/correct way to do this?

  0 Comments

Sign in to comment.

2 Answers

Answer by the cyclist
on 8 Jan 2017
Edited by the cyclist
on 8 Jan 2017
 Accepted Answer

I think that
index = sum(rand()>cdf)+1;
will be much faster than using interp1 as you do, and will give the same result.

  4 Comments

Show 1 older comment
Your solution seems to work perfectly for my problem. It is about 7 times faster, which is a major improvement.
I have tried "randsample" from the Octave statistics package (should be more or less equivalent to the Matlab function) with the following syntax:
index = randsample(numel(pdf),1,true,pdf);
but it is even slower than my initial solution. Is this normal?
And you are totally right about the wrong way I used to to start cdf from zero. I had in mind your way, but did not use it because I thought it would imply never selecting the end index ... but that's not a problem with your implementation, right?
I double-checked by generating 100,000 samples, and the results seem right.
Regarding speed ...
I get roughly equivalent results between my solution and randsample. Note that both of these solutions can be vectorized:
index = sum(rand(N,1)>cdf,2)';
and
index = randsample(1:numel(pdf),N,true,pdf);
These will both generate N = 100,000 samples in a few milliseconds.
I did not try to vectorize your solution, but as it currently stands, it took 5 seconds to generate that many.

Sign in to comment.


Answer by the cyclist
on 8 Jan 2017

Do you have the Statistics and Machine Learning Toolbox? If so, you can do this with the randsample command.

  1 Comment

Thank you. No, I do not have this toolbox but I'll see if I can get it.

Sign in to comment.