From: <HIDDEN>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Generation of random number from Define probability
Date: Thu, 19 Feb 2009 08:52:02 +0000 (UTC)
Organization: The MathWorks, Inc.
Lines: 25
Message-ID: <gnj6ji$9i6$>
References: <gnepo2$18j$> <gnf7ru$fg5$> <gnivii$de5$>
Reply-To: <HIDDEN>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: 1235033522 9798 (19 Feb 2009 08:52:02 GMT)
NNTP-Posting-Date: Thu, 19 Feb 2009 08:52:02 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1187260
Xref: comp.soft-sys.matlab:519354

"Matt Fig" <> wrote in message <gnivii$de5$>...
> "Roger Stafford" <> wrote in message 
> Please forgive me if this is totally wrong, (I admit I hated probability and statistics back in college) but why wouldn't something simple like this work:
> p = [.679 .179 .129 .013]'; p = p/sum(p); % They must have a sum of one
> n = 20000000;
> [ps,idx] = sort(p);
> ps = cumsum(ps);
> x = rand(n,1);
> E = x;
> E(x<=ps(1)) = idx(1);
> for ii = 2:length(p)  
>    E(x>=ps(ii-1) & x<=ps(ii) ) = idx(ii); 
> end
> It seems pretty fast for large n, and I don't see the reason why this wouldn't work.

  It looks as though it would work all right, Matt, but I don't see the point in doing the sort.  Since all the elements of p must be non-negative, the cumsum necessarily produces a non-descending sequence in 'ps' anyway, so your inequality checks would still work.  It also wouldn't then be necessary to index through 'idx'; you could insert the ii's directly into E.
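  Concretely, the simplified version (a sketch of what I mean, not tested against your timings) might read:

```matlab
p = [.679 .179 .129 .013].'; p = p/sum(p); % normalize so the sum is one
n = 20000000;
ps = cumsum(p);          % non-descending automatically, since all p >= 0
x = rand(n,1);
E = ones(n,1);           % default everything to category 1
for ii = 2:length(p)
   E(x > ps(ii-1)) = ii; % overwrite for values past each boundary
end
```

  Each pass simply overwrites the entries of E that lie beyond the next cumulative boundary, so no sort and no 'idx' bookkeeping are needed.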

  This algorithm, like Jos' 'randp', is of order n*m, where m is the number of elements in p and n the length of x.  For small m, like the 4 in this example, one of these would certainly be the best way to go.  However, as I said earlier, when m is large (which is what I had in mind with the for-loop stuff) the order n*log(m) binary search should eventually win out, even if carried out iteratively in a for-loop.  Try timing both of them with m = 2^16 along with a correspondingly large n, and see if the binary search method doesn't come in first.  The binary search would be taking a lazy 16 for-loop trips against a frantic 65,536 trips for the serial search through ps.  In both cases vectors of length n are processed on each trip, though admittedly the binary-search trips are more complicated.
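  The kind of vectorized binary search I have in mind would look something like this (a sketch with variable names of my own choosing; each trip halves the bracket [lo,hi] for all n values of x at once):

```matlab
m = 2^16;
p = rand(m,1); p = p/sum(p);        % some large set of probabilities
ps = cumsum(p);
n = 1e6;
x = rand(n,1);
lo = ones(n,1); hi = m*ones(n,1);   % bracketing indices for every x
for k = 1:ceil(log2(m))             % only ~16 trips for m = 2^16
   mid = floor((lo+hi)/2);
   up = x > ps(mid);                % answer lies above mid for these
   lo(up) = mid(up)+1;
   hi(~up) = mid(~up);
end
E = lo;                             % index of the first ps(i) >= x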

  When trying to simulate a probability density function f(x) accurately and at high resolution, one could conceivably use just such a binary search with a very high value of m, rather than attempting to compute analytically the inverse of the desired cumulative distribution function for use with 'rand'.  In that sense the binary search method can be regarded as a kind of fast table look-up, as opposed to computing values of the inverse cdf directly.
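  As an illustration of the table look-up idea (the triangular density here is just an example of my own, and 'histc' is doing the search into the cdf table):

```matlab
m = 2^16;
xi = linspace(0,1,m+1).';           % fine grid on [0,1]
f = 2*xi(1:end-1);                  % triangular density f(x) = 2x, tabulated
p = f/sum(f);
ps = [0; cumsum(p)]; ps(end) = 1;   % cdf table, used as bin edges
u = rand(1e6,1);
[cnt,bin] = histc(u, ps);           % look up which cdf interval each u falls in
samples = xi(bin);                  % inverse-cdf sample = left edge of that bin
```

  The resulting 'samples' approximate draws from f to within the grid resolution 1/m, with no analytic inverse cdf needed.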

Roger Stafford