Discover MakerZone

MATLAB and Simulink resources for Arduino, LEGO, and Raspberry Pi

Learn more

Discover what MATLAB® can do for your career.

Opportunities for recent engineering grads.

Apply Today

Thread Subject:
What is the probability that random integers sum to a given value?

Subject: What is the probability that random integers sum to a given value?

From: David Heslop

Date: 30 Jun, 2009 16:21:02

Message: 1 of 24

Can anybody suggest an efficient method to approach this problem?

Choose W unique random integers drawn from 1 to N, what is the probability that their sum will equal a given value? Typically both N and W are large, around 500 and 50 respectively. Currently I'm approaching this problem using a simple Monte Carlo approach to determine probability distributions for a range of sum values for given N and W, but the accuracy is not good enough. Can anyone suggest an alternative approach which is better than simply performing more Monte-Carlo trials,

thanks

Dave

Subject: What is the probability that random integers sum to a given value?

From: John D'Errico

Date: 30 Jun, 2009 17:02:01

Message: 2 of 24

"David Heslop" <david_heslop@xyz.com> wrote in message <h2de1d$sgb$1@fred.mathworks.com>...
> Can anybody suggest an efficient method to approach this problem?
>
> Choose W unique random integers drawn from 1 to N, what is the probability that their sum will equal a given value? Typically both N and W are large, around 500 and 50 respectively. Currently I'm approaching this problem using a simple Monte Carlo approach to determine probability distributions for a range of sum values for given N and W, but the accuracy is not good enough. Can anyone suggest an alternative approach which is better than simply performing more Monte-Carlo trials,
>

With or without replacement?

If the sampling is with replacement, then the
central limit theorem applies, and for a sample
size of 50, will be reasonably accurate in
expectation.

John

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 30 Jun, 2009 21:20:02

Message: 3 of 24

"David Heslop" <david_heslop@xyz.com> wrote in message <h2de1d$sgb$1@fred.mathworks.com>...
> Can anybody suggest an efficient method to approach this problem?
>
> Choose W unique random integers drawn from 1 to N, what is the probability that their sum will equal a given value? Typically both N and W are large, around 500 and 50 respectively. Currently I'm approaching this problem using a simple Monte Carlo approach to determine probability distributions for a range of sum values for given N and W, but the accuracy is not good enough. Can anyone suggest an alternative approach which is better than simply performing more Monte-Carlo trials,
>
> thanks
>
> Dave

m=6; % vector length (w)
n=15; % define n such that 0<=w<=n

% engine
% possible sum vector
s = (0:n*m);
% Compute probabillity
d = zeros(1,m*(n-1)+1);
i1 = n-m+2;
i = cumsum([i1 repmat(n+1,1,m-2)]);
d(i) = arrayfun(@(k) (-1)^(k-1)*nchoosek(m,k-1), 2:m);
c = d;
for k=m:-1:1
    c = cumsum([nchoosek(m-1,k-1) c]);
end
P = c / sum(c);

% Graphic
figure;
plot(s, P, '-o');
xlabel('sum');
ylabel('P(sum)');
title(sprintf('vector length = %d, n = %d', m, n));

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 30 Jun, 2009 21:29:01

Message: 4 of 24

Note that I assume W draw from *0* to n. It is trivial use my method after shifting by 1.

Bruno

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 30 Jun, 2009 21:48:01

Message: 5 of 24

A slighly cleaner code

m=9; % vector length (w)
n=20; % define n such that 0<=w<=n

% engine
% possible sum vector
s = (0:n*m);
% Compute probabillity
c = zeros(1,m*(n-1)+1);
i = cumsum([n-m+2 repmat(n+1,1,m-2)]);
c(i) = arrayfun(@(k) (-1)^k*nchoosek(m,k), 1:m-1);
for k=0:m-1
    c = cumsum([nchoosek(m-1,k) c]);
end
P = c / sum(c);

% Graphic
figure;
plot(s, P, '-o');
xlabel('sum');
ylabel('P(sum)');
title(sprintf('vector length = %d, n = %d', m, n));

% Bruno

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 1 Jul, 2009 06:14:01

Message: 6 of 24

Also important note: my method is correct for n>=m (actually m-1). So far I fail to find a method of calculation for n<m.

Bruno

Subject: What is the probability that random integers sum to a given value?

From: David Heslop

Date: 1 Jul, 2009 07:02:01

Message: 7 of 24

Thanks for your input, the sampling it assumed to be without replacement. The problem derives from attempting to form significance levels for rank statistics where replacement is not allowed,

thanks,
Dave

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 1 Jul, 2009 07:55:02

Message: 8 of 24

"David Heslop" <david_heslop@xyz.com> wrote in message <h2f1l9$4fo$1@fred.mathworks.com>...
> The sampling it assumed to be without replacement.

Woah, John was right to ask such question. To me if it's not specified, I understand W is independent random variables of uniform distribution, i.e., "with replacement". The calculation changes a lot between these two cases (it's OK for your monte-carlo calculation, but think about those who tried to derive the formula).

Bruno

Subject: What is the probability that random integers sum to a given value?

From: David Heslop

Date: 1 Jul, 2009 08:14:01

Message: 9 of 24

Hi Bruno,
sorry if this led to confusion, I hoped that I had made this aspect explicit in the original problem when stating that the integers had to be "unique". I must have got mixed up with my stats terminology,
Dave

Subject: What is the probability that random integers sum to a given value?

From: tristram.scott@ntlworld.com (Tristram Scott)

Date: 1 Jul, 2009 09:35:17

Message: 10 of 24

David Heslop <david_heslop@xyz.com> wrote:
> Hi Bruno,
> sorry if this led to confusion, I hoped that I had made this aspect
> explicit in the original problem when stating that the integers had to be
> "unique". I must have got mixed up with my stats terminology,
> Dave

Unique does imply that the sampling is without replacement, but it would
have avoided some confusion if you had been explicit on this.

In practice, though, I can't see it making much difference at all for the
problem you describe. The exception is going to be right at the tails of
the distribution.

For the orders of magnitude that you suggest, integers between 1 and 500,
sampling 50 times, I would go with John's suggestion as use the central
limit theorem.

If you think of a very small number of samples, 2, then what you have is a
discrete triangular distribution. You have 500^2 possible combinations,
each equally likely. Of those, only 500 are the non-unique ones, and so
not allowed. Now, if your range were integers between 1 and 10,
replacement might be relevant, but for the orders of magnitude you
describe, I don't think it is.

n = 500;
x0 = (1:n)';
p0 = ones(n,1)/n;
x2 = (2:2*n)'; % The sums
p2 = conv(p0,p0); % The probabilities
x2a = 2*x0; % The sums where we used the same number twice
p2a = p0.^2; % The probabilities
plot(x2,p2,x2a,p2a,'.')
grid on

--
Dr Tristram J. Scott
Energy Consultant

Subject: What is the probability that random integers sum to a given value?

From: John D'Errico

Date: 1 Jul, 2009 10:29:02

Message: 11 of 24

tristram.scott@ntlworld.com (Tristram Scott) wrote in message <p5G2m.8$AX1.0@newsfe20.ams2>...
> David Heslop <david_heslop@xyz.com> wrote:
> > Hi Bruno,
> > sorry if this led to confusion, I hoped that I had made this aspect
> > explicit in the original problem when stating that the integers had to be
> > "unique". I must have got mixed up with my stats terminology,
> > Dave
>
> Unique does imply that the sampling is without replacement, but it would
> have avoided some confusion if you had been explicit on this.
>
> In practice, though, I can't see it making much difference at all for the
> problem you describe. The exception is going to be right at the tails of
> the distribution.

I thought that the whole thing would hinge on
the replacement issue. The question is how bad
will it be if we ignore replacement? The birthday
paradox suggests that the probability of a
replacement conflict is high if we use one
method when the other applies. On a population
size of 500, a sample of 50 that is done with
replacement is very likely to have one or more
matches.

On a population size of 500 and a sample size of
50, sampling with replacement will yield a very
normal (Gaussian) looking curve. The expected
value of the sum WITH replacement will be

  k*(n+1)/2

for samples of size k in a population of n. So for
n = 500 and k = 50, this is

n = 500;k = 50;
k*(n+1)/2
ans =
       12525

Lets try a sample without replacement.

[tags,tags] = sort(rand(10000,500),2);
S = sum(tags(:,1:50),2);

Over 10000 iterations of the process, the distribution
of the sum looks quite normally distributed.

hist(S,100)

The skewness should be zero for a normal, and
the Kurtosis should be 3.

kurtosis(S)
ans =
       2.9686

skewness(S)
ans =
    0.0044813

As well, the mean of our samples should be 12525
for the samples with replacement, and this is included
in the confidence intervals on our samples without
replacement.

[MUHAT,SIGMAHAT,MUCI,SIGMACI] = normfit(S)
MUHAT =
        12513
SIGMAHAT =
       966.09
MUCI =
        12494
        12532
SIGMACI =
       952.89
       979.67

So it looks to me after a quick test that it matters
relatively little if you sample with or without
replacement, at least if your population is as large
as 500.

John

Subject: What is the probability that random integers sum to a given value?

From: David Heslop

Date: 1 Jul, 2009 10:56:01

Message: 12 of 24

Dear All,
It is an interesting point that the replacement will have little effect given a large enough value of N. As Tristram mentioned the main influence will be in the tails of the distribution, unfortunately the tails are the most important issue for me because I’m trying to determine critical values for a statistical test. It may be that overall the solutions you have offered are still an improvement over a MC derived distribution which respects the replacement issue.
Thanks for your help,
Dave

Subject: What is the probability that random integers sum to a given value?

From: John D'Errico

Date: 1 Jul, 2009 11:16:01

Message: 13 of 24

"David Heslop" <david_heslop@xyz.com> wrote in message <h2ffc1$1t0$1@fred.mathworks.com>...
> Dear All,
> It is an interesting point that the replacement will have little effect given a large enough value of N. As Tristram mentioned the main influence will be in the tails of the distribution, unfortunately the tails are the most important issue for me because I’m trying to determine critical values for a statistical test. It may be that overall the solutions you have offered are still an improvement over a MC derived distribution which respects the replacement issue.
> Thanks for your help,
> Dave

I'll claim that for a population of 500, with samples
of size 50 taken without replacement, that the
normal approximation will be far better than
Monte Carlo, even if you look into the tails of
the distribution.

How many hits do we expect to see if we sample
with replacement?

X = sort(ceil(rand(100000,50)*500),2);

mean(sum(diff(X,[],2)==0,2))
ans =
       2.3659

Given that the exact distribution will be quite
difficult to compute for the case without
replacement, I'd be strongly tempted to use the
central limit theorem here.

John

Subject: What is the probability that random integers sum to a given value?

From: David Heslop

Date: 1 Jul, 2009 12:59:02

Message: 14 of 24

Thanks John, that is an extremely demonstrative piece of code,
Dave

Subject: What is the probability that random integers sum to a given value?

From: dpb

Date: 1 Jul, 2009 12:59:50

Message: 15 of 24

John D'Errico wrote:
> "David Heslop" <david_heslop@xyz.com> wrote in message <h2ffc1$1t0$1@fred.mathworks.com>...
>> Dear All,
>> It is an interesting point that the replacement will have little
effect given a large enough value of N. As Tristram mentioned the main
influence will be in the tails of the distribution, unfortunately the
tails are the most important issue for me because I’m trying to
determine critical values for a statistical test. ...

I'd ask how _FAR_ into the tails are you concerned about?

--

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 1 Jul, 2009 13:52:01

Message: 16 of 24

It seems very difficult to get the exact probability for non-replacement case. I would go with John : try to ignore the non-replacement assumption. The question is how much error we make by this approximation. In any case I'm with John, it's probably a better way than Monte Carlo simulation.

Bruno

Subject: What is the probability that random integers sum to a given value?

From: dpb

Date: 1 Jul, 2009 13:56:39

Message: 17 of 24

Bruno Luong wrote:
> It seems very difficult to get the exact probability for
> non-replacement case. I would go with John : try to ignore the
> non-replacement assumption. The question is how much error we make by
> this approximation. In any case I'm with John, it's probably a better
> way than Monte Carlo simulation.
>

That's why I asked about how far out into the tails does Dave need/want
estimates for?

Typical 95/99 CL's I'd think w/o actually looking at it in detail would
be pretty darn good w/ CLT; as you approach 99.5 and on then will become
more of an issue obviously. Of course, there's where simple MC will
breakdown, too.

--

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 1 Jul, 2009 14:25:03

Message: 18 of 24

dpb <none@non.net> wrote in message <h2fqan$ooa$1@news.eternal-september.org>...

>
> Typical 95/99 CL's I'd think w/o actually looking at it in detail would
> be pretty darn good w/ CLT; as you approach 99.5 and on then will become
> more of an issue obviously. Of course, there's where simple MC will
> breakdown, too.

There is a fundamental difference IMHO. The approximate assumption (replacing vs non-replacing) will introduce a systematic error that stays right there (could be very small) and cannot go away. The monte-carlo approach - in theory - would be able to provide the result as good as we wish, since the error is purely statistics.

Bruno

Subject: What is the probability that random integers sum to a given value?

From: David Heslop

Date: 1 Jul, 2009 14:36:01

Message: 19 of 24

The question of how far I need to go into the tails is a bit of an open one. I'm using this approach in an attempt to identify anomalous events within a time series. Rather than saying that a given event passes a certain threshold, for example 95 or 99, I have been trying to assign specific probabilities. This means that sometimes I'm visiting the extremes of the tails and often with the MC approach a zero probability is returned.

Dave

Subject: What is the probability that random integers sum to a given value?

From: dpb

Date: 1 Jul, 2009 14:43:10

Message: 20 of 24

David Heslop wrote:
> The question of how far I need to go into the tails is a bit of an
> open one. I'm using this approach in an attempt to identify anomalous
> events within a time series. Rather than saying that a given event
> passes a certain threshold, for example 95 or 99, I have been trying
> to assign specific probabilities. This means that sometimes I'm
> visiting the extremes of the tails and often with the MC approach a
> zero probability is returned.
>
> Dave

That's tougher then...as Bruno notes, the asymptotic solution will have
a bias, the magnitude (and direction) of which is unknown w/o additional
work (not, of course, that that is news to you :) ).

The error as you extend into the further tails, while small absolute
will likely become larger and larger in relation to the value.

W/O having time to work on it, only other suggestion I'd have would be
perhaps some stratified sampling techniques to help in a few really
extensive MC simulations to see if can quantify magnitude and direction
of the bias at a few points as the tail goes farther out. I know that's
not much direct help...

--

Subject: What is the probability that random integers sum to a given value?

From: tristram.scott@ntlworld.com (Tristram Scott)

Date: 1 Jul, 2009 15:44:49

Message: 21 of 24

David Heslop <david_heslop@xyz.com> wrote:
> The question of how far I need to go into the tails is a bit of an open
> one. I'm using this approach in an attempt to identify anomalous events
> within a time series. Rather than saying that a given event passes a
> certain threshold, for example 95 or 99, I have been trying to assign
> specific probabilities. This means that sometimes I'm visiting the extremes
> of the tails and often with the MC approach a zero probability is
> returned.
>

If the number of observations is large enough for CLT, use that. If it
isn't, calculate the exact PDF for the problem with replacement, and use
that. You can calculate the exact PDF by a repeated convolution.

If Monte Carlo gives you a (very close to) zero probability, go with that.
If you want to know bounds on the error, do repeated Monte Carlo runs, and
calculate an estimate of the value associated with a given probability.
You can do this many times to get a confidence interval for the value.

--
Dr Tristram J. Scott
Energy Consultant

Subject: What is the probability that random integers sum to a given value?

From: tristram.scott@ntlworld.com (Tristram Scott)

Date: 1 Jul, 2009 16:35:17

Message: 22 of 24

Tristram Scott <tristram.scott@ntlworld.com> wrote:
> David Heslop <david_heslop@xyz.com> wrote:
>> The question of how far I need to go into the tails is a bit of an open
>> one. I'm using this approach in an attempt to identify anomalous events
>> within a time series. Rather than saying that a given event passes a
>> certain threshold, for example 95 or 99, I have been trying to assign
>> specific probabilities. This means that sometimes I'm visiting the extremes
>> of the tails and often with the MC approach a zero probability is
>> returned.
>>
>
[snip]

I had another look at the snippet of code than John posted, and I am less
convinced than I was before.

X = sort(ceil(rand(1000000,50)*500),2); % Note the bigger sample size.
dX = diff(X,[],2);
sX = sum(dX==0,2);
hist(sX,0:12)

% Looks very much like poisson pdf. Presumably someone else expected that.
ff = find(~sX); % Rows where there were no repeats.
n1 = sum(X,2);
n2 = sum(X(ff,:),2);
xn1 = (1:length(n1))/length(n1);
xn2 = (1:length(n2))/length(n2);
sn1 = sort(n1);
sn2 = sort(n2);
plot(sn1,xn1,sn2,xn2) % Cumulative density functions
grid on

>> length(n2)/length(n1)

ans =

    0.0793

So, n1 is the observed sums with replacement. n2 is the subset of those
which would have been valid without replacement. Only 8% get through. The
cumulative density plots are very similar, but you can spot the difference
by eye.

My conclusion at this point is that Monte Carlo might be the better
approach, which is not what I had expected.


--
Dr Tristram J. Scott
Energy Consultant

Subject: What is the probability that random integers sum to a given value?

From: John D'Errico

Date: 1 Jul, 2009 16:58:02

Message: 23 of 24

tristram.scott@ntlworld.com (Tristram Scott) wrote in message <RvL2m.1720$HE6.725@newsfe27.ams2>...
> David Heslop <david_heslop@xyz.com> wrote:
> > The question of how far I need to go into the tails is a bit of an open
> > one. I'm using this approach in an attempt to identify anomalous events
> > within a time series. Rather than saying that a given event passes a
> > certain threshold, for example 95 or 99, I have been trying to assign
> > specific probabilities. This means that sometimes I'm visiting the extremes
> > of the tails and often with the MC approach a zero probability is
> > returned.
> >
>
> If the number of observations is large enough for CLT, use that. If it
> isn't, calculate the exact PDF for the problem with replacement, and use
> that. You can calculate the exact PDF by a repeated convolution.

True. conv will do it, IFF you allow replacement.

But it is not correct when sampling is done without
replacement.

format rat
P = ones(1,3)/3
P =
       1/3 1/3 1/3

conv(P,P)
ans =
       1/9 2/9 1/3 2/9 1/9

This suggests that a sum of 2 = 1 + 1 occurs 1/9 of the time.
But if we are sampling without replacement, obviously we know
that 2 can never occur. So for the sum of two numbers taken
without replacement from the set {1,2,3}, the actual
distribution of sums will be the sums {3,4,5}, each with
probability 1/3.

This is very different from that which conv will yield.

For small population sizes, the solution is to use nchoosek.
Thus, for a population size of n = 5 with a sample size
of 2, we get these exact frequencies:

C = accumarray(sum(nchoosek(1:5,2),2),1);
m = find(C);
[m,C(m)/sum(C)]

ans =
       3 1/10
       4 1/10
       5 1/5
       6 1/5
       7 1/5
       8 1/10
       9 1/10

Or, here it is for n = 15 and k = 5.

C = accumarray(sum(nchoosek(1:15,5),2),1);
m = find(C);
[m,C(m)/sum(C)]
ans =
      15 1/3003
      16 1/3003
      17 2/3003
      18 1/1001
      19 5/3003
      20 1/429
      21 10/3003
      22 1/231
      23 6/1001
      24 23/3003
      25 10/1001
      26 12/1001
      27 15/1001
      28 53/3003
      29 3/143
      30 24/1001
      31 83/3003
      32 92/3003
      33 103/3003
      34 37/1001
      35 11/273
      36 127/3003
      37 134/3003
      38 137/3003
      39 47/1001
      40 47/1001
      41 47/1001
      42 137/3003
      43 134/3003
      44 127/3003
      45 11/273
      46 37/1001
      47 103/3003
      48 92/3003
      49 83/3003
      50 24/1001
      51 3/143
      52 53/3003
      53 15/1001
      54 12/1001
      55 10/1001
      56 23/3003
      57 6/1001
      58 1/231
      59 10/3003
      60 1/429
      61 5/3003
      62 1/1001
      63 2/3003
      64 1/3003
      65 1/3003

I'll bet that even this curve is getting close to
a normal distribution though. The central limit
theorem rules.

bar(m,C(m))

You can always use the nchoosek solution (which
is provably exact) up to the point where it takes too
time/ram to generate, and then switch to the normal
approximation.

John

Subject: What is the probability that random integers sum to a given value?

From: Bruno Luong

Date: 1 Jul, 2009 21:03:01

Message: 24 of 24

For fun, I compute the probability for case for of m=50, n=500 WITH replacement, using John's VPI

The total number of cases is about 9.8149e+134, or more precisely

981490936196259811119329333602238174031516339730798864287154519317011050489257526227332060984694506759464048548759295913146200306275001

cases, which is about the square of the linear size of the observable universe count in Plank-length! It takes about 15 minutes to run on my computer. I guess the result is actually closer to the true Gaussian than the double-float evaluation of exp(-x^2/2) could even give.

m=50; % vector length (w)
n=500; % define n such that 0<=w<=n

% engine
% possible sum vector
tic
s = (0:n*m);
% Compute probabillity
c = zeros(1,m*(n-1)+1);
i = cumsum([n-m+2 repmat(n+1,1,m-2)]);
M = vpi(m); % Use John's VPI class
for k=1:m-1
    c(i(k)) = (-1)^k*nchoosek(M,k);
end
for k=0:m-1
    fprintf('k=%d/%d\n', k, m-1);
    c = cumsum([nchoosek(M-1,k) c]);
end
cdbl = double(c);
P = cdbl / sum(cdbl);
toc

% Graphic
figure(2);
clf
plot(s, P, 'r.');
xlabel('sum');
ylabel('P(sum)');
title(sprintf('vector length = %d, n = %d', m, n));

% Bruno

Tags for this Thread

What are tags?

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

Contact us