Path: news.mathworks.com!not-for-mail
From: "John D'Errico" <woodchips@rochester.rr.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: What is the probability that random integers sum to a given value?
Date: Wed, 1 Jul 2009 10:29:02 +0000 (UTC)
Organization: John D'Errico (1-3LEW5R)
Lines: 84
Message-ID: <h2fdpe$ed9$1@fred.mathworks.com>
References: <h2de1d$sgb$1@fred.mathworks.com> <h2dvi2$sn3$1@fred.mathworks.com> <h2e02t$4bt$1@fred.mathworks.com> <h2e16h$hfr$1@fred.mathworks.com> <h2eur9$db8$1@fred.mathworks.com> <h2f1l9$4fo$1@fred.mathworks.com> <h2f4om$s57$1@fred.mathworks.com> <h2f5s9$bnl$1@fred.mathworks.com> <p5G2m.8$AX1.0@newsfe20.ams2>
Reply-To: "John D'Errico" <woodchips@rochester.rr.com>
NNTP-Posting-Host: webapp-05-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1246444142 14761 172.30.248.35 (1 Jul 2009 10:29:02 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Wed, 1 Jul 2009 10:29:02 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 869215
Xref: news.mathworks.com comp.soft-sys.matlab:551941


tristram.scott@ntlworld.com (Tristram Scott) wrote in message <p5G2m.8$AX1.0@newsfe20.ams2>...
> David Heslop <david_heslop@xyz.com> wrote:
> > Hi Bruno,
> > sorry if this led to confusion, I hoped that I had made this aspect
> > explicit in the original problem when stating that the integers had to be
> > "unique".  I must have got mixed up with my stats terminology,
> > Dave
> 
> Unique does imply that the sampling is without replacement, but it would
> have avoided some confusion if you had been explicit on this.
> 
> In practice, though, I can't see it making much difference at all for the
> problem you describe.  The exception is going to be right at the tails of
> the distribution.

I thought that the whole thing would hinge on
the replacement issue. The question is how bad
will it be if we ignore replacement? The birthday
paradox suggests that the probability of a
replacement conflict is high if we use one
method when the other applies. On a population
size of 500, a sample of 50 that is done with
replacement is very likely to have one or more
matches.

On a population size of 500 and a sample size of
50, sampling with replacement will yield a very
normal (Gaussian) looking curve. The expected
value of the sum WITH replacement will be

  k*(n+1)/2

for samples of size k in a population of n. So for
n = 500 and k = 50, this is 

n = 500;k = 50;
k*(n+1)/2
ans =
       12525

Lets try a sample without replacement.

[tags,tags] = sort(rand(10000,500),2);
S = sum(tags(:,1:50),2);

Over 10000 iterations of the process, the distribution
of the sum looks quite normally distributed.

hist(S,100)

The skewness should be zero for a normal, and 
the Kurtosis should be 3.

kurtosis(S)
ans =
       2.9686

skewness(S)
ans =
    0.0044813

As well, the mean of our samples should be 12525
for the samples with replacement, and this is included
in the confidence intervals on our samples without
replacement.

[MUHAT,SIGMAHAT,MUCI,SIGMACI] = normfit(S)
MUHAT =
        12513
SIGMAHAT =
       966.09
MUCI =
        12494
        12532
SIGMACI =
       952.89
       979.67

So it looks to me after a quick test that it matters
relatively little if you sample with or without
replacement, at least if your population is as large
as 500.

John