Asked by promo
on 30 Jan 2013

Hello,

My goal is to compute the 95% confidence interval of a random variable, say X, distributed according to the normal distribution with mean 2 and variance 9/100. The code I have so far is as follows

BEGINNING OF CODE

conventionalCI = zeros(2,1000);

for i=1:1000

%reset values for next round, i+1 rr = zeros(1,200);

%Here we are drawing directly from the distrution N(2,9/100), which is the %distribution of the sameple means of a random variable from N(2,9).

rr = 2 + (3/10).*normrnd(0,1,1,200);

%The quantiles for the data above is recorded

conventionalCI(1,i) = quantile(sort(rr),.975); conventionalCI(2,i) = quantile(sort(rr),.025);

end

%Proportion (for conventional confidence intervals)

propCI = zeros(1,1000); for i=1:1000 if (2>=conventionalCI(2,i)) & (conventionalCI(1,i)>=2) propCI(i)=1; else propCI(i)=0; end

end

proportionCI = sum(propCI)/1000

END OF CODE

In essence what I'm doing is drawing from the distribution N(2,9/100) with a sample of size 200, saving the 97.5 and 2.5 percentiles in the variable conventionalCI and calculating the proportion of times the true mean, 2, falls in the intervals in conventionalCI.

The problem is I keep getting the 1 for the proportion of times 2 falls in the interval which doesn't make sense given that I'm trying to construct a 95% confidence interval. Can anyone figure out why this is the case?

I appreciate your help

*No products are associated with this question.*

Answer by the cyclist
on 30 Jan 2013

Edited by the cyclist
on 2 Feb 2013

Suppose your friend gives you a sample (drawn from a parent distribution that you do not know). Maybe that sample is

>> x = [1 1.5 1.8 1.9 1.9 1.9 1.9 2 2 2 2 2 2.1 2.1 2.1 2.1 2.2 2.5 3];

It happens to have a sample mean of exactly 2. (If I were not lazy, I could have given you lots more data, and told you the confidence interval as well.) *You do not know what the parent mean is*. The parent distribution might have a mean of 2.04, or 1.93. The *most likely* parent mean is 2, and parent means further and further from 2 are statistically less likely. Under normal distribution assumptions, you can calculate how likely each value of the parent mean is.

Now -- and this is the point that is specifically important for your code -- you cannot test that CI by drawing over and over again from a parent distribution with mean 2. That would just be assuming the exact thing you are attempting to test! You will almost always ( *way* more than 95%) have a sample mean within your CI.

Instead, you would need to test by drawing from the entire range of possible parent distributions, in proportion to their probability of having generated the sample data that your friend gave you.

This is a very subtle but important distinction. I hope I've helped your understanding a little.

EDIT

OK, so I finally found some time to sit down and write out some code that illustrates what you can do. I tried to comment it richly and use explicit variable names to help you understand what's going on. The core of what you tried to do is still there, which I hope will guide you.

function proportionCI = howOftenConfidenceIntervalContainsMean()

% Set the random number seed. [Older MATLAB version don't have the rng() function. Modify accordingly.] rng(1)

SAMPLESIZE = 200;

% Create a "random" sample. For the rest of the code, you have to pretend that you never % knew the true parameters of the parent distribution. rr = 2 + 3*normrnd(0,1,1,SAMPLESIZE);

sampleMean = mean(rr); sampleStandardDeviation = std(rr); sampleStandardErrorOfMean = sampleStandardDeviation/sqrt(SAMPLESIZE);

confidenceIntervalOfMean = sampleMean + 1.96*sampleStandardErrorOfMean*[1; -1];

% Now, here's the slightly tricky part. You want the distribution of the possible parent means % that might have resulted in your sample. But you don't know what those are. (You actually % can derive it theoretically, but we are going to get it here by simulation.)

% The trick that enable us to do this by sampling is that we have an APPROXIMATION to the % parent distribution ... our one sample! So, because we can't sample the parent, instead % we SAMPLE THE SAMPLE! (This trick is known as resampling.) We will resample many times, % and each one of those resamples gives us an approximation to the parent mean. We check those % approximate parent means to see if they lie within the confidence interval as often as we % expect.

NUMBER_RESAMPLE_TRIALS = 100000;

% Do the resampling (lots of times), and see what proportion of the (approximate) parent means % lie within the confidence intervals propCI = zeros(1,NUMBER_RESAMPLE_TRIALS);

for ni=1:NUMBER_RESAMPLE_TRIALS

if mod(ni,1000)==0 disp(['Trial ',num2str(ni),' of ',num2str(NUMBER_RESAMPLE_TRIALS)]) end

% For each trial we generate a (re)sample. resampledSample = rr(randsample(SAMPLESIZE,SAMPLESIZE,true));

resampledMean = mean(resampledSample); % Each trials gives one approximation of the parent mean

% Test to see if the parent mean lies within the confidence interval if (resampledMean>=confidenceIntervalOfMean(2)) && (confidenceIntervalOfMean(1)>=resampledMean) propCI(ni)=1; else propCI(ni)=0; end end

proportionCI = sum(propCI)/NUMBER_RESAMPLE_TRIALS

end

Show 1 older comment

the cyclist
on 30 Jan 2013

No, it's not a trivial couple lines of code, I don't believe. Sorry. But if you understand the theory well, there is nothing too difficult to code.

Just as a bit more background, this presentation http://www.astro.umass.edu/~schloerb/ph281/Lectures/NormalDistribution/NormalDistribution.pdf has the theory that I am talking about. About 3/4 of the way in, there is a slide with "Most likely value of mean is the sample mean" written in red font. (The slides aren't numbered, sorry.) At the top of that slide you see the formula for the relative probability that the sample comes from a parent with true mean mu.

the cyclist
on 2 Feb 2013

Youssef Khmou
on 4 Feb 2013

hi, yes now it looks good how to manipulate the same sample instead of generating many .

thanks .

Answer by Youssef Khmou
on 30 Jan 2013

HI,

Try this modified version, :

%KHMOU Youssef, Statistics :exmpl1 30/01/2013 5:00 AM

N=1000;

%sample=2.00+(3/10)*randn(1,N);

sample=normrnd(2.00,3/10,1,N);

conventionalCI = zeros(2,N);

propCI=zeros(1,N);

%sample=sort(sample); %p=length(sample); %n=p/10;

for i=1:N

sample=normrnd(2.00,3/10,1,i);

a=quantile(sort(sample),.975);

b=quantile(sort(sample),.025);

if (2.00000>=b) && (a>=2.00000)

propCI(i)=1;

else

propCI(i)=0;

end

end

proportionCI = sum(propCI)/N

plot(propCI(1:50))

=============================================================

After each iteration i, the RANDOM sample increases in length, Does it give the result you want ? anyway try to change the number of samples per example :

mean(propCI(1:100))

mean(propCI(1:120))

mean(propCI(1:200))

mean(propCI(1:300))

As the the number of samples increases, th mean tends to 1.

i Hope that helps :

KHMOU Yussef.

Answer by Youssef Khmou
on 4 Feb 2013

HI promo,

The code provided by "the cyclist" does exactly the task ,

Just as an addition, tape in the command Window :

>>randtool

the GUI uses the "resampling" technique mentioned above .

Opportunities for recent engineering grads.

## 0 Comments