Thread Subject: p-values

Subject: p-values

From: arun

Date: 2 Nov, 2009 12:16:27

Message: 1 of 6

Hi,

I have 2 random variables x and y. I calculate the correlation between
x and y. Then I permute y 1000 times and compute the correlation
between x and the permuted y each time (bootstrap approach). Could
anyone suggest how to compute the p-value from this?

Subject: p-values

From: Tom Lane

Date: 2 Nov, 2009 22:15:53

Message: 2 of 6

> I have 2 random variables x and y. I calculate the correlation between
> x and y. Then I permute y 1000 times and compute the correlation
> between x and the permuted y each time (bootstrap approach). Could
> anyone suggest how to compute the p-value from this?

Well, the corr (Statistics Toolbox) and corrcoef (MATLAB) functions will
compute p-values for you:

>> x = randn(10,1);
>> y = .6*x + randn(size(x));
>> [r,p] = corr(x,y)
r =
    0.7564
p =
    0.0114

But if you want to do this by simulation, notice that if the y values are
permuted randomly, there should be no correlation with x. This gives you a
set of sample correlations drawn from the distribution under the null
hypothesis of no correlation. You could just see what proportion of them
exceed, in absolute value, the actual correlation you measured for your
data. That proportion is a two-sided p-value:

>> rv = zeros(1000,1);
>> for j=1:1000; rv(j) = corr(x,y(randperm(numel(y)))); end
>> mean(abs(rv)>.7564)
ans =
    0.0110
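
A minimal sketch of the same simulation as a script, for readers who want to
rerun it. It assumes only base MATLAB, so it uses corrcoef instead of corr.
The variable names (rObs, rPerm, pPerm) and the fixed seed are illustrative
and not from the post above. Shuffling y without replacement like this is
usually called a permutation test rather than a bootstrap. The +1 in the
last line is a common convention that counts the observed ordering as one of
the permutations, so the estimated p-value can never be exactly zero.

```matlab
% Permutation test for a correlation (illustrative sketch)
rng(0);                              % fixed seed, only for repeatability
x = randn(10,1);
y = 0.6*x + randn(size(x));

R = corrcoef(x, y);                  % 2-by-2 correlation matrix
rObs = R(1,2);                       % observed correlation

nPerm = 1000;
rPerm = zeros(nPerm,1);
for j = 1:nPerm
    Rj = corrcoef(x, y(randperm(numel(y))));   % shuffle y, keep x fixed
    rPerm(j) = Rj(1,2);
end

% Two-sided p-value: share of permuted correlations at least as extreme
% as the observed one, with the observed ordering counted once.
pPerm = (sum(abs(rPerm) >= abs(rObs)) + 1) / (nPerm + 1);
```

With only 1000 permutations the estimate itself has sampling error, so two
runs with different seeds will generally differ in the third decimal place.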

-- Tom

Subject: p-values

From: arun

Date: 3 Nov, 2009 17:19:23

Message: 3 of 6

Hi Tom,
I understand. I have another question that goes a little deeper than
this. Suppose I have two vectors x1 and x2 and another vector y. If x1
and x2 are independent of each other (meaning corr(x1,x2) = 0, say),
then I could find the correlation between each of my so-called
"features" x1 and x2 and the "label" y separately in a straightforward
fashion. However, my question is how to find the correlation if x1 and
x2 are in fact dependent on each other. Wouldn't the correlations
calculated as corr(x1,y) and corr(x2,y) be biased or incorrect in that
case?
thank you,
best, arun.

Subject: p-values

From: Tom Lane

Date: 3 Nov, 2009 18:04:41

Message: 5 of 6

> I understand. I have another question that goes a little deeper than
> this. Suppose I have two vectors x1 and x2 and another vector y. If x1
> and x2 are independent of each other (meaning corr(x1,x2) = 0, say),
> then I could find the correlation between each of my so-called
> "features" x1 and x2 and the "label" y separately in a straightforward
> fashion. However, my question is how to find the correlation if x1 and
> x2 are in fact dependent on each other. Wouldn't the correlations
> calculated as corr(x1,y) and corr(x2,y) be biased or incorrect in that
> case?

Arun, I don't think I understand your concern.

Suppose you are interested in corr(x1,y). I could always generate another x2
that is either correlated with x1 or not. How would my doing that cause your
correlation to become biased?

There is a notion of multiple correlation. Its squared value is the R^2
statistic for a regression. It measures the correlation between y and the
linear combination of the x's obtained by regressing y on the x's.
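
As a rough illustration of the multiple correlation just described, the
sketch below regresses y on two predictors and checks that the squared
correlation between y and the fitted values matches R^2. The simulated data,
the coefficients and the variable names are made up for the example, and the
regression includes an intercept, which the equality relies on.

```matlab
% Multiple correlation as corr(y, fitted values) (illustrative sketch)
rng(1);                              % fixed seed, only for repeatability
m = 50;
X = randn(m,2);                      % two hypothetical predictors
y = X*[0.5; -0.3] + randn(m,1);

A = [ones(m,1) X];                   % design matrix with an intercept column
b = A \ y;                           % least-squares coefficients
yhat = A*b;                          % fitted linear combination of the x's

C = corrcoef(y, yhat);
Rmult = C(1,2);                      % multiple correlation

% R^2 computed from the residual and total sums of squares
R2 = 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);

fprintf('Rmult^2 = %.6f, R^2 = %.6f\n', Rmult^2, R2);
```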

There's also the notion of the partial correlation, where you measure the
correlation between two variables after "removing" the effect of another
variable.
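
One way to picture "removing" a variable is to regress both variables on it
and correlate the residuals. The sketch below does that in base MATLAB on
simulated data in which x1 and y are related only through a third variable
z. The data and names are invented for the example. The Statistics Toolbox
also has a partialcorr function, which is not used here.

```matlab
% Partial correlation of x1 and y given z, via residuals (illustrative sketch)
rng(2);                              % fixed seed, only for repeatability
m  = 200;
z  = randn(m,1);                     % hypothetical common driver
x1 = z + 0.5*randn(m,1);
y  = z + 0.5*randn(m,1);             % x1 and y are linked only through z

Z  = [ones(m,1) z];                  % regressors: intercept and z
ex = x1 - Z*(Z \ x1);                % part of x1 not explained by z
ey = y  - Z*(Z \ y);                 % part of y not explained by z

C0 = corrcoef(x1, y);
rRaw = C0(1,2);                      % ordinary correlation
C1 = corrcoef(ex, ey);
rPartial = C1(1,2);                  % partial correlation given z

fprintf('raw r = %.3f, partial r = %.3f\n', rRaw, rPartial);
```

In this setup the raw correlation is large while the partial correlation is
close to zero, because z accounts for the whole relationship.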

I'm not sure if these two things are related to your concern, though.

-- Tom

Subject: p-values

From: arun

Date: 7 Nov, 2009 14:26:29

Message: 6 of 6

Hi Tom,
Thank you once again. I am sorry for replying late. I will try to
explain my problem in detail.

I have a set of random variables [x1,x2,...xn], each of which is an m*1
column vector. Let's call them *features*. And I have another random
variable y (an m*1 column vector), which we call the *label*. My idea is
to find out which of these features (1...n) has the strongest
dependency on y. That is, I would like to find, say, the top 10 most
significant features.

Let me explain what these features are. Taken as a whole, the features
form an m*n matrix. Each *ROW* represents, say, a student from a
particular region who is or isn't affected by a particular disorder
(some students might be from very close regions, from the same place,
or from very different places). For each student, the n entries are the
nucleotides (A, C, G or T) at the most important locations on the
chromosome. These are the potential locations of interest for this
particular disorder in all these students. Each entry of the label
denotes the corresponding outcome, that is, whether the student has
this disorder or not. So each column represents one particular
chromosome location across all the students. Basically, my idea is to
find which chromosome location is most responsible for the disorder.

My problem is that the samples (students) come from different regions
or from very close-by regions, which might make them somewhat dependent
on each other. This leads to a population bias, so the dependency among
the students has to be removed while checking for the dependency of the
disorder on each location. For example, if two students are brothers,
then the disorder is likely to be present in both of them. I hope it's
a bit clearer now?

My question is: when I compute the correlation between the different
chromosome locations and the disorder, how do I make sure that the
dependency among the students (or subjects) is minimized?

Thank you very much once again. I would appreciate it if you could
give me some ideas. I have already read about partial correlation;
maybe it is a bit close to what I seek...

best, arun.
