On Nov 3, 7:04 pm, "Tom Lane" <tl...@mathworks.com> wrote:
> > I understand. I have another question which goes a little deeper
> > than this. Suppose I have two vectors x1 and x2 and another vector y.
> > Now if x1 and x2 are independent of each other (meaning corr(x1,x2) =
> > 0, say), then I could find the correlation between my so-called
> > "features" x1 and x2 and "label" y separately in a straightforward
> > fashion. However, my question is how to find the correlation if x1 and
> > x2 are in fact dependent on each other. Wouldn't the correlation
> > measures calculated as corr(x1,y) and corr(x2,y) be biased
> > or incorrect in this case?
>
> Arun, I don't think I understand your concern.
>
> Suppose you are interested in corr(x1,y). I could always generate another x2
> that is either correlated with x1 or not. How would my doing that cause your
> correlation to become biased?
>
> There is a notion of multiple correlation. Its squared value is the R^2
> statistic for a regression. It measures the correlation between y and the
> linear combination of the x's obtained by regressing y on the x's.
>
> There's also the notion of the partial correlation, where you measure the
> correlation between two variables after "removing" the effect of another
> variable.
>
> I'm not sure if these two things are related to your concern, though.
>
> -- Tom
Hi Tom,
Thank you once again. I am sorry for replying late. I will try to
explain my problem in detail.
I have a set of random variables [x1,x2,...,xn], each of which is an m*1
column vector. Let's call them *features*. And I have another random
variable y (an m*1 column vector) which we call the *label*. My idea is to
find out which of these features (1...n) has the strongest dependency on
y. That is, I would like to find, say, the top 10 most significant
features.
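A minimal sketch of the ranking described above, in plain Python. It
assumes each feature column has already been given a numeric encoding
(for example 1 if the entry matches a chosen reference nucleotide, 0
otherwise) and that the label is coded 0/1. The names pearson and
rank_features are made up for illustration, and the score is the plain
absolute Pearson correlation, which ignores any dependency between rows.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Constant column: the correlation is undefined; score it as 0 here.
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def rank_features(X, y, k):
    """Indices of the k columns of X with the largest |corr(column, y)|.

    X is a list of m rows, each a list of n numeric entries (assumed encoding).
    Ties are broken by the lower column index.
    """
    n_cols = len(X[0])
    scores = []
    for j in range(n_cols):
        col = [row[j] for row in X]
        scores.append((abs(pearson(col, y)), j))
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:k]]

# Toy data: 4 subjects, 3 locations, label y (all values hypothetical).
X = [[1, 0, 1],
     [1, 1, 0],
     [0, 0, 1],
     [0, 1, 0]]
y = [1, 1, 0, 0]
top = rank_features(X, y, 2)
```

With m subjects and n locations, rank_features(X, y, 10) would give the
ten highest-scoring columns under these assumptions.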
Let me explain what these features are. Taken together, the features
form an m*n matrix. Each *ROW* represents, say, a student from a
particular region who is or isn't affected by a particular disorder
(some of the students might be from very close regions, from the same
place, or from very different places). For each student, the n entries
are the nucleotides (A, C, G or T) at the most important locations on
the chromosome. These are the potential locations of interest in all
these students for this particular disorder. Each entry of the label
denotes the corresponding outcome, that is, whether the student has
this disorder or not. It follows that each column represents one
particular chromosome location across all the students. So, basically,
my idea is to find which chromosome location is most responsible for
the disorder.
My problem is that the samples (students) come from different regions
or from nearby regions, which might make them somewhat dependent on
each other. This leads to a population bias, so the dependency among
the students has to be removed when checking how the disorder depends
on the locations. For example, if two students are brothers, then the
disorder is likely to be present in both of them. I hope it's a bit
clearer now?
My question is: when I compute the correlation between the different
chromosome locations and the disorder, how do I make sure that the
dependency among the students (subjects) is minimized?
Thank you very much once again. I would appreciate it if you could
give me some ideas. I have already read about partial correlation;
maybe it is close to what I seek...
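For reference, here is a rough sketch of a first-order partial
correlation in plain Python. It assumes a single numeric covariate z
that encodes the shared background of the subjects (for instance a 0/1
region indicator); z and the function names are hypothetical. It uses
the standard formula
r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)*(1 - r_yz^2)).
Whether one such covariate is enough to capture the dependency among
subjects (related individuals, say) is an open assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("x or y is an exact linear function of z")
    return (r_xy - r_xz * r_yz) / denom

# Toy data (hypothetical): x and y both shift with the group z,
# but are unrelated to each other within each group.
z = [0, 0, 0, 0, 1, 1, 1, 1]
x = [0, 1, 0, 1, 10, 11, 10, 11]
y = [0, 0, 1, 1, 10, 10, 11, 11]
raw = pearson(x, y)             # large, driven entirely by the group shift
adjusted = partial_corr(x, y, z)  # close to zero once z is accounted for
```

In this toy case the raw correlation comes only from the group
membership, and the partial correlation removes it.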
best, arun.