Path: news.mathworks.com!not-for-mail
From: "Vivek " <vivek_mutalik@yahoo.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Is this kind of regression possible?
Date: Fri, 30 Nov 2007 15:08:14 +0000 (UTC)
Organization: University of California, San Francisco
Lines: 141
Message-ID: <fip90u$9eu$1@fred.mathworks.com>
References: <fio8nm$6k4$1@fred.mathworks.com> <fior8p$asa$1@fred.mathworks.com>


"John D'Errico" <woodchips@rochester.rr.com> wrote in
message <fior8p$asa$1@fred.mathworks.com>...
> "Vivek " <vivek_mutalik@yahoo.com> wrote in message 
> <fio8nm$6k4$1@fred.mathworks.com>...
> > Hi,
> > 
> > I m having difficulty in formulating following problem. If
> > you have any suggestions that'll be great.
> 
> As you state, you are having difficulty in
> formulating what you need to do. But you
> need to explain your problem well in order
> for us to help you.
> 
>  
> > Ive set of "aligned DNA sequences" with their activities. I
> > want to do regression so that i can get weights for each
> > base (A,C,G,T). This may help me in understanding which
> > bases are 'important and contribute' towards measured
> > activity.
> > Example: My activity VS sequence table looks like
> > (1) 08 ACAG
> > (2) 10 ATTC
> > (3) 05 GGTA
> > (4) 04 CCGT
> > (5) ... ....
> >    ....etc
> > 
> > My solution would be: to minimize the residual sum of
> > square:
> > (here W is weight of that particular base, which is what im
> > trying to estimate)
> > 
> > = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> > +W4C)]^2 + and so on.
> 
> Here from what you say, it looks like
> {W1, W2, W3, W4} are numbers, unknowns
> that you wish to determine?
> 
> Likewise, are {A,C,G,T} also each scalar
> numbers? What do they denote? Do these
> "variables" have known values?
> 
> 
> > to reduce the parameters to be determined, I can substract
> > weight of 'T' from each of weights and finally add sum of
> > all 'T' (as if 'T' is in all positions).
> > 
> > That is:
> > 
> > = [8 - (W1A-W1T + W2C-W2T + W3A-W3T + W4G-W4T) +
> > (W1T+W2T+W3T+W4T)]^2
> >  +
> > [10 - (W1A-W1T + W4C -W1T) + (W1T+W2T+W3T+W4T) ]^2
> > 
> > + so on;
> 
> I don't see how this reduces anything. In
> fact, its not even mathematically equivalent
> to the original expression. You are essentially
> ADDING 2*(W1T+W2T+W3T+W4T) inside
> each term in the sum. Note that in
> mathematics, two negatives do indeed
> make a positive.
> 
> Even if you get your signs right, what is
> the purpose of this operation?
> 
>  
> > Is it making any sense? So by this way, i was thinking of
> > getting weights for all bases by using some kind of residual
> > minimizing function. Is it possible ?
> 
> I'm sorry, you are not making sense yet, at
> least not completely. But please try again.
> 
> John
> 
-------------------------------------------------------
Sorry for not being clear. Here are my answers.

1. {W1, W2, W3, W4} are the unknowns I wish to determine.

2. "Are {A,C,G,T} also each scalar numbers? What do they
denote? Do these "variables" have known values?"

They are not numbers. (Assuming non-biologist readers:) DNA
is made up of the letters A, C, G and T, for example
AGCTGCTAACAGT... So each sequence (string) is made up of
these. When I align the sequences, I know how many times A
occurs at the 1st position, how many times C occurs at the
2nd position, and so on. So, in a way, their occurrence at
each position is known, but they do not have known values.

3. Parameter reduction: "it's not even mathematically
equivalent......."

I was thinking of something like this. The following matrix
holds assumed values for assumed sequences (not the ones
from my original post), as if I had already calculated the
weights for each letter at each position:
 
position  1   2   3   4

A         8   5  -9  -3
C        -4  -7  -3   3
G         3   1   1  -4
T         5  -4  -5   6
  
Now, this matrix can be used to evaluate any sequence by
picking, from each column, the element that corresponds to
the base at that position and adding the picked elements up.
For example:
AAGT = 8 + 5 + 1 + 6 = 20
CCGT = (-4) + (-7) + 1 + 6 = -4
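
The scoring rule above can be sketched in a few lines of Python. This
is only an illustration: the names WEIGHTS and score are made up here,
and the numbers are the assumed values from the table above, not fitted
weights.

```python
# Assumed weight matrix from the table above:
# one row per base, one entry per position (1 to 4).
WEIGHTS = {
    "A": [8, 5, -9, -3],
    "C": [-4, -7, -3, 3],
    "G": [3, 1, 1, -4],
    "T": [5, -4, -5, 6],
}

def score(seq, weights=WEIGHTS):
    """Add up the weight of the base found at each position of seq."""
    return sum(weights[base][pos] for pos, base in enumerate(seq))

print(score("AAGT"))  # 20
print(score("CCGT"))  # -4
```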

Now let's look at reducing the number of variables. See
this matrix; I explain it below.

position  1   2   3   4

A         3   9  -4  -9
C        -9  -3   2  -3
G        -2   5   6 -10
T         0   0   0   0
For example, for the sequence AAGT I pick
AAGT = (3 + 9 + 6 + 0) + 2 = 18 + 2 = 20
Here 2 is the sum of all the T's, i.e. 5 - 4 - 5 + 6 = 2.
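
As a quick check of this reduction, here is a small Python sketch that
subtracts the T row from every row of the first matrix and keeps the
sum of the T weights as a constant. The function names are hypothetical
and the numbers are again the assumed example values.

```python
# Assumed full weight matrix (first table): base -> weight per position.
FULL = {
    "A": [8, 5, -9, -3],
    "C": [-4, -7, -3, 3],
    "G": [3, 1, 1, -4],
    "T": [5, -4, -5, 6],
}

def reduce_to_reference(full, ref="T"):
    """Subtract the reference row from every row; return (reduced, constant)."""
    reduced = {
        base: [w - r for w, r in zip(row, full[ref])]
        for base, row in full.items()
    }
    constant = sum(full[ref])  # score of the all-reference sequence
    return reduced, constant

def score(seq, weights, constant=0):
    """Constant plus the weight of the base found at each position."""
    return constant + sum(weights[base][pos] for pos, base in enumerate(seq))

reduced, constant = reduce_to_reference(FULL)
print(reduced["A"], constant)  # [3, 9, -4, -9] 2

# Both parametrizations give the same score.
for seq in ("AAGT", "CCGT", "TTTT"):
    assert score(seq, FULL) == score(seq, reduced, constant)
```

The T row of the reduced matrix comes out as all zeros, so only the
A, C and G entries plus the constant need to be stored.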

To reduce the number of variables, the value for each 'T' is
set to zero and a constant term is added. This constant term
is the value given to the all-T sequence. The other elements
are the differences between each of the other bases and a T
at that position. So the total number of variables is three
times the number of positions, plus one. Is this known as
dummy encoding in statistical analysis? I am not sure.
Anyway, I think this boils down to the equation I
posted earlier. Please comment if it is still not clear.
Thanks again for writing back.
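
P.S. A rough Python sketch of what I mean by the "three times the
number of positions, plus one" variables, with T as the reference
base. The names encode and params are made up for illustration, and
params simply reuses the assumed numbers from the second matrix.

```python
BASES = "ACG"  # T is the reference base and gets no column of its own

def encode(seq):
    """Return one regression row: a leading 1, then A/C/G indicators per position."""
    row = [1]  # column for the constant term (score of the all-T sequence)
    for base in seq:
        row.extend(1 if base == b else 0 for b in BASES)
    return row

# Assumed parameters in the same column order: the constant 2, then the
# A, C, G entries of the second matrix for positions 1 to 4.
params = [2, 3, -9, -2, 9, -3, 5, -4, 2, 6, -9, -3, -10]

row = encode("AAGT")
print(len(row))  # 13, i.e. 3 * 4 + 1
print(sum(x * p for x, p in zip(row, params)))  # 20
```

Stacking one such row per measured sequence would give a matrix X, and
a least-squares fit of X against the activities (for example X\y in
MATLAB) would then estimate the 13 parameters, provided there are
enough distinct sequences.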