From: "vicky " <vivek_mutalik@yahoo.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Is this kind of regression possible?
Date: Sat, 1 Dec 2007 03:28:32 +0000 (UTC)
Organization: University of California, San Francisco
Message-ID: <fiqkd0$a9h$1@fred.mathworks.com>
References: <fio8nm$6k4$1@fred.mathworks.com> <fiqisu$l62$1@fred.mathworks.com>


"Roger Stafford" <ellieandrogerxyzzy@mindspring.com.invalid>
wrote in message <fiqisu$l62$1@fred.mathworks.com>...
> "Vivek " <vivek_mutalik@yahoo.com> wrote in message
> <fio8nm$6k4$1@fred.mathworks.com>...
> > Hi,
> > 
> > I'm having difficulty formulating the following problem. If
> > you have any suggestions, that would be great.
> > 
> > I have a set of "aligned DNA sequences" with their activities. I
> > want to do a regression so that I can get weights for each
> > base (A, C, G, T). This may help me understand which
> > bases are 'important and contribute' towards the measured
> > activity.
> > Example: My activity VS sequence table looks like
> > (1) 08 ACAG
> > (2) 10 ATTC
> > (3) 05 GGTA
> > (4) 04 CCGT
> > (5) ... ....
> >    ....etc
> > 
> > My solution would be to minimize the residual sum of
> > squares (here W is the weight of that particular base, which is
> > what I'm trying to estimate):
> > 
> > RSS = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> > + W4C)]^2 + and so on.
> > ...SNIP...
> --------
>   Let's see if I understand what you are saying, Vivek.  Each DNA
> sequence position is to possess four different weight values
> corresponding to the four possible bases, so that with n sequence
> positions you would have 4*n variables to adjust.  I am assuming each
> of the DNA sequences in your set has the same length, namely n.  You
> want them to be such as to minimize the sum of the squares:
> 
>  (W1A+W2C+W3A+W4G-8)^2 + (W1A+W2T+W3T+W4C-10)^2 + ...
> 
> in accordance with your set of aligned DNA sequences.
> 
>   If that is what you want, this can indeed be solved using the
> backslash operator with no need for the Optimization Toolbox.  (I have
> no idea what you have in mind with the subtraction of 'T' weights to
> reduce the number of parameters, but such a manipulation is not
> necessary to solve the problem.)  Your problem has a very definite,
> unique answer, provided the number of different sequences is equal to
> or greater than four times the number, n, of sequence positions.
> 
>   Write a matrix M that looks like this for your particular example
> with n = 4:
> 
> M = [1 0 0 0  0 1 0 0  1 0 0 0  0 0 1 0; % ACAG
>      1 0 0 0  0 0 0 1  0 0 0 1  0 1 0 0; % ATTC
>      0 0 1 0  0 0 1 0  0 0 0 1  1 0 0 0; % GGTA
>      0 1 0 0  0 1 0 0  0 0 1 0  0 0 0 1; % CCGT
>      ......
> 
> Here, the binary sequence 1 0 0 0 occurs for A, 0 1 0 0 for C,
> 0 0 1 0 for G, and 0 0 0 1 for T.  The column vector of "activities"
> would be:
> 
> A = [8;10;5;4;.....];
> 
> Then the equation
> 
> W = M\A
> 
> would provide the answer you seek, where it is understood that
> 
> W = [W1A;W1C;W1G;W1T;
>      W2A;W2C;W2G;W2T;
>      W3A;W3C;W3G;W3T;
>      W4A;W4C;W4G;W4T];
> 
>   Note that in this case with n = 4, you would need at least
> 4*4 = 16 sequences to ensure a unique answer.  In case there are more
> sequences, MATLAB's backslash operator would provide a least squares
> answer.
> 
>   To obtain the M matrix you would undoubtedly want some routine that
> would convert whatever representation you are using for the four bases
> over to the corresponding four-element binary numbers.  If the above
> is actually what you need, then I am sure one of us can provide such a
> routine if you state how you are representing these bases in a given
> set of sequences (i.e., as ASCII characters, numbers from 1 to 4, etc.)
> 
> Roger Stafford
> 
------------------------
Hi Roger,

Thanks very much for responding to my query and offering to
help me out. 

I must say, you have understood my question correctly, and I
appreciate your solution to the problem. I have tried that, but
I took just a binary 1 or 0 representation for each base
instead of the 1000, 0100 type. I'm thankful to Walter Roberson,
who gave me that routine.
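
For reference, the 1000/0100-type encoding described in the quoted
message can be sketched as below. This is only a minimal sketch, not
Walter's routine: it assumes the sequences are stored as rows of a
char array, and the names seqs and bases are made up for illustration.
It also uses pinv instead of backslash, which is a substitution on my
part, because the matrix is not full rank (see the next paragraph) and
pinv returns the well-defined minimum-norm least-squares solution.

```matlab
% Hypothetical data: the four example rows from the thread.
seqs = ['ACAG'; 'ATTC'; 'GGTA'; 'CCGT'];   % one aligned sequence per row
A    = [8; 10; 5; 4];                      % measured activities

bases  = 'ACGT';                           % column order inside each position
[m, n] = size(seqs);                       % m sequences, each of length n
M = zeros(m, 4*n);
for j = 1:n
    for b = 1:4
        % Column 4*(j-1)+b flags base number b at position j.
        M(:, 4*(j-1) + b) = (seqs(:, j) == bases(b));
    end
end

% Minimum-norm least-squares weights, ordered W1A, W1C, W1G, W1T, W2A, ...
W = pinv(M) * A;
```

With only these four sequences the system is heavily underdetermined,
so W is just one of infinitely many weight vectors that reproduce the
activities exactly.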

The most important issue with my result was that my matrix X,
which holds these binary representations, is not full
rank. I think some columns are collinear (or
multicollinear). Matrix Y is my activity vector. So I was
trying to use PLSR to avoid this collinearity and to include the
full DNA sequence (about 50 letters long), which was also
not successful due to lack of correlations. I'm still
working on that.
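
One likely source of the collinearity with the four-column-per-position
encoding: every sequence has exactly one base at each position, so the
four indicator columns of any one position always add up to a column
of ones. The blocks for different positions therefore share that sum,
and the rank can be at most 3*n + 1 rather than 4*n, no matter how
many sequences are included. The sketch below checks this on a small
hypothetical case (n = 3, all 64 possible sequences); the variable
names are illustrative only.

```matlab
n = 3;                                   % small hypothetical sequence length
m = 4^n;                                 % every possible sequence of length n

% idx(i, j) is the base number (1..4) at position j of sequence i.
idx = zeros(m, n);
for j = 1:n
    idx(:, j) = mod(floor((0:m-1)' / 4^(j-1)), 4) + 1;
end

% Four indicator columns per position, as in the quoted M matrix.
X = zeros(m, 4*n);
for j = 1:n
    cols = 4*(j-1) + idx(:, j);          % flagged column in block j, per row
    X(sub2ind(size(X), (1:m)', cols)) = 1;
end

r = rank(X);                             % 3*n + 1 = 10, though X has 12 columns
```

So even with every possible sequence present, this X is short of full
column rank by n - 1.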

Meanwhile, I was thinking: why not just put all the equations
into one residual-sum-of-squares expression and solve it for
smaller segments of DNA? (I'm not sure whether the Optimization
Toolbox would be helpful for that.) I was trying to eliminate 'T'
so that I have fewer variables to handle.
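
A rough sketch of what eliminating 'T' could look like, under my own
assumptions: T is treated as the reference base at every position, so
each position keeps only three columns (A, C, G), and one intercept
column is added. Each weight then measures the effect of a base
relative to T at that position, and the intercept is the fitted
activity of an all-T sequence. The data here are synthetic (all
sequences of length 3 with made-up weights) and serve only to show
that this reduced matrix has full column rank, so backslash gives a
unique answer.

```matlab
n = 3;  m = 4^n;                         % hypothetical: all sequences of length 3
letters = 'ACGT';
seqs = repmat(' ', m, n);
for j = 1:n
    k = mod(floor((0:m-1)' / 4^(j-1)), 4) + 1;
    seqs(:, j) = reshape(letters(k), [], 1);
end

% Intercept plus three indicator columns (A, C, G) per position; T is dropped.
Xr = ones(m, 1 + 3*n);
for j = 1:n
    for b = 1:3
        Xr(:, 1 + 3*(j-1) + b) = (seqs(:, j) == letters(b));
    end
end

wTrue = (1:size(Xr, 2))';                % made-up weights, only to test recovery
Y     = Xr * wTrue;                      % noise-free synthetic activities
wHat  = Xr \ Y;                          % unique least-squares solution
```

With real data the same construction applies, provided the sequences
are varied enough that Xr keeps full column rank.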
Someone suggested that this may be easy in Mathematica. I
don't understand how that would be. If it is easy, then MATLAB
will be my choice.
Thanks again. Do send your comments.
Regards,
Vivek