From: "Roger Stafford" <ellieandrogerxyzzy@mindspring.com.invalid>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Is this kind of regression possible?
Date: Sat, 1 Dec 2007 03:02:54 +0000 (UTC)
Organization: The MathWorks, Inc.
Message-ID: <fiqisu$l62$1@fred.mathworks.com>
References: <fio8nm$6k4$1@fred.mathworks.com>


"Vivek " <vivek_mutalik@yahoo.com> wrote in message <fio8nm$6k4
$1@fred.mathworks.com>...
> Hi,
> 
> I m having difficulty in formulating following problem. If
> you have any suggestions that'll be great.
> 
> Ive set of "aligned DNA sequences" with their activities. I
> want to do regression so that i can get weights for each
> base (A,C,G,T). This may help me in understanding which
> bases are 'important and contribute' towards measured activity. 
> Example: My activity VS sequence table looks like
> (1) 08 ACAG
> (2) 10 ATTC
> (3) 05 GGTA
> (4) 04 CCGT
> (5) ... ....
>    ....etc
> 
> My solution would be: to minimize the residual sum of square:
> (here W is weight of that particular base, which is what im
> trying to estimate)
> 
> = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> +W4C)]^2 + and so on.
> ...SNIP...
--------
  Let's see if I understand what you are saying, Vivek.  Each DNA sequence 
position is to possess four different weight values corresponding to the four 
possible bases, so that with n sequence positions you would have 4*n 
variables to adjust.  I am assuming each of the DNA sequences in your set 
has the same length, namely n.  You want them to be such as to minimize the 
sum of the squares:

 (W1A+W2C+W3A+W4G-8)^2 + (W1A+W2T+W3T+W4C-10)^2 + ...

in accordance with your set of aligned DNA sequences.

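  If that is what you want, this can indeed be solved using the backslash 
operator with no need for the Optimization Toolbox.  (I am not sure what 
you have in mind with the subtraction of 'T' weights to reduce the number of 
parameters; backslash will return a least squares solution without it.)  
The fitted activities are uniquely determined, but note that for n > 1 the 
individual weights are not: every sequence has exactly one base at each 
position, so adding a constant to all four weights at one position and 
subtracting it from all four at another changes none of the sums.  At most 
3*n+1 independent combinations of the 4*n weights can be determined, and 
that requires at least 3*n+1 suitably varied sequences.  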

  Write a matrix M that looks like this for your particular example with n = 4:

M = [1 0 0 0  0 1 0 0  1 0 0 0  0 0 1 0; % ACAG
     1 0 0 0  0 0 0 1  0 0 0 1  0 1 0 0; % ATTC
     0 0 1 0  0 0 1 0  0 0 0 1  1 0 0 0; % GGTA
     0 1 0 0  0 1 0 0  0 0 1 0  0 0 0 1; % CCGT
     ......

Here, the binary sequence 1 0 0 0 occurs for A, 0 1 0 0 for C, 0 0 1 0 for G, 
and 0 0 0 1 for T.  The column vector of "activities" would be:

A = [8;10;5;4;.....];

Then the equation

W = M\A

would provide the answer you seek, where it is understood that

W = [W1A;W1C;W1G;W1T;
     W2A;W2C;W2G;W2T;
     W3A;W3C;W3G;W3T;
     W4A;W4C;W4G;W4T];
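
  Given that ordering, it may be easier to read the weights as a table with 
one row per base and one column per position.  The lines below are only a 
sketch of the indexing: W holds placeholder numbers rather than fitted 
weights, and the names Wtab and w3C are my own.

```matlab
% Placeholder values only, chosen so that each entry shows its own index in W.
W = (1:16)';

% reshape fills column by column, so column j holds [WjA; WjC; WjG; WjT].
Wtab = reshape(W, 4, []);   % rows: A, C, G, T; columns: positions 1..n

% Weight of base C (row 2) at position 3, i.e. W(4*(3-1) + 2).
w3C = Wtab(2, 3);
```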

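  Note that in this case with n = 4 there are 4*4 = 16 weights, but each 
group of four columns in M sums to a column of ones, so M has rank at most 
3*4+1 = 13 and the individual weights are not uniquely determined however 
many sequences you have.  With more than 16 sequences, MATLAB's backslash 
operator will warn about the rank deficiency and return one of the least 
squares solutions; pinv(M)*A returns the one of minimum norm.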
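
  One way to see how many weights a given data set can actually determine is 
to compare rank(M) with the number of columns, 4*n.  Every row of M has 
exactly one 1 in each group of four columns, so each group sums to a column 
of ones, and it is worth looking at the rank before trusting individual 
weights.  The sketch below uses made-up data (all 16 two-base sequences, 
with activities generated from invented weights), and the use of pinv to 
pick the minimum-norm least squares solution is my choice for illustration, 
not the only option.

```matlab
% Hypothetical data: all 4^2 = 16 two-base sequences, so m = 16 and n = 2.
[b1, b2] = ndgrid(1:4, 1:4);
idx = [b1(:), b2(:)];                  % base index (1..4 for A,C,G,T) per position
[m, n] = size(idx);

% Indicator matrix: row i has a 1 in column 4*(j-1) + idx(i,j) for each position j.
M = zeros(m, 4*n);
rows = repmat((1:m)', 1, n);
cols = 4*repmat(0:n-1, m, 1) + idx;
M(sub2ind(size(M), rows, cols)) = 1;

r = rank(M);                           % compare with the 4*n columns of M

% Made-up activities generated from made-up weights, so an exact fit exists.
A = M * [1;2;3;4; 5;6;7;8];

W = pinv(M) * A;                       % minimum-norm least squares solution
resid = norm(M*W - A);                 % size of the least squares residual
```

  If r comes out smaller than 4*n, the fit M*W is still meaningful, but 
only differences between weights at the same position (and the overall 
level) should be interpreted.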

  To obtain the M matrix you would undoubtedly want some routine that 
would convert whatever representation you are using for the four bases over 
to the corresponding four-element binary sequences.  If the above is actually 
what you need, then I am sure one of us can provide such a routine if you 
state how you are representing these bases in a given set of sequences (e.g., 
as ASCII characters, numbers from 1 to 4, etc.)
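
  As an illustration of what such a routine might look like, here is a 
minimal sketch that assumes the sequences are stored as the rows of a char 
array, all of the same length n, using only the letters A, C, G and T.  That 
storage format is an assumption on my part, and the variable names (seqs, 
bases, idx) are just for this example.

```matlab
% Assumption: aligned sequences are rows of a char array, all of length n.
seqs  = ['ACAG'; 'ATTC'; 'GGTA'; 'CCGT'];  % the four sequences from the example
bases = 'ACGT';                            % column order within each group of four

[m, n] = size(seqs);
[tf, idx] = ismember(seqs, bases);         % idx(i,j) is 1..4 for A, C, G, T
if ~all(tf(:))
    error('Sequences may contain only the characters A, C, G and T.');
end

% Sequence i, position j sets column 4*(j-1) + idx(i,j) of row i to 1.
rows = repmat((1:m)', 1, n);
cols = 4*repmat(0:n-1, m, 1) + idx;
M = zeros(m, 4*n);
M(sub2ind(size(M), rows, cols)) = 1;
```

  For the four sequences shown, this reproduces the four rows of M written 
out above.  If the bases were stored as numbers from 1 to 4 instead, the 
ismember step could be dropped and that numeric array used directly as idx.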

Roger Stafford