Thread Subject: Is this kind of regression possible?

Subject: Is this kind of regression possible?

From: VIVEK

Date: 30 Nov, 2007 05:57:10

Message: 1 of 31

Hi,

I'm having difficulty formulating the following problem. If
you have any suggestions, that would be great.

I have a set of "aligned DNA sequences" with their activities. I
want to do a regression so that I can get weights for each
base (A,C,G,T). This may help me understand which
bases are 'important and contribute' towards the measured activity.
Example: My activity VS sequence table looks like
(1) 08 ACAG
(2) 10 ATTC
(3) 05 GGTA
(4) 04 CCGT
(5) ... ....
   ....etc

My solution would be to minimize the residual sum of squares
(here W is the weight of that particular base, which is what I'm
trying to estimate):

= [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
+W4C)]^2 + and so on.

To reduce the number of parameters to be determined, I can subtract
the weight of 'T' from each of the weights and finally add the sum of
all 'T' weights (as if 'T' is in all positions).

That is:

= [8 - (W1A-W1T + W2C-W2T + W3A-W3T + W4G-W4T) +
(W1T+W2T+W3T+W4T)]^2
 +
[10 - (W1A-W1T + W4C -W1T) + (W1T+W2T+W3T+W4T) ]^2

+ so on;

Does this make sense? In this way, I was thinking of
getting weights for all bases by using some kind of
residual-minimizing function. Is that possible?

Subject: Re: Is this kind of regression possible?

From: John D'Errico

Date: 30 Nov, 2007 11:13:29

Message: 2 of 31

"Vivek " <vivek_mutalik@yahoo.com> wrote in message
<fio8nm$6k4$1@fred.mathworks.com>...
> Hi,
>
> I m having difficulty in formulating following problem. If
> you have any suggestions that'll be great.

As you state, you are having difficulty in
formulating what you need to do. But you
need to explain your problem well in order
for us to help you.

 
> Ive set of "aligned DNA sequences" with their activities. I
> want to do regression so that i can get weights for each
> base (A,C,G,T). This may help me in understanding which
> bases are 'important and contribute' towards measured activity.
> Example: My activity VS sequence table looks like
> (1) 08 ACAG
> (2) 10 ATTC
> (3) 05 GGTA
> (4) 04 CCGT
> (5) ... ....
> ....etc
>
> My solution would be: to minimize the residual sum of square:
> (here W is weight of that particular base, which is what im
> trying to estimate)
>
> = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> +W4C)]^2 + and so on.

Here from what you say, it looks like
{W1, W2, W3, W4} are numbers, unknowns
that you wish to determine?

Likewise, are {A,C,G,T} also each scalar
numbers? What do they denote? Do these
"variables" have known values?


> to reduce the parameters to be determined, I can substract
> weight of 'T' from each of weights and finally add sum of
> all 'T' (as if 'T' is in all positions).
>
> That is:
>
> = [8 - (W1A-W1T + W2C-W2T + W3A-W3T + W4G-W4T) +
> (W1T+W2T+W3T+W4T)]^2
> +
> [10 - (W1A-W1T + W4C -W1T) + (W1T+W2T+W3T+W4T) ]^2
>
> + so on;

I don't see how this reduces anything. In
fact, it's not even mathematically equivalent
to the original expression. You are essentially
ADDING 2*(W1T+W2T+W3T+W4T) inside
each term in the sum. Note that in
mathematics, two negatives do indeed
make a positive.

Even if you get your signs right, what is
the purpose of this operation?
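
The sign issue can be checked numerically. The following is only a sketch: the weight table W is made up for illustration (rows A,C,G,T, columns positions 1 to 4), and the variable names are not from the post. It evaluates the first residual (activity 8, sequence ACAG) in the original form, in the form as posted, and with the sum of T weights kept inside the subtracted bracket.

```matlab
% Made-up weights for illustration: rows are bases A,C,G,T; columns are positions 1..4.
W = [ 8  5 -9 -3;
     -4 -7 -3  3;
      3  1  1 -4;
      5 -4 -5  6];
A = 1; C = 2; G = 3; T = 4;          % row index of each base
sumT = sum(W(T,:));                  % W1T + W2T + W3T + W4T

% Residual for the first sequence (activity 8, ACAG), original form.
r_orig = 8 - (W(A,1) + W(C,2) + W(A,3) + W(G,4));

% Sum of the differences from the T weight at each position.
d = (W(A,1)-W(T,1)) + (W(C,2)-W(T,2)) + (W(A,3)-W(T,3)) + (W(G,4)-W(T,4));

r_posted = 8 - d + sumT;             % as posted: sumT ends up added to the residual
r_fixed  = 8 - (d + sumT);           % sumT kept inside the subtracted bracket

fprintf('original %g, as posted %g, regrouped %g\n', r_orig, r_posted, r_fixed);
```

With these numbers the posted form differs from the original by exactly 2*sumT, while the regrouped form matches it.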

 
> Is it making any sense? So by this way, i was thinking of
> getting weights for all bases by using some kind of residual
> minimizing function. Is it possible ?

I'm sorry, you are not making sense yet, at
least not completely. But please try again.

John


Subject: Re: Is this kind of regression possible?

From: Per Sundqvist

Date: 30 Nov, 2007 11:56:04

Message: 3 of 31

"Vivek " <vivek_mutalik@yahoo.com> wrote in message
<fio8nm$6k4$1@fred.mathworks.com>...

F=[8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
   +W4C)]^2 + and so on.

dF/dW1A=0, etc...

8+10=(1+1)*W1A+1*W2C+1*W3A+...

b=A*w

Hmm, this minimization looks like it is equivalent to solving a
linear equation Aw=b. You have 16 unknowns, right? W1A W2A W3A
W4A, W1C W2C, ... = w, the unknown vector. So you need at least 16
equations to get these 16 weights. If you have more
equations, you take the backslash least-squares solution. You
will get a matrix A, which you have to work out in some
clever way, depending on how your data is arranged. (1+1)
should be replaced by the number of sequences that have A at
the first position, and in the corresponding element of b you sum the
activities of those sequences, 8+10+....

Maybe it helps you a little,
Per
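
The step from dF/dW1A = 0 to a linear system can be sketched in MATLAB. This assumes a one-hot design matrix M with four columns per position in the order A,C,G,T (the post does not fix a column order, so this layout and the names M, b, N and rhs are assumptions). Setting all partial derivatives to zero gives the normal equations (M'*M)*w = M'*b, whose first row is the W1A equation above.

```matlab
% One-hot rows for the four example sequences; each position uses four
% columns in the assumed order A,C,G,T.
M = [1 0 0 0  0 1 0 0  1 0 0 0  0 0 1 0;   % ACAG
     1 0 0 0  0 0 0 1  0 0 0 1  0 1 0 0;   % ATTC
     0 0 1 0  0 0 1 0  0 0 0 1  1 0 0 0;   % GGTA
     0 1 0 0  0 1 0 0  0 0 1 0  0 0 0 1];  % CCGT
b = [8; 10; 5; 4];                         % measured activities

N   = M' * M;   % N(i,j): number of sequences in which unknowns i and j both occur
rhs = M' * b;   % rhs(i): summed activity of the sequences in which unknown i occurs

% Row 1 is the equation dF/dW1A = 0. A is first in ACAG and ATTC, so the
% coefficient of W1A is 1+1 and the right-hand side is 8+10.
disp(N(1,:));
disp(rhs(1));
```

Four sequences are far too few to determine the weights; the block only illustrates how the counts and the summed activities arise.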

Subject: Re: Is this kind of regression possible?

From: VIVEK

Date: 30 Nov, 2007 15:08:14

Message: 4 of 31

"John D'Errico" <woodchips@rochester.rr.com> wrote in
message <fior8p$asa$1@fred.mathworks.com>...
-------------------------------------------------------
Sorry for not being clear. Here are my answers.

1. {W1, W2, W3, W4} are the unknowns I wish to determine.

2. Are {A,C,G,T} also each scalar numbers? What do they
denote? Do these "variables" have known values?

They are not numbers. (For non-biological readers:) DNA
is made up of these A, C, G and T's, for example AGCTGCTAACAGT...
So each sequence (string) is made up of these. When I
align them, I know how many times A is at the 1st
position, how many times C is at the second position, and so on.
So, in a way, you know their occurrence at a particular position,
but they don't have known values.

3. Parameter reduction: "it's not even mathematically
equivalent..."

I was thinking of something like this.
The following matrix represents "assumed values for assumed
sequences (not from my original post)", as if I had calculated
weights for each letter at the different positions. Then I'll have:
 
position 1 2 3 4

A 8 5 -9 -3
C -4 -7 -3 3
G 3 1 1 -4
T 5 -4 -5 6
  
Now, this matrix can be used to evaluate any sequence by
extracting and adding the element from each column that
corresponds to the sequence.
For example, if I have the sequence AAGT, then I pick
AAGT = 8+5+1+6 = 20
CCGT = (-4)+(-7)+1+6 = -4

So let's look at reducing the variables. See this matrix;
I explain it below.

position 1 2 3 4

A 3 9 -4 -9
C -9 -3 2 -3
G -2 5 6 -10
T 0 0 0 0
For example, if I have the sequence AAGT, then I pick
AAGT = 3+9+6+0 = 18 + 2 = 20
Here 2 comes from adding all the T's, i.e. 5-4-5+6 = 2

To reduce the variables, the value for each 'T' is set to
zero and a constant term is added. This constant term is the
value given to the sequence of all T's. The other elements are
the differences between having each of the other bases at a
position and having a T there. So the total number of variables is
three times the number of positions, plus one. Is this known
as dummy encoding in statistical analysis? I'm not sure.
Anyway, I think this boils down to the equation I
posted earlier. Please comment if it is not yet clear.
Thanks again for writing back.
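
The table lookup and the T-as-reference reduction described in this message can be sketched as follows. The weight table is the assumed one from the message; the lookup vector lut and the helper score are hypothetical names introduced only for this illustration.

```matlab
% Assumed weight table from the message: rows A,C,G,T; columns positions 1..4.
W = [ 8  5 -9 -3;
     -4 -7 -3  3;
      3  1  1 -4;
      5 -4 -5  6];

lut = zeros(1, 128);
lut('ACGT') = 1:4;                    % ASCII code of a base -> row index 1..4

% Score of a sequence s: pick one entry per column and add them up.
score = @(Wt, s) sum(Wt(sub2ind(size(Wt), lut(double(s)), 1:numel(s))));

fprintf('AAGT = %g, CCGT = %g\n', score(W, 'AAGT'), score(W, 'CCGT'));

% Reduction: subtract the T row from every row, keep its sum as a constant.
c = sum(W(4,:));                      % score of the all-T sequence
D = W - repmat(W(4,:), 4, 1);         % differences from T; the T row becomes zero

fprintf('AAGT via reduced table = %g\n', c + score(D, 'AAGT'));
```

The reduced table D has only three free rows per position, which together with the constant c gives the "three times the number of positions, plus one" count mentioned above.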

Subject: Re: Is this kind of regression possible?

From: VIVEK

Date: 30 Nov, 2007 15:22:52

Message: 5 of 31

"Per Sundqvist" <per.sundqvist@uam.es> wrote in message
<fiotok$a9s$1@fred.mathworks.com>...
-----------------------------
You are absolutely correct. W is the unknown vector. I can't
have more unknowns than equations for a
backslash least-squares solution. That's why I was thinking of
reducing my variables by taking T as zero. I was thinking that
adding all 8+10+... won't give the right answer (do you think it
will?). If I keep them in one column vector (b), and I have a
matrix A representing the occurrence of letters, I need to see
how I will know which weight corresponds to which
position. Thanks for replying.
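
One way to keep track of which weight belongs to which position is to fix the stacking order of the unknown vector up front. The sketch below assumes the weights are stacked position by position, with bases in the order A,C,G,T inside each position; that order is an assumption, and any fixed order works as long as the columns of the occurrence matrix follow it.

```matlab
n = 4;                              % number of sequence positions
bases = 'ACGT';
k = 1:4*n;                          % index into the stacked weight vector w

pos  = ceil(k / 4);                 % position that entry k belongs to
base = bases(mod(k - 1, 4) + 1);    % base that entry k belongs to

for i = k
    fprintf('w(%2d) is W%d%c\n', i, pos(i), base(i));
end

% With this order, a solved weight vector reshapes directly into a
% table with rows A,C,G,T and one column per position.
w = (1:4*n)';                       % placeholder values standing in for a solution
Wtab = reshape(w, 4, n);
disp(Wtab);
```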

Subject: Re: Is this kind of regression possible?

From: VIVEK

Date: 01 Dec, 2007 00:51:50

Message: 7 of 31

"Vivek " <vivek_mutalik@yahoo.com> wrote in message
<fio8nm$6k4$1@fred.mathworks.com>...

...............................................

Can it be solved with the Optimization Toolbox?

Subject: Re: Is this kind of regression possible?

From: Roger Stafford

Date: 01 Dec, 2007 03:02:54

Message: 8 of 31

"Vivek " <vivek_mutalik@yahoo.com> wrote in message
<fio8nm$6k4$1@fred.mathworks.com>...
> Hi,
>
> I m having difficulty in formulating following problem. If
> you have any suggestions that'll be great.
>
> Ive set of "aligned DNA sequences" with their activities. I
> want to do regression so that i can get weights for each
> base (A,C,G,T). This may help me in understanding which
> bases are 'important and contribute' towards measured activity.
> Example: My activity VS sequence table looks like
> (1) 08 ACAG
> (2) 10 ATTC
> (3) 05 GGTA
> (4) 04 CCGT
> (5) ... ....
> ....etc
>
> My solution would be: to minimize the residual sum of square:
> (here W is weight of that particular base, which is what im
> trying to estimate)
>
> = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> +W4C)]^2 + and so on.
> ...SNIP...
--------
  Let’s see if I understand what you are saying, Vivek. Each DNA sequence
position is to possess four different weight values corresponding to the four
possible bases, so that with n sequence positions you would have 4*n
variables to adjust. I am assuming each of the DNA sequences in your set
has the same length, namely n. You want them to be such as to minimize the
sum of the squares:

 (W1A+W2C+W3A+W4G-8)^2 + (W1A+W2T+W3T+W4C-10)^2 + ...

in accordance with your set of aligned DNA sequences.

  If that is what you want, this can indeed be solved using the backslash
operator with no need for the Optimization Toolbox. (I have no idea what
you have in mind with the subtraction of 'T' weights to reduce the number of
parameters, but such a manipulation is not necessary to solve the problem.)
Your problem has a very definite, unique answer, provided the number of
different sequences is equal to or greater than four times the number, n, of
sequence positions.

  Write a matrix M that looks like this for your particular example with n = 4:

M = [1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0; % ACAG
     1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0; % ATTC
     0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0; % GGTA
     0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1; % CCGT
     ......

Here, the binary sequence 1 0 0 0 occurs for A, 0 1 0 0 for C, 0 0 1 0 for G,
and 0 0 0 1 for T. The column vector of "activities" would be:

A = [8;10;5;4;.....];

Then the equation

W = M\A

would provide the answer you seek, where it is understood that

W = [W1A;W1C;W1G;W1T;
     W2A;W2C;W2G;W2T;
     W3A;W3C;W3G;W3T;
     W4A;W4C;W4G;W4T];

  Note that in this case with n = 4, you would need at least 4*4 = 16
sequences to ensure a unique answer. In case there are more sequences,
MATLAB’s backslash operator would provide a least squares answer.
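
The uniqueness condition is worth checking numerically before relying on it. In a one-hot layout the four columns of any one position always add up to a column of ones, so the 4*n columns are linearly dependent no matter how many sequences are supplied. The sketch below uses all 256 sequences of length 4 and a made-up weight table W (both chosen only for illustration), and compares the full layout with a reduced one that drops each T column and adds a constant column, as in the T-as-zero idea raised earlier in the thread.

```matlab
% All 4^4 sequences of length 4 as base indices (1..4 = A,C,G,T).
[i1, i2, i3, i4] = ndgrid(1:4);
idx = [i1(:) i2(:) i3(:) i4(:)];            % 256-by-4
[m, n] = size(idx);

% One-hot design matrix, four columns (A,C,G,T) per position.
M = zeros(m, 4*n);
for p = 1:n
    M(sub2ind(size(M), (1:m)', 4*(p-1) + idx(:,p))) = 1;
end
fprintf('rank(M) = %d of %d columns\n', rank(M), size(M, 2));

% Reduced layout: drop every T column, add a constant column.
keep = setdiff(1:4*n, 4:4:4*n);
X = [ones(m, 1) M(:, keep)];
fprintf('rank(X) = %d of %d columns\n', rank(X), size(X, 2));

% Noise-free activities from a made-up table (rows A,C,G,T; columns positions).
W = [8 5 -9 -3; -4 -7 -3 3; 3 1 1 -4; 5 -4 -5 6];
y = M * W(:);

beta = X \ y;                               % least-squares fit in the reduced layout
c = beta(1);                                % fitted value of the all-T sequence
D = reshape(beta(2:end), 3, n);             % A,C,G weights relative to T
disp(c); disp(D);
```

Both matrices have rank 13 = 3*4 + 1, so only the reduced layout has full column rank. With the full layout, backslash still returns a least-squares solution, but it is one of infinitely many that fit equally well.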

  To obtain the M matrix you would undoubtedly want some routine that
would convert whatever representation you are using for the four bases over
to the corresponding four-element binary numbers. If the above is actually
what you need, then I am sure one of us can provide such a routine if you
state how you are representing these bases in a given set of sequences (i.e.,
as ASCII characters, numbers from 1 to 4, etc.)
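
A minimal sketch of such a routine, assuming the sequences are stored as rows of a character array using the letters A, C, G and T. The names seqs, act and lut are illustrative; act is used for the activity vector to avoid confusion with the base A.

```matlab
seqs = ['ACAG'; 'ATTC'; 'GGTA'; 'CCGT'];    % one aligned sequence per row
act  = [8; 10; 5; 4];                       % matching activities

lut = zeros(1, 128);
lut('ACGT') = 1:4;                          % ASCII code of a base -> index 1..4
idx = lut(double(seqs));                    % m-by-n matrix of base indices
if any(idx(:) == 0)
    error('Sequences may only contain the letters A, C, G and T.');
end

% Four columns per position in the order A,C,G,T; set a single 1 per
% position in each row.
[m, n] = size(idx);
M = zeros(m, 4*n);
for p = 1:n
    M(sub2ind(size(M), (1:m)', 4*(p-1) + idx(:,p))) = 1;
end
disp(M);
```

For these four sequences the result reproduces the four rows of M written out above; with the full data set, M and act can then be passed to the backslash operator.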

Roger Stafford

Subject: Re: Is this kind of regression possible?

From: Roger Stafford

Date: 01 Dec, 2007 03:06:14

Message: 13 of 31

"Vivek " <vivek_mutalik@yahoo.com> wrote in message <fio8nm$6k4
$1@fred.mathworks.com>...
> Hi,
>
> I m having difficulty in formulating following problem. If
> you have any suggestions that'll be great.
>
> Ive set of "aligned DNA sequences" with their activities. I
> want to do regression so that i can get weights for each
> base (A,C,G,T). This may help me in understanding which
> bases are 'important and contribute' towards measured activity.
> Example: My activity VS sequence table looks like
> (1) 08 ACAG
> (2) 10 ATTC
> (3) 05 GGTA
> (4) 04 CCGT
> (5) ... ....
> ....etc
>
> My solution would be: to minimize the residual sum of square:
> (here W is weight of that particular base, which is what im
> trying to estimate)
>
> = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> +W4C)]^2 + and so on.
> ...SNIP...
--------
  Let’s see if I understand what you are saying, Vivek. Each DNA sequence
position is to possess four different weight values corresponding to the four
possible bases, so that with n sequence positions you would have 4*n
variables to adjust. I am assuming each of the DNA sequences in your set
has the same length, namely n. You want them to be such as to minimize the
sum of the squares:

 (W1A+W2C+W3A+W4G-8)^2 + (W1A+W2T+W3T+W4C-10)^2 + ...

in accordance with your set of aligned DNA sequences.

  If that is what you want, this can indeed be solved using the backslash
operator with no need for the Optimization Toolbox. (I have no idea what
you have in mind with the subtraction of 'T' weights to reduce the number of
parameters, but such a manipulation is not necessary to solve the problem.)
Your problem has a very definite, unique answer, provided the number of
different sequences is equal to or greater than four times the number, n, of
sequence positions.

  Write a matrix M that looks like this for your particular example with n = 4:

M = [1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0; % ACAG
     1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0; % ATTC
     0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0; % GGTA
     0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1; % CCGT
     ......

Here, the binary sequence 1 0 0 0 occurs for A, 0 1 0 0 for C, 0 0 1 0 for G,
and 0 0 0 1 for T. The column vector of "activities" would be:

A = [8;10;5;4;.....];

Then the equation

W = M\A

would provide the answer you seek, where it is understood that

W = [W1A;W1C;W1G;W1T;
     W2A;W2C;W2G;W2T;
     W3A;W3C;W3G;W3T;
     W4A;W4C;W4G;W4T];

  Note that in this case with n = 4, you would need at least 4*4 = 16
sequences to ensure a unique answer. In case there are more sequences,
matlab’s backslash operator would provide a least squares answer.

  To obtain the M matrix you would undoubtedly want some routine that
would convert whatever representation you are using for the four bases over
to the corresponding four-element binary numbers. If the above is actually
what you need, then I am sure one of us can provide such a routine if you
state how you are representing these bases in a given set of sequences (i.e.,
as ASCII characters, numbers from 1 to 4, etc.)

Roger Stafford

Subject: Re: Is this kind of regression possible?

From: Roger Stafford

Date: 01 Dec, 2007 03:06:30

Message: 14 of 31

"Vivek " <vivek_mutalik@yahoo.com> wrote in message <fio8nm$6k4
$1@fred.mathworks.com>...
> Hi,
>
> I m having difficulty in formulating following problem. If
> you have any suggestions that'll be great.
>
> Ive set of "aligned DNA sequences" with their activities. I
> want to do regression so that i can get weights for each
> base (A,C,G,T). This may help me in understanding which
> bases are 'important and contribute' towards measured activity.
> Example: My activity VS sequence table looks like
> (1) 08 ACAG
> (2) 10 ATTC
> (3) 05 GGTA
> (4) 04 CCGT
> (5) ... ....
> ....etc
>
> My solution would be: to minimize the residual sum of square:
> (here W is weight of that particular base, which is what im
> trying to estimate)
>
> = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> +W4C)]^2 + and so on.
> ...SNIP...
--------
  Let’s see if I understand what you are saying, Vivek. Each DNA sequence
position is to possess four different weight values corresponding to the four
possible bases, so that with n sequence positions you would have 4*n
variables to adjust. I am assuming each of the DNA sequences in your set
has the same length, namely n. You want them to be such as to minimize the
sum of the squares:

 (W1A+W2C+W3A+W4G-8)^2 + (W1A+W2T+W3T+W4C-10)^2 + ...

in accordance with your set of aligned DNA sequences.

  If that is what you want, this can indeed be solved using the backslash
operator with no need for the Optimization Toolbox. (I have no idea what
you have in mind with the subtraction of 'T' weights to reduce the number of
parameters, but such a manipulation is not necessary to solve the problem.)
Your problem has a very definite, unique answer, provided the number of
different sequences is equal to or greater than four times the number, n, of
sequence positions.

  Write a matrix M that looks like this for your particular example with n = 4:

M = [1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0; % ACAG
     1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0; % ATTC
     0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0; % GGTA
     0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1; % CCGT
     ......

Here, the binary sequence 1 0 0 0 occurs for A, 0 1 0 0 for C, 0 0 1 0 for G,
and 0 0 0 1 for T. The column vector of "activities" would be:

A = [8;10;5;4;.....];

Then the equation

W = M\A

would provide the answer you seek, where it is understood that

W = [W1A;W1C;W1G;W1T;
     W2A;W2C;W2G;W2T;
     W3A;W3C;W3G;W3T;
     W4A;W4C;W4G;W4T];

  Note that in this case with n = 4, you would need at least 4*4 = 16
sequences to ensure a unique answer. In case there are more sequences,
matlab’s backslash operator would provide a least squares answer.

  To obtain the M matrix you would undoubtedly want some routine that
would convert whatever representation you are using for the four bases over
to the corresponding four-element binary numbers. If the above is actually
what you need, then I am sure one of us can provide such a routine if you
state how you are representing these bases in a given set of sequences (i.e.,
as ASCII characters, numbers from 1 to 4, etc.)

Roger Stafford

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 01 Dec, 2007 03:28:32

Message: 15 of 31

"Roger Stafford" <ellieandrogerxyzzy@mindspring.com.invalid>
wrote in message <fiqisu$l62$1@fred.mathworks.com>...
> "Vivek " <vivek_mutalik@yahoo.com> wrote in message <fio8nm$6k4
> $1@fred.mathworks.com>...
> > Hi,
> >
> > I m having difficulty in formulating following problem. If
> > you have any suggestions that'll be great.
> >
> > Ive set of "aligned DNA sequences" with their activities. I
> > want to do regression so that i can get weights for each
> > base (A,C,G,T). This may help me in understanding which
> > bases are 'important and contribute' towards measured activity.
> > Example: My activity VS sequence table looks like
> > (1) 08 ACAG
> > (2) 10 ATTC
> > (3) 05 GGTA
> > (4) 04 CCGT
> > (5) ... ....
> > ....etc
> >
> > My solution would be: to minimize the residual sum of square:
> > (here W is weight of that particular base, which is what im
> > trying to estimate)
> >
> > = [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> > +W4C)]^2 + and so on.
> > ...SNIP...
> --------
> Let’s see if I understand what you are saying, Vivek. Each DNA sequence
> position is to possess four different weight values corresponding to the four
> possible bases, so that with n sequence positions you would have 4*n
> variables to adjust. I am assuming each of the DNA sequences in your set
> has the same length, namely n. You want them to be such as to minimize the
> sum of the squares:
>
> (W1A+W2C+W3A+W4G-8)^2 + (W1A+W2T+W3T+W4C-10)^2 + ...
>
> in accordance with your set of aligned DNA sequences.
>
> If that is what you want, this can indeed be solved using the backslash
> operator with no need for the Optimization Toolbox. (I have no idea what
> you have in mind with the subtraction of 'T' weights to reduce the number of
> parameters, but such a manipulation is not necessary to solve the problem.)
> Your problem has a very definite, unique answer, provided the number of
> different sequences is equal to or greater than four times the number, n, of
> sequence positions.
>
> Write a matrix M that looks like this for your particular example with n = 4:
>
> M = [1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0; % ACAG
>      1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0; % ATTC
>      0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0; % GGTA
>      0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1; % CCGT
>      ......
>
> Here, the binary sequence 1 0 0 0 occurs for A, 0 1 0 0 for C, 0 0 1 0 for G,
> and 0 0 0 1 for T. The column vector of "activities" would be:
>
> A = [8;10;5;4;.....];
>
> Then the equation
>
> W = M\A
>
> would provide the answer you seek, where it is understood that
>
> W = [W1A;W1C;W1G;W1T;
>      W2A;W2C;W2G;W2T;
>      W3A;W3C;W3G;W3T;
>      W4A;W4C;W4G;W4T];
>
> Note that in this case with n = 4, you would need at least 4*4 = 16
> sequences to ensure a unique answer. In case there are more sequences,
> matlab’s backslash operator would provide a least squares answer.
>
> To obtain the M matrix you would undoubtedly want some routine that
> would convert whatever representation you are using for the four bases over
> to the corresponding four-element binary numbers. If the above is actually
> what you need, then I am sure one of us can provide such a routine if you
> state how you are representing these bases in a given set of sequences (i.e.,
> as ASCII characters, numbers from 1 to 4, etc.)
>
> Roger Stafford
>
------------------------
Hi Roger,

Thanks very much for responding to my query and offering to
help me out.

I must say, you have understood my question correctly. I
appreciate your solution to the problem. I've tried that,
taken just binary 1 or 0 representation for each base
instead of 1000, 0100, type. I'm thankful to Walter Roberson,
who gave me that routine.

The most important issue with my result was that my matrix X,
which describes these binary representations, is not full
rank. I think some columns are "collinear (or
multicollinear)". Matrix Y is my activity vector. So I was
trying to use PLSR to avoid this collinearity and to include
the full DNA sequence (about 50 letters long), which was also
not successful due to lack of correlations. I'm still
working on that.
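
A rank deficiency of this kind is what one would expect from the four-columns-per-position layout: within each position the four indicator columns add up to a column of ones, so the blocks are linearly dependent on one another and the rank of such a 4*n-column matrix cannot exceed 3*n + 1, however many sequences are supplied. The sketch below checks this on randomly generated sequences (synthetic data, not the data discussed in this thread; all names are illustrative).

```matlab
% Synthetic check: random base codes 1..4 (A, C, G, T), not real data.
m = 100;  n = 4;
code = randi(4, m, n);

% One-hot design matrix with 4 columns per position.
rows = repmat((1:m)', 1, n);
cols = 4*(repmat(1:n, m, 1) - 1) + code;
M = zeros(m, 4*n);
M(sub2ind(size(M), rows, cols)) = 1;

% Each block of four columns sums to a column of ones ...
blockSums = M * kron(eye(n), ones(4, 1));   % m-by-n, every entry is 1

% ... so the rank can be at most 3*n + 1, never 4*n.
r = rank(M);
fprintf('columns: %d, rank: %d, bound 3n+1: %d\n', 4*n, r, 3*n + 1);
```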

Meanwhile, I was thinking of why not just take all equations
in one residual sum of square equation and solve it for
smaller segments of DNA (not sure if the Optimization Toolbox
would be helpful in that). I was trying to eliminate 'T' so
that I can have fewer variables to handle.
Someone suggested that this may be easy in Mathematica. I
don't understand how that would be. If it is easy, then MATLAB
will be my choice.
Thanks again. Do send your comments.
Regards,
Vivek
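
Eliminating 'T' in the way described in the first post corresponds to what is usually called reference (or dummy) coding: drop the T column of every position and add one constant column. The fitted constant then plays the role of W1T+W2T+...+WnT, and each remaining coefficient is a difference such as W1A-W1T. The following is only a sketch under stated assumptions: the sequences are random, and the activities are noise-free values generated from made-up weights, purely for illustration.

```matlab
% Synthetic data: random sequences and made-up "true" weights.
m = 60;  n = 4;
code = randi(4, m, n);                       % 1..4 for A, C, G, T
rows = repmat((1:m)', 1, n);
cols = 4*(repmat(1:n, m, 1) - 1) + code;
M = zeros(m, 4*n);
M(sub2ind(size(M), rows, cols)) = 1;
wTrue = randn(4*n, 1);
Y = M * wTrue;                               % noise-free activities

% Reference coding: drop every T column, add a constant column.
keep = true(1, 4*n);
keep(4:4:end) = false;
X = [ones(m, 1), M(:, keep)];                % m-by-(3*n + 1)

b = X \ Y;                                   % least-squares fit
% b(1)     estimates W1T + W2T + ... + WnT
% b(2:end) estimate  W1A-W1T, W1C-W1T, W1G-W1T, W2A-W2T, ...
residual = norm(X*b - Y);
```

Only differences relative to T (plus one overall constant) are identifiable from the data, which is why this parameterization has 3*n + 1 unknowns rather than 4*n.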

Subject: Re: Is this kind of regression possible?

From: Roger Stafford

Date: 01 Dec, 2007 04:12:44

Message: 16 of 31

"vicky " <vivek_mutalik@yahoo.com> wrote in message <fiqkd0$a9h
$1@fred.mathworks.com>...
> Hi Roger,
>
> Thanks very much for responding to my query and offering to
> help me out.
>
> I must say, you have understood my question correctly. I
> appreciate your solution to the problem. I've tried that,
> taken just binary 1 or 0 representation for each base
> instead of 1000, 0100, type. I'm thankful to Walter Roberson,
> who gave me that routine.
>
> The most important issue with my result was that my matrix X,
> which describes these binary representations, is not full
> rank. I think some columns are "collinear (or
> multicollinear)". Matrix Y is my activity vector. So I was
> trying to use PLSR to avoid this collinearity and to include
> the full DNA sequence (about 50 letters long), which was also
> not successful due to lack of correlations. I'm still
> working on that.
>
> Meanwhile, I was thinking of why not just take all equations
> in one residual sum of square equation and solve it for
> smaller segments of DNA (not sure if the Optimization Toolbox
> would be helpful in that). I was trying to eliminate 'T' so
> that I can have fewer variables to handle.
> Someone suggested that this may be easy in Mathematica. I
> don't understand how that would be. If it is easy, then MATLAB
> will be my choice.
> Thanks again. Do send your comments.
> Regards,
> Vivek
--------
  First of all, my apologies for the multiplicity of copies of my previous reply; I
think there were seven in all, much to my disgust. I waited for several
minutes to elicit a response from the MathWorks' newsreader and then
clicked the "post message" button again. After many minutes I clicked once
more, but I can't imagine where seven copies came from – unless possibly my
mouse click bounced repeatedly.

  Back to the subject at hand, when you said "taken just binary 1 or 0
representation for each base instead of 1000, 0100, type" you gave me the
impression that you interpreted the sequences like 1 0 0 0 or 0 1 0 0 as
being single binary scalars. That isn't what I meant. These are to be four
distinct elements with each element a 0 or 1. This makes M have 4*n
columns and W have 4*n rows.

  I don't understand what you mean in your final paragraph, "Meanwhile, I was
thinking of why not just take all equations in one residual sum of square
equation and solve it for smaller segments of DNA". I see no particular
reason for taking smaller segments. What is there to gain by that?

Roger Stafford

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 01 Dec, 2007 16:19:41

Message: 17 of 31

"Roger Stafford"
<ellieandrogerxyzzy@mindspring.com.invalid> wrote in
message <fiqmvs$c6o$1@fred.mathworks.com>...
> "vicky " <vivek_mutalik@yahoo.com> wrote in message <fiqkd0$a9h
> $1@fred.mathworks.com>...
> > Hi Roger,
> >
> > Thanks very much for responding to my query and offering to
> > help me out.
> >
> > I must say, you have understood my question correctly. I
> > appreciate your solution to the problem. I've tried that,
> > taken just binary 1 or 0 representation for each base
> > instead of 1000, 0100, type. I'm thankful to Walter Roberson,
> > who gave me that routine.
> >
> > The most important issue with my result was that my matrix X,
> > which describes these binary representations, is not full
> > rank. I think some columns are "collinear (or
> > multicollinear)". Matrix Y is my activity vector. So I was
> > trying to use PLSR to avoid this collinearity and to include
> > the full DNA sequence (about 50 letters long), which was also
> > not successful due to lack of correlations. I'm still
> > working on that.
> >
> > Meanwhile, I was thinking of why not just take all equations
> > in one residual sum of square equation and solve it for
> > smaller segments of DNA (not sure if the Optimization Toolbox
> > would be helpful in that). I was trying to eliminate 'T' so
> > that I can have fewer variables to handle.
> > Someone suggested that this may be easy in Mathematica. I
> > don't understand how that would be. If it is easy, then MATLAB
> > will be my choice.
> > Thanks again. Do send your comments.
> > Regards,
> > Vivek
> --------
> First of all, my apologies for the multiplicity of copies of my previous reply; I
> think there were seven in all, much to my disgust. I waited for several
> minutes to elicit a response from the MathWorks' newsreader and then
> clicked the "post message" button again. After many minutes I clicked once
> more, but I can't imagine where seven copies came from – unless possibly my
> mouse click bounced repeatedly.
>
> Back to the subject at hand, when you said "taken just binary 1 or 0
> representation for each base instead of 1000, 0100, type" you gave me the
> impression that you interpreted the sequences like 1 0 0 0 or 0 1 0 0 as
> being single binary scalars. That isn't what I meant. These are to be four
> distinct elements with each element a 0 or 1. This makes M have 4*n
> columns and W have 4*n rows.
>
> I don't understand what you mean in your final paragraph, "Meanwhile, I was
> thinking of why not just take all equations in one residual sum of square
> equation and solve it for smaller segments of DNA". I see no particular
> reason for taking smaller segments. What is there to gain by that?
>
> Roger Stafford
>-----------------------------------------

Thanks for your reply.
I know what you meant. I've taken distinct zeros and
ones for each element. That is, I generate a 1 or 0 for each of
A, C, G and T. So I have 4*N columns, where N is the
length of the sequences. My matrix looks like:

     A C G T A C G T A C G T A C G T
M = [1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0; % ACAG
     1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0; % ATTC
     0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0; % GGTA
     0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1; % CCGT
     ......
In this case I generate binaries for the full length of 50
letters per row (not just 4). This can only be solved by
PLSR or PCA. But if I use only 4 elements (16 columns), as
in this matrix, then I can use the backslash (regress) function
to solve it, as my total number of rows is more than 40.
So this is a 40*16 matrix for regress.

Meanwhile, to see whether I can solve this "one equation
of residual sum of squares", I take smaller segments:
1. To see whether 'activity' (response) can be explained
by such smaller segments (rather than the full 50 elements).
2. If I take more variables (elements, columns) than
responses (activities), then I'll have multiple solutions,
since in that case rows < columns.

The whole idea is to get "weights" per element which should
describe my activity. Do you have any alternative suggestions
for this, other than what I'm trying to do?


Subject: Re: Is this kind of regression possible?

From: Roger Stafford

Date: 01 Dec, 2007 19:23:54

Message: 18 of 31

"vicky " <vivek_mutalik@yahoo.com> wrote in message <fis1it$ldn
$1@fred.mathworks.com>...
> Thanks for your reply.
> I know what you meant. I've taken distinct zeros and
>...SNIP...
-----------
  If I understand what you are trying to accomplish, Vivek, then in my opinion
it makes no sense for the number of DNA sequences in the analysis to be less
than the number of parameters, which is to say, four times the length of
those sequences. To attempt to do otherwise is to obtain meaningless
results, no matter what method is used, since there would be too many
parameters available for adjustment for the given amount of information
furnished by the observed "activities" (responses.)

  To be using sequences with 50 DNA bases would provide 200 adjustable
weight parameters of the kind you have described, and accordingly you ought
to have at least 200 different sequences present in your analysis to give
significance to the results.

  With an adequate number of sequences, matlab's backslash method is
designed to solve just that kind of problem, since it is really a standard
problem in regression when expressed in terms of the binary matrix we have
discussed.

  This binary matrix was not an arbitrary concoction on my part. It is dictated
by the least squares problem you posed; there is really no choice in the
matter.

Roger Stafford

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 01 Dec, 2007 19:49:45

Message: 19 of 31

> With an adequate number of sequences, matlab's backslash method is
> designed to solve just that kind of problem, since it is really a standard
> problem in regression when expressed in terms of the binary matrix we have
> discussed.

Yes, backslash requires an adequate number of sequences to
get a meaningful result. However, principal component or
partial least squares regression will handle any kind of
matrix, since they project the data onto orthogonal
latent variables.
Someone suggested to me that this problem can 'also be solved if I
take smaller segments of DNA' and try to get a single
equation for 'minimizing residual sum of squares'. No
success yet. Thanks for your reply.

Subject: Re: Is this kind of regression possible?

From: Bruno Luong

Date: 01 Dec, 2007 21:24:23

Message: 20 of 31

I'm not following this thread very closely (reading it diagonally,
it seems to be more a problem of formalism than of
mathematics or MATLAB).

But in the case of an underdetermined system (more unknowns than
data), using the backslash operator is rather risky. In
fact, the solution returned by "\" will contain a certain
number of zeros (number of unknowns minus matrix rank), and
the positions of these zeros are set rather arbitrarily.

In the case of an underdetermined linear system, using pinv() is
a better choice, because W=pinv(M)*Y is well defined. It's
the unique solution that satisfies the two following conditions:

(i) M*W = Y (assuming the equations are consistent)
(ii) |W| is minimum among all W that satisfy (i).
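
To make the difference concrete, the following sketch applies both approaches to the four example sequences from the start of the thread, where there are 16 unknowns but only 4 equations. Both results reproduce the activities; they differ in which of the infinitely many solutions is returned. Only the four example rows are used, so this illustrates the mechanics and is not a meaningful fit.

```matlab
% The four example rows (ACAG, ATTC, GGTA, CCGT) and their activities.
M = [1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0;
     1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0;
     0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0;
     0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1];
Y = [8; 10; 5; 4];

Wb = M \ Y;          % a "basic" solution: most entries are zero
Wp = pinv(M) * Y;    % the minimum-norm solution

fprintf('nonzeros:  backslash %d, pinv %d\n', nnz(Wb), nnz(abs(Wp) > 1e-12));
fprintf('norms:     backslash %.4f, pinv %.4f\n', norm(Wb), norm(Wp));
fprintf('residuals: backslash %.2e, pinv %.2e\n', norm(M*Wb - Y), norm(M*Wp - Y));
```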

Bruno



Subject: Re: Is this kind of regression possible?

From: Bruno Luong

Date: 01 Dec, 2007 21:29:13

Message: 21 of 31

Ohhh, I have a minor request. Can I ask participants to
quote text only when it's necessary and, if quoting
is desired, to quote only the relevant part? Otherwise
it's very hard to read. It just encourages me to skip reading
such threads.

Bruno

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 02 Dec, 2007 17:11:32

Message: 22 of 31

Thanks for responding. I'll try your suggestion.
I agree with you regarding replies with no quotes.

Thanks again
Vivek

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 02 Jan, 2008 07:00:10

Message: 23 of 31

 
For the following equation:

 [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
 +W4C)]^2 + and so on.

Can I give initial guesses for W1A, W2C, W2T, W3A, W3T, W4C and
W4G and solve this equation to obtain their values by
minimizing the residual sum of squares?

For example, using 'lsqnonlin' (nonlinear least squares
function) I can get their values. Is there any similar
function for linear regression?

Thanks.

Subject: Re: Is this kind of regression possible?

From: Bruno Luong

Date: 02 Jan, 2008 22:59:29

Message: 24 of 31

"vicky " <vivek_mutalik@yahoo.com> wrote in message
<flfcpq$nf8$1@fred.mathworks.com>...
>
> For the following equation:
>
> [8 - (W1A + W2C + W3A + W4G)]^2 + [10 - (W1A + W2T + W3T
> +W4C)]^2 + and so on.
>
> Can I give initial guesses for W1A, W2C, W2T, W3A, W3T, W4C and
> W4G and solve this equation to obtain their values by
> minimizing the residual sum of squares?
>

Yes! As I wrote above:

[ In the case of an underdetermined linear system, using pinv()
is a better choice, because W=pinv(M)*Y is well defined.
It's the unique solution that satisfies the two following
conditions:

(i) M*W = Y
(ii) |W| is minimum among all W that satisfy (i). ]

A gradient minimization algorithm (without preconditioning),
when fully converged, provides the solution (among the set
of solutions) that minimizes the L2 norm of the distance to the
first guess.

The algebraic solution W=pinv(M)*Y is what one obtains with a
gradient algorithm starting from 0 as the first guess.

If you want the algebraic equivalent of a
minimization from a first guess W0, do the following:

W = W0 + pinv(M)*(Y-M*W0)
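
The relation between this closed-form expression and a gradient iteration can be checked numerically. The sketch below runs plain gradient descent on the sum of squared residuals from a first guess W0 and compares the result with the formula above. The four example rows of M from earlier in the thread are reused; the first guess, the step size 1/norm(M)^2 and the iteration count are arbitrary choices made for this illustration.

```matlab
% Example rows ACAG, ATTC, GGTA, CCGT and their activities.
M = [1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0;
     1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0;
     0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0;
     0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1];
Y = [8; 10; 5; 4];
W0 = (1:16)' / 16;                   % arbitrary first guess

% Direct formula.
Wdirect = W0 + pinv(M) * (Y - M*W0);

% Plain gradient descent on 0.5*norm(M*W - Y)^2, starting at W0.
t = 1 / norm(M)^2;                   % any step below 2/norm(M)^2 converges
W = W0;
for k = 1:500
    W = W - t * (M' * (M*W - Y));    % gradient is M'*(M*W - Y)
end

fprintf('difference between the two: %.2e\n', norm(W - Wdirect));
```

The gradient M'*(M*W - Y) always lies in the row space of M, so the iteration never changes the part of W0 that lies in the null space of M; that is why both results stay as close to W0 as possible.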

Bruno

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 02 Jan, 2008 23:40:32

Message: 25 of 31

"Bruno Luong"
> The algebraic solution W=pinv(M)*Y is what one obtains with a
> gradient algorithm starting from 0 as the first guess.
>
> If you want the algebraic equivalent of a
> minimization from a first guess W0, do the following:
>
> W = W0 + pinv(M)*(Y-M*W0)
>
>

What's 'M' here? W0 is the initial guess for the nucleotide
weights; Y is my response (activity).


Subject: Re: Is this kind of regression possible?

From: Bruno Luong

Date: 03 Jan, 2008 00:03:36

Message: 26 of 31

"vicky " <vivek_mutalik@yahoo.com> wrote in message
<flh7dg$8dc$1@fred.mathworks.com>...

> What's 'M' here?

I simply follow the notation *you* have used: M is the model
matrix.

Bruno

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 03 Jan, 2008 00:42:48

Message: 27 of 31

Sorry Bruno,

In this problem I'm not using the matrix M with binaries.

I have only Y (activity) and the W's. If I give guessed values to
all W's then I can use 'lsqnonlin' to obtain new coefficient
values (W's) with respect to their activity Y for a nonlinear
least-squares problem. I was just wondering whether there is a method
to do the same for a linear least-squares problem.
For a detailed problem description, please refer to my first
post at the top of this page. Thanks,

Subject: Re: Is this kind of regression possible?

From: Bruno Luong

Date: 03 Jan, 2008 07:45:26

Message: 28 of 31

"vicky " <vivek_mutalik@yahoo.com> wrote in message
<flhb28$1n7$1@fred.mathworks.com>...
>
> I have only Y (activity) and the W's. If I give guessed values to
> all W's then I can use 'lsqnonlin' to obtain new coefficient
> values (W's) with respect to their activity Y for a nonlinear
> least-squares problem. I was just wondering whether there is a method
> to do the same for a linear least-squares problem.

'lsqnonlin' is an *iterative solver* that can be used for
non-linear AND linear least-squares problems.

The algebraic formula I gave is a *direct method* for linear
least squares.

Both are able to take into account the first guess.
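
As a sketch of the first point, lsqnonlin can be handed a residual function that happens to be linear in the unknowns. This requires the Optimization Toolbox. The two-letter sequences and activities below are invented for illustration, and T is taken as the reference base (an assumption of this example) so that the design matrix has full column rank and the least-squares solution is unique.

```matlab
% Invented example: 8 two-letter sequences. Columns of X are
% [constant, A1, C1, G1, A2, C2, G2]; T is the reference base.
X = [1 0 0 0 0 0 0;    % TT
     1 1 0 0 0 0 0;    % AT
     1 0 1 0 0 0 0;    % CT
     1 0 0 1 0 0 0;    % GT
     1 0 0 0 1 0 0;    % TA
     1 0 0 0 0 1 0;    % TC
     1 0 0 0 0 0 1;    % TG
     1 1 0 0 0 1 0];   % AC
Y = [4; 8; 5; 6; 3; 7; 4; 10];                 % invented activities

W0 = zeros(7, 1);                              % first guess
opts = optimset('Display', 'off');
Wnl = lsqnonlin(@(W) X*W - Y, W0, [], [], opts);   % iterative solver
Wls = X \ Y;                                   % direct solution

fprintf('max difference: %.2e\n', max(abs(Wnl - Wls)));
```

Because the solution is unique here, the first guess affects only how many iterations are needed, not the answer.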

What is missing for you?

Bruno

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 03 Jan, 2008 17:55:58

Message: 29 of 31

Thanks.
> 'lsqnonlin' is an *iterative solver* that can be used for
> non-linear AND linear least-squares problems.

I didn't know this. The documentation doesn't mention it.
Thanks.
 
> What is missing for you?
I take it you're asking me what the problem is?
I'm trying to obtain weights for each nucleotide at each
position for the measured activity. I can use pinv or backslash,
as discussed above, if I use binary notation to describe
my set of sequences (matrix M). Another way could be to
minimize the residual sum of squares starting from initial guessed
values for each weight. Not sure I'm answering your question.
Thanks again.

Subject: Re: Is this kind of regression possible?

From: vicky

Date: 04 Jan, 2008 07:44:57

Message: 30 of 31

I tried your suggestion with pinv. Yes, it works in my case.
Thanks again.

Are you sure lsqnonlin can also be used for linear least-squares
problems?


Subject: Re: Is this kind of regression possible?

From: Bruno Luong

Date: 04 Jan, 2008 08:08:53

Message: 31 of 31