Thread Subject:
Encoding DNA sequence into binary

Subject: Encoding DNA sequence into binary

From: VIVEK

Date: 12 Nov, 2007 20:30:13

Message: 1 of 9

I am trying to encode DNA sequences into binary form. Please
share if anyone has a good solution.

These are a list of aligned DNA sequences.

Example: position 123456....
(sequence 1)      ACCTGA.....
(sequence 2)      ACGAGC....

Each position has 4 options: A, C, G, T. If a particular letter
is present, the value is 1; otherwise it is 0. For example, the
output looks like:
position     1A 1C 1G 1T 2A 2C 2G 2T 3A 3C 3G 3T ....
(sequence 1)  1  0  0  0  0  1  0  0  0  1  0  0 ....
(sequence 2)  1  0  0  0  0  1  0  0  0  0  1  0 ....

At the end, I'd like to delete columns that contain only ones or
only zeros.

Thanks.

Subject: Encoding DNA sequence into binary

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 12 Nov, 2007 21:51:53

Message: 2 of 9

In article <fhad4l$n89$1@fred.mathworks.com>,
VIVEK <vivek_mutalik@yahoo.com> wrote:
>I am trying to encode DNA sequences into binary form. Please
>share if anyone has a good solution.

>These are a list of aligned DNA sequences.

>Example: position 123456....
>(sequence 1)      ACCTGA.....
>(sequence 2)      ACGAGC....

>Each position has 4 options: A, C, G, T. If a particular letter
>is present, the value is 1; otherwise it is 0. For example, the
>output looks like:
>position     1A 1C 1G 1T 2A 2C 2G 2T 3A 3C 3G 3T ....
>(sequence 1)  1  0  0  0  0  1  0  0  0  1  0  0 ....
>(sequence 2)  1  0  0  0  0  1  0  0  0  0  1  0 ....

>At the end, I'd like to delete columns that contain only ones or
>only zeros.

The last bit doesn't make sense to me. If you delete the columns
with only one's or zero's, then you are more or less generating

  sequence1 ~= sequence2

except that each remaining position will have 1 0 if the
letter from the first sequence is before the letter of the second
sequence, and will have 0 1 if the letter from the second sequence is
before the letter of the first sequence. Indeed, you could code,

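A = sequence1 < sequence2;   % 1 where the first sequence's letter sorts earlier
B = sequence2 < sequence1;   % 1 where the second sequence's letter sorts earlier
C = [A;B];                   % 2 x N, one row per comparison
D = C(:,xor(A,B));           % keep only the positions where the sequences differ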

I can't say that I see the value of this representation.
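As a minimal sketch of the comparison idea above, here it is applied to the two sample sequences from the question. The variable name `differs` is introduced only for illustration, and the columns of the 2 x N matrix are selected with `xor` in lowercase, since MATLAB function names are case-sensitive.

```matlab
% Hypothetical inputs: the two sample sequences from the question.
sequence1 = 'ACCTGA';
sequence2 = 'ACGAGC';

A = sequence1 < sequence2;     % compares character codes: [0 0 1 0 0 1]
B = sequence2 < sequence1;     % [0 0 0 1 0 0]
C = [A; B];                    % 2 x 6 logical, one row per comparison
differs = xor(A, B);           % same mask as sequence1 ~= sequence2
D = C(:, differs);             % columns for positions 3, 4 and 6
disp(double(D));
```

With two sequences, the mask `differs` carries the same information as `sequence1 ~= sequence2`; `D` only adds which of the two letters sorts first.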
--
We regret to announce that sub-millibarn resolution bio-hyperdimensional
plasmatic space polyimaging has been delayed until the release
of Windows Vista SP2.

Subject: Encoding DNA sequence into binary

From: VIVEK

Date: 12 Nov, 2007 23:18:21

Message: 3 of 9

Thanks for your reply. I get your point. Sorry for giving
such a lousy example. The example I gave is purely
hypothetical.
I have 55-letter strings, so each sequence will have 55*4 =
220 columns of 0's and 1's. Primarily, I'm
processing this information as input to machine learning
algorithms, especially regression techniques.
   



Subject: Encoding DNA sequence into binary

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 13 Nov, 2007 00:01:50

Message: 4 of 9

In article <fhamvt$1bf$1@fred.mathworks.com>,
VIVEK <vivek_mutalik@yahoo.com> top-posted:

Please do not post your reply above the material you are commenting
on: it makes it more difficult to hold a conversation.

>roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
>message <fhahto$6q$1@canopus.cc.umanitoba.ca>...
>> In article <fhad4l$n89$1@fred.mathworks.com>,
>> VIVEK <vivek_mutalik@yahoo.com> wrote:
>> >I am trying to encode DNA sequences into binary form.

>> >Each position has 4 options: A, C, G, T. If a particular letter
>> >is present, the value is 1; otherwise it is 0. For example, the
>> >output looks like:
>> >position     1A 1C 1G 1T 2A 2C 2G 2T 3A 3C 3G 3T ....
>> >(sequence 1)  1  0  0  0  0  1  0  0  0  1  0  0 ....
>> >(sequence 2)  1  0  0  0  0  1  0  0  0  0  1  0 ....

>> >At the end, I'd like to delete columns that contain only ones or
>> >only zeros.

>> The last bit doesn't make sense to me. If you delete the columns
>> with only one's or zero's, then you are more or less generating
>> sequence1 ~= sequence2

>> Indeed, you could code,
>> A = sequence1 < sequence2;
>> B = sequence2 < sequence1;
>> C = [A;B];
>> D = C(:,xor(A,B));


>I have 55-letter strings, so each sequence will have 55*4 =
>220 columns of 0's and 1's.

Ummm, are you perchance saying that there will be one
row -per sequence- ?

Let sequences() be a char array, N x 55, so
sequences(K,:) is the string corresponding to the K'th sequence.

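% One logical plane per letter: t is N x 4 x 55 (sequence, letter, position)
t = false(size(sequences,1),4,size(sequences,2));
t(:,1,:) = sequences == 'A';
t(:,2,:) = sequences == 'C';
t(:,3,:) = sequences == 'G';
t(:,4,:) = sequences == 'T';
% Flatten to N x (4*55); columns are ordered 1A 1C 1G 1T 2A 2C ...
u = reshape(t,size(sequences,1),4*size(sequences,2));
% Keep only the columns that are neither all 1's nor all 0's
tokeep = ~all(u) & ~all(~u);
final = u(:,tokeep);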

(The above code has been tested.)
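To make the result concrete, here is a small worked sketch that runs the same approach on the two sample sequences from the question. The names `N`, `L`, and `letters` are illustrative additions; the loop and the explicit `reshape` on the right-hand side are stylistic choices of this sketch, and `all(..., 1)` is used so the column-wise reduction is explicit.

```matlab
% Worked example using the two sample sequences from the original question.
sequences = ['ACCTGA'; 'ACGAGC'];        % 2 x 6 char array, one row per sequence

[N, L] = size(sequences);
letters = 'ACGT';                        % column order within each position
t = false(N, 4, L);                      % N x 4 x L: sequence, letter, position
for k = 1:4
    % N x L comparison result placed into the k-th letter plane
    t(:, k, :) = reshape(sequences == letters(k), N, 1, L);
end
u = reshape(t, N, 4*L);                  % columns ordered 1A 1C 1G 1T 2A 2C ...

tokeep = ~all(u, 1) & ~all(~u, 1);       % drop columns that are constant
final  = u(:, tokeep);

disp(double(u(:, 1:12)));                % first three positions, as in the question
disp(find(tokeep));                      % indices of the surviving columns
```

The first twelve columns of `u` reproduce the table in the question. For these two sequences the surviving columns are 10, 11, 13, 16, 21 and 22, that is 3C, 3G, 4A, 4T, 6A and 6C.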
--
   "History is a pile of debris" -- Laurie Anderson

Subject: Encoding DNA sequence into binary

From: VIVEK

Date: 13 Nov, 2007 07:22:18

Message: 5 of 9

(Sorry for putting my comments on top of the messages.)
Thanks for the code.
Unfortunately, I'm getting this error on the line
t(:,1,:) = sequences == 'A';

??? Undefined function or method 'eq' for input arguments of
type 'cell'.
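The thread does not say how this error was resolved. One plausible cause, stated here as an assumption, is that the sequences were stored in a cell array of strings rather than in a char matrix. A minimal sketch of the conversion, using the hypothetical name `seqCell` for the cell array:

```matlab
% Assumption: the sequences were read into a cell array of equal-length strings.
seqCell = {'ACCTGA'; 'ACGAGC'};

% seqCell == 'A' fails, because == (eq) is not defined for cell arrays.
sequences = char(seqCell);     % 2 x 6 char matrix, one row per sequence

isA = sequences == 'A';        % now an ordinary elementwise comparison
disp(isA);
```

Note that `char` pads shorter strings with trailing spaces, so this only gives a meaningful alignment when all sequences already have the same length.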

Subject: Encoding DNA sequence into binary

From: VIVEK

Date: 13 Nov, 2007 07:52:17

Message: 6 of 9

Ignore my earlier post...this works like magic !!

I didn't understand this line of code:
tokeep = ~all(u) & ~all(~u);

THANKS!!

Subject: Encoding DNA sequence into binary

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 13 Nov, 2007 17:13:28

Message: 7 of 9

In article <fhbl3h$iho$1@fred.mathworks.com>,
VIVEK <vivek_mutalik@yahoo.com> wrote:
>roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
>message <fhaphe$8tt$1@canopus.cc.umanitoba.ca>...
>> In article <fhamvt$1bf$1@fred.mathworks.com>,
>> VIVEK <vivek_mutalik@yahoo.com> top-posted:


>> >> >at the end, i'd like to delete columns with only one's or
>> >> >zero's.

>> u = reshape(t,size(sequences,1),4*size(sequences,2));
>> tokeep = ~all(u) & ~all(~u);
>> final = u(:,tokeep);

>I didn't understand this line of code:
>tokeep = ~all(u) & ~all(~u);

That and the next line are what implement the trailing requirement
to delete columns with only one's or zero's.
tokeep is finding the columns to keep, which are the columns
that do -not- contain all 1's or all 0's.

all(u) runs down the columns of u checking to see if all of the
values are true (all 1's.) Any such column is one we do -not-
want to keep, so ~all(u) is selecting -against- columns of all-1's.
Similarly, ~all(~u) is selecting against columns of all-0's.

This code is equivalent and might be slightly more obvious:

tokeep = ~(all(u) | all(~u))
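A small sketch to check the equivalence on a toy matrix (the matrix and the names `keepA`, `keepB`, `single`, and `keepRow` are made up for illustration). It also shows one caveat that is an addition to the thread: when `u` has only one row, `all(u)` reduces along that row instead of down the columns, so passing the dimension explicitly is safer.

```matlab
% Toy matrix: column 1 is all ones, column 2 is all zeros, columns 3-4 are mixed.
u = logical([1 0 1 0; 1 0 0 1; 1 0 1 1]);

keepA = ~all(u) & ~all(~u);          % form used in the original code
keepB = ~(all(u) | all(~u));         % De Morgan equivalent
disp([keepA; keepB]);                % both rows are 0 0 1 1

% With a single sequence, all(u) collapses the whole row to one scalar,
% so state the dimension explicitly to keep the test column-wise.
single  = logical([1 0 0 0]);
keepRow = ~(all(single, 1) | all(~single, 1));
disp(keepRow);                       % every column is constant, so all false
```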

--
   "No one has the right to destroy another person's belief by
   demanding empirical evidence." -- Ann Landers

Subject: Encoding DNA sequence into binary

From: Alan Addison

Date: 23 Dec, 2007 01:46:23

Message: 8 of 9

When combining binary code, I take two of one and one of
another, and add them as if they are in base 4. Once in base 4,
I convert back to binary. Does that make any sense?
Let me know...

Subject: Encoding DNA sequence into binary

From: dmitsinikos@gmail.com

Date: 16 Nov, 2013 11:10:31

Message: 9 of 9

The best way is to convert each nucleotide to 2 binary positions. I actually think the genetic code is written in a binary digital format: http://dna.mitsinikos.net
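For comparison, here is a minimal sketch of a two-bits-per-nucleotide encoding. The specific mapping (A=00, C=01, G=10, T=11) is an assumption chosen for illustration; the post does not specify one. This compact form differs from the four-indicator-columns encoding requested in the original question.

```matlab
% Assumed mapping, not taken from the thread: A=00, C=01, G=10, T=11.
seq = 'ACCTGA';

[tf, idx] = ismember(seq, 'ACGT');   % idx is 1..4 for A, C, G, T
assert(all(tf));                     % fail loudly on any other character

codes = idx - 1;                     % 0..3
bits  = [floor(codes / 2); mod(codes, 2)];   % 2 x L: high bit over low bit
bits  = reshape(bits, 1, []);        % interleave into one row of length 2*L
disp(bits);
```

A design note: the two-bit code imposes an arbitrary numeric ordering on the letters, which may matter for regression inputs, whereas the indicator encoding treats the four letters symmetrically.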
