Thread Subject:
word occurence counting (DNA)

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 31 Mar, 2010 21:36:20

Message: 1 of 22

Hello,
How can I calculate the frequency of appearance of each word of length K while walking through a DNA sequence? I want to find the frequency at each step of the walk.

Subject: word occurence counting (DNA)

From: us

Date: 31 Mar, 2010 21:45:20

Message: 2 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hp0f8k$snf$1@fred.mathworks.com>...
> Hello,
> how can i calculate the frequencies of appearance of the words with K-length walking through a DNA sequence, I want to find the frequency for each path i walk

what have YOU done so far to solve YOUR particular problem...

us

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 2 Apr, 2010 15:15:26

Message: 3 of 22

I used the FCGR toolbox to calculate the frequency of appearance of all the words of length K, and I obtained a final matrix of frequencies. What I want is the frequency of each word as it is read during the calculation.

Subject: word occurence counting (DNA)

From: us

Date: 2 Apr, 2010 15:30:25

Message: 4 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hp51me$adm$1@fred.mathworks.com>...
> I used FCGR toolbox to calculate the frequency of appearance of all the words with K length , I obtained a final matrix of frequencies.The purpose is to find the frequency of every word reads during the process of calculation.

but then - have you not already solved the problem(?)...
otherwise, show a small input/output example...

us

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 2 Apr, 2010 15:45:43

Message: 5 of 22

no not yet

Subject: word occurence counting (DNA)

From: us

Date: 2 Apr, 2010 15:54:20

Message: 6 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hp53f7$97n$1@fred.mathworks.com>...
> no not yet

...and the example(?)...

us

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 2 Apr, 2010 16:00:23

Message: 7 of 22

The FCGR toolbox by Jesús Mena-Chalco includes an example.

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 2 Apr, 2010 16:10:31

Message: 8 of 22

If I have the sequence AACCGTTAACGT and I want to find the frequencies of all the dinucleotides (words of 2 letters, for example): at position 1 we read AA, so the frequency of appearance is freq=1/12; at the second position we read AC and freq=1/12; at the eighth position AA appears for the second time, so freq=2/12. The calculation stops at position N-len+1 (N: length of the sequence, len: length of the word).

Subject: word occurence counting (DNA)

From: Roger Stafford

Date: 2 Apr, 2010 18:06:06

Message: 9 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hp54tn$2ii$1@fred.mathworks.com>...
> if i have the sequence: AACCGTTAACGT, and i want to find the frequencies of all the dinucleotides (word of 2 letters for example), at the position 1 we read AA so the frequency of appearence is freq=1/12 , at the second position w read AC and freq=1/12, at the eighth position AA appeared fo the second time so freq=2/12 ,the calcul is stopped at the position N-len+1 (N:length of the sequence, len: length of word)
--------------
  Here's an outline of how you might go about it. Let v be a vector of length N with the nucleotide sequence - I am assuming they are represented by four numbers in this discussion - and k be the desired word length.

1) Create an (N-k+1)-by-k matrix, M, containing the successive length-k words. You can use the 'hankel' function for this purpose.

2) Apply [B,m,n] = unique(M,'rows') to M. B will be a table of all the words appearing in the sequence, in sorted order, and n gives the row of B that matches each row of M.

3) Apply 'histc' to the vector n, with bin edges 1:size(B,1), to obtain the counts of B words in the sequence.

4) From the counts you can obtain the frequencies.

  Can you take it from there?
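The steps above can be sketched as follows. This is only one possible reading of the outline: it assumes the sequence is given as a char row vector, converts it to character codes rather than to four numbers, uses accumarray where the outline suggests histc, and divides by the number of words read (N-k+1) rather than by N. The variable names counts, freq and words are illustrative.

```matlab
s = 'AACCGTTAACGT';
k = 2;                                     % word length

v = double(s);                             % nucleotides as numbers (character codes)

% Step 1: (N-k+1)-by-k matrix whose rows are the successive words
M = hankel(v(1:end-k+1), v(end-k+1:end));

% Step 2: B holds the distinct words in sorted order,
%         n maps every row of M to its row in B
[B, m, n] = unique(M, 'rows');

% Step 3: number of occurrences of each row of B
counts = accumarray(n(:), 1);

% Step 4: relative frequencies (assumption: normalize by the number of words)
freq = counts / size(M, 1);

words = char(B);                           % distinct words back as text
disp(words);
disp([counts freq]);
```

With k = 2 the rows of words should match the dinucleotides that occur in the sequence, and freq sums to 1.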

Roger Stafford

Subject: word occurence counting (DNA)

From: us

Date: 2 Apr, 2010 18:25:24

Message: 10 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hp54tn$2ii$1@fred.mathworks.com>...
> if i have the sequence: AACCGTTAACGT, and i want to find the frequencies of all the dinucleotides (word of 2 letters for example), at the position 1 we read AA so the frequency of appearence is freq=1/12 , at the second position w read AC and freq=1/12, at the eighth position AA appeared fo the second time so freq=2/12 ,the calcul is stopped at the position N-len+1 (N:length of the sequence, len: length of word)

one of the many solutions

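% the data
     s='AACCGTTAACGT';
     wl=2;                              % word length
% the engine
     rpat=sprintf('\\S{%d,%d}',wl,wl);  % pattern: exactly wl non-space characters
     t=cell(wl,1);
     for i=1:wl
          % non-overlapping matches starting at offset i;
          % offsets 1..wl together cover every start position
          t{i,1}=regexp(s(i:end),rpat,'match').';
     end
     t=cat(1,t{:});                     % all words in one cell column
     [tu,ix,ix]=unique(t);              % tu: distinct words, ix: index of each word into tu
     n=histc(ix,1:max(ix));             % occurrences of each distinct word
     r=[tu,num2cell(n)];
% the result
     disp(s);
     disp(r);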
%{
% S =
     AACCGTTAACGT
% R =
     'AA' [2]
     'AC' [2]
     'CC' [1]
     'CG' [2]
     'GT' [2]
     'TA' [1]
     'TT' [1]
%}

us

Subject: word occurence counting (DNA)

From: Ashish Uthama

Date: 2 Apr, 2010 18:50:05

Message: 11 of 22

On Fri, 02 Apr 2010 13:10:31 -0300, ambrosia nightwish
<mess_imen@yahoo.fr> wrote:

> if i have the sequence: AACCGTTAACGT, and i want to find the frequencies
> of all the dinucleotides (word of 2 letters for example), at the
> position 1 we read AA so the frequency of appearence is freq=1/12 , at
> the second position w read AC and freq=1/12, at the eighth position AA
> appeared fo the second time so freq=2/12 ,the calcul is stopped at the
> position N-len+1 (N:length of the sequence, len: length of word)


   s='AACCGTTAACGT';

   wLen=2;

   %associative array, hash, lookup table ...(please see help)
   countMap = containers.Map();

   for indx=1: length(s)-wLen+1

       curWord = s(indx:indx+wLen-1);

       if(isKey(countMap,curWord))
           %we have seen this, increment count
           countMap(curWord)=countMap(curWord)+1;
       else
           countMap(curWord)=1;
       end

   end

   words = countMap.keys;


   frequency = countMap.values;
   %Convert to an array
   frequency = [frequency{:}];

   prob = frequency./sum(frequency)
 

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 3 Apr, 2010 21:36:21

Message: 12 of 22

The problem still exists. The first solution shows the number of counted words and gives only a final result. What I want is the number of appearances of each word at every step I walk (advance by 1 and read a word of length wl). Let us take the same example, s='AACCGTTAACGT', for the words:
AAC: n=1
ACC: n=1
CCG: n=1
CGT: n=1
GTT: n=1
TTA: n=1
TAA: n=1
AAC: n=2
ACG: n=1
CGT: n=2
As for the second solution, the containers.Map function does not exist in the MATLAB version that I have.
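Where containers.Map is unavailable, a struct with dynamic field names can play a similar role. The following is a minimal sketch, not taken from any of the posted solutions. It assumes every word consists only of letters such as A, C, G, T, so that each word is a valid field name. The names seen and running are illustrative.

```matlab
s  = 'AACCGTTAACGT';
wl = 3;                                % word length

nWords  = length(s) - wl + 1;          % number of words read
seen    = struct();                    % field name = word, value = count so far
running = zeros(nWords, 1);            % count of each word at the step it is read

for p = 1:nWords
    w = s(p:p+wl-1);                   % word read at position p
    if isfield(seen, w)
        seen.(w) = seen.(w) + 1;       % seen before: increment
    else
        seen.(w) = 1;                  % first occurrence
    end
    running(p) = seen.(w);
end

disp(running.');
```

Dividing running by nWords (or by length(s), depending on the convention wanted) would turn the running counts into running frequencies.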

Subject: word occurence counting (DNA)

From: Bruno Luong

Date: 4 Apr, 2010 10:33:05

Message: 13 of 22

Something like this?

s = 'AACCGTTAACGT';
k = 3;

d = double(s);
A = hankel(d(1:end-k+1),d(end-k+1:end));
[u i j] = unique(A,'rows');
b = zeros(length(i),1);
c = zeros(size(j));
for n=1:length(j)
    jn = j(n);
    b(jn) = b(jn)+1;
    c(n) = b(jn);
end

S = char(A)
c

% Bruno

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 4 Apr, 2010 11:11:05

Message: 14 of 22

That's working Bruno, thank you all

Subject: word occurence counting (DNA)

From: Bruno Luong

Date: 4 Apr, 2010 13:38:05

Message: 15 of 22

% Here is a vectorized version (not necessarily faster)
% http://www.mathworks.com/matlabcentral/fileexchange/24255

s = 'AACCGTTAACGT';
k = 3;

d = double(s);
A = hankel(d(1:end-k+1),d(end-k+1:end));
[u i j] = unique(A,'rows');
[js is]=sort(j);
clear c
c(is) = cell2mat(SplitVec(js,[],@(x) (1:length(x))')) % SplitVec on FEX

% Bruno
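If the File Exchange function SplitVec is not at hand, the same running counts can probably be obtained with built-in functions only. The sketch below is an assumption-laden alternative, not the posted method: it relies on sort keeping equal elements in their original order, and the names first, start, grp and idx are illustrative.

```matlab
s = 'AACCGTTAACGT';
k = 3;

d = double(s);
A = hankel(d(1:end-k+1), d(end-k+1:end));   % one word per row
[u, i, j] = unique(A, 'rows');              % j: word index for each position
j = j(:);

[js, is] = sort(j);                 % group identical words together
first = [true; diff(js) ~= 0];      % true at the start of each group
idx   = (1:numel(js)).';
start = idx(first);                 % sorted position where each group starts
grp   = cumsum(first);              % group number of each sorted element

c = zeros(size(j));
c(is) = idx - start(grp) + 1;       % rank inside the group, mapped back to sequence order

disp(c.');
```

Because equal elements stay in their original order after sorting, the rank of a word inside its group equals the number of times that word has been read so far.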

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 13 Apr, 2010 22:09:04

Message: 16 of 22

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hp9pt1$go1$1@fred.mathworks.com>...
> Something like this?
>
> s = 'AACCGTTAACGT';
> k = 3;
>
> d = double(s);
> A = hankel(d(1:end-k+1),d(end-k+1:end));
> [u i j] = unique(A,'rows');
> b = zeros(length(i),1);
> c = zeros(size(j));
> for n=1:length(j)
> jn = j(n);
> b(jn) = b(jn)+1;
> c(n) = b(jn);
> end
>
> S = char(A)
> c
>
> % Bruno

How to convert S into a vector??

Subject: word occurence counting (DNA)

From: us

Date: 13 Apr, 2010 22:45:07

Message: 17 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hq2q20$aqn$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hp9pt1$go1$1@fred.mathworks.com>...
> > Something like this?
> >
> > s = 'AACCGTTAACGT';
> > k = 3;
> >
> > d = double(s);
> > A = hankel(d(1:end-k+1),d(end-k+1:end));
> > [u i j] = unique(A,'rows');
> > b = zeros(length(i),1);
> > c = zeros(size(j));
> > for n=1:length(j)
> > jn = j(n);
> > b(jn) = b(jn)+1;
> > c(n) = b(jn);
> > end
> >
> > S = char(A)
> > c
> >
> > % Bruno
>
> How to convert S into a vector??

what kind of ...vector... do you mean(?)...

us

Subject: word occurence counting (DNA)

From: dpb

Date: 13 Apr, 2010 22:48:47

Message: 18 of 22

ambrosia nightwish wrote:
...

> How to convert S into a vector??

You mean like

S(:)' % ?

--

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 14 Apr, 2010 00:25:08

Message: 19 of 22

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hp9pt1$go1$1@fred.mathworks.com>...
> Something like this?
>
> s = 'AACCGTTAACGT';
> k = 3;
>
> d = double(s);
> A = hankel(d(1:end-k+1),d(end-k+1:end));
> [u i j] = unique(A,'rows');
> b = zeros(length(i),1);
> c = zeros(size(j));
> for n=1:length(j)
> jn = j(n);
> b(jn) = b(jn)+1;
> c(n) = b(jn);
> end
>
> S = char(A)
> c
>
> % Bruno

How to convert S into a vector??

Subject: word occurence counting (DNA)

From: us

Date: 14 Apr, 2010 07:59:06

Message: 20 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hq3213$eua$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <hp9pt1$go1$1@fred.mathworks.com>...
> > Something like this?
> >
> > s = 'AACCGTTAACGT';
> > k = 3;
> >
> > d = double(s);
> > A = hankel(d(1:end-k+1),d(end-k+1:end));
> > [u i j] = unique(A,'rows');
> > b = zeros(length(i),1);
> > c = zeros(size(j));
> > for n=1:length(j)
> > jn = j(n);
> > b(jn) = b(jn)+1;
> > c(n) = b(jn);
> > end
> >
> > S = char(A)
> > c
> >
> > % Bruno
>
> How to convert S into a vector??

instead of just dumbly repeating your post, you should answer the questions people have asked you...
your performance here in CSSM is not boding well for your future...

us

Subject: word occurence counting (DNA)

From: Bruno Luong

Date: 14 Apr, 2010 08:15:22

Message: 21 of 22

"ambrosia nightwish" <mess_imen@yahoo.fr> wrote in message <hq3213$eua$1@fred.mathworks.com>...
>
> How to convert S into a vector??

Why do you want to convert it to a "vector" (whatever "vector" means; nobody besides you seems to know)?

To return the substring at 6th position, just call
substr = S(6,:)

The storage should be appropriate for any further manipulation. No need to convert to vector or vecteur or cell or arrow or flèche or pointeur etc
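For illustration, here are a few ways the char matrix S could be accessed or reshaped, depending on what "vector" is meant to be. This is a sketch under the assumption that S is built as in the earlier code; the names words and flat are illustrative.

```matlab
s = 'AACCGTTAACGT';
k = 3;
d = double(s);
S = char(hankel(d(1:end-k+1), d(end-k+1:end)));   % 10-by-3 char matrix, one word per row

substr = S(6,:);               % the word read at position 6
words  = cellstr(S);           % 10-by-1 cell array, one word per cell
flat   = reshape(S.', 1, []);  % all words concatenated row by row into one char row vector

disp(substr);
disp(flat);
```

Note that S(:)' walks down the columns of S, so it returns all first letters, then all second letters, and so on, rather than the words one after another.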

Bruno

Subject: word occurence counting (DNA)

From: ambrosia nightwish

Date: 15 Apr, 2010 15:20:25

Message: 22 of 22

The solution I was looking for was given by Bruno. Thank you all.
Best regards
