Thread Subject:
multiple substring vectorized

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 5 Jul, 2013 13:46:09

Message: 1 of 18

Hi !

I'm trying to vectorize this for loop

function[ out ] = GenKWord(seq , k)

p = (length(seq)-k);
out='null';

for i = 0 : p
    
    word = substring(seq,i,i+k-1);
    
    if(strcmp(out,'null'))
        out = {word};
    else
        out=cat(2,out,word);
    end
  
end


end


i wanted to use

i=[0:p]


and preallocate the size of word with

word=cell(1,p)

so that i can do something like this

word{:}=substring(seq, i , i+k-1);

to get all the values in one go without using cat.

I tried to look into function handles, but I couldn't figure out whether I need them for what I'm doing.

To put it simply: I know how many substrings I will get from the sequence seq, but the length of seq is different every time I call the function. So it would work even with something like

word{:}={substring(seq,1,1+k-1) substring(seq,2,2+k-1) ..... substring(seq,n,n+k-1)}

but the number of substring operations varies and is equal to p.

How can I do this?

thx :)

Subject: multiple substring vectorized

From: dpb

Date: 5 Jul, 2013 14:32:12

Message: 2 of 18

On 7/5/2013 8:46 AM, Fogato Abbestia wrote:
> Hi !
> I'm trying to vectorize this for loop
>
> function[ out ] = GenKWord(seq , k)
> p = (length(seq)-k);
> out='null';
> for i = 0 : p
> word = substring(seq,i,i+k-1);
> if(strcmp(out,'null'))
> out = {word};
> else
> out=cat(2,out,word);
> end
> end
> end
>
>
> i wanted to use
> i=[0:p]
>
> and preallocate the size of word with
> word=cell(1,p)
> so that i can do something like this
> word{:}=substring(seq, i , i+k-1);
> to get all the values in one go without using cat.
...

Let's back up a little and describe what the expected input/output are,
with a small sample...I'm not clear w/o spending more time than I want
trying to read the code to decipher just what it is you want, or the form
of the input, to clearly see how it might be vectorized.

--

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 5 Jul, 2013 15:20:10

Message: 3 of 18

the input is a DNA sequence like

seq = 'atcggttaat'

k is an integer and it's the size of the window on the string; I'm using it for the substring.

What this function does is

- take the sequence
- generate the subsequences of length k starting from the first character and shifting one by one every iteration.

What I want is to use a vectorized form instead of the for loop.

I read on the forum that you build a vector of values like this

i=[1:length(seq)]

and then using that variable is quicker than the for loop.

With this I wanted to

- first, preallocate the size of word
- second, save the results (the subsequences) in the cell array word in one go using the vector i

Right now I'm using a for loop and a concatenation of cells, but cat is quite time consuming, because my DNA sequences range from 5000 to 16450 characters.

Also, I would like to drop the if that checks whether the value I found is the first one or not.
Any other way to speed it up is also welcome :)

thx for answering

Subject: multiple substring vectorized

From: dpb

Date: 5 Jul, 2013 15:34:00

Message: 4 of 18

On 7/5/2013 10:20 AM, Fogato Abbestia wrote:
> the input is a DNA sequence like
>
> seq = 'atcggttaat'
> k is an integer and it's the size of the window on the string; I'm using it
> for the substring
>
> What this function does is
> - take the sequence
> - generate the subsequence of length k starting from the first character
> and shifting one by one every iteration.
...

OK, that's a start...

Now show an example calling sequence and what you want in
output...specifically, I'm not sure how to interpret "and shifting one
by one every iteration" .... is that one character or one subsequence
length?

--

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 5 Jul, 2013 16:15:10

Message: 5 of 18

for example

seq = 'atcggttaat'
k=3;

i want as output

word={'atc' 'tcg' 'cgg' 'ggt' 'gtt' 'tta' 'taa' 'aat'}


k= 6

i will have

word={'atcggt' 'tcggtt' 'cggtta' 'ggttaa' 'gttaat'}

the length of word is computed from

p=length(seq)-k

(the loop runs from 0 to p, so there are p+1 subsequences)

so if k is bigger there are fewer subsequences, which seems right to me.

Subject: multiple substring vectorized

From: dpb

Date: 5 Jul, 2013 17:07:48

Message: 6 of 18

On 7/5/2013 11:15 AM, Fogato Abbestia wrote:
> for example
>
> seq = 'atcggttaat'
> k=3;
>
> i want as output
>
> word={'atc' 'tcg' 'cgg' 'ggt' 'gtt' 'tta' 'taa' 'aat'}
>
...

OK, so it is length k and does step by a single character not the
pattern length...let's see...

Brute force--

N=length(seq)-(k-1); % number of tokens possible
word=cell(1,N);
for i=1:N
   word(i)={seq(i:i+k-1)};
end

Vectorized...hmmm....oh, ok, try this--

word=seq(cumsum([[1:k];ones(length(seq)-k,k)]));

This will return an N-by-k char array with one word per row, rather than a row vector of cells, if that's alright...
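
For readers following along, the indexing expression above can be unpacked on the thread's own example. This is only an illustrative sketch; the names idx and words are not from the original post. The first row of the stacked matrix is 1:k and every further row is all ones, so the cumulative sum down each column yields rows i:i+k-1:

```matlab
seq = 'atcggttaat';
k = 3;

% Stack 1:k on top of (length(seq)-k) rows of ones, then accumulate
% down the columns: row i of idx becomes i:i+k-1.
idx = cumsum([1:k; ones(length(seq)-k, k)]);

% Indexing a vector with a matrix returns an array shaped like the
% index matrix, so each row of words is one length-k window of seq.
words = seq(idx);
```

The result is a char matrix with length(seq)-k+1 rows, one word per row. If length(seq) equals k, the stacked matrix is a single row and cumsum would then run along that row; passing the dimension explicitly, cumsum(...,1), avoids that edge case.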

--

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 6 Jul, 2013 07:37:10

Message: 7 of 18

thx it worked ! :)

i needed a row but i can handle even a column :)

thx a lot :))))

Subject: multiple substring vectorized

From: dpb

Date: 6 Jul, 2013 09:37:33

Message: 8 of 18

On 7/6/2013 2:37 AM, Fogato Abbestia wrote:
> thx it worked ! :)
>
> i needed a row but i can handle even a column :)
>
...

Whatever form you wish...the above is a cell containing the 8 char
strings...if you prefer/need a cell array of a string/cell, then use
cellstr()...

 >> word={seq(cumsum([[1:k];ones(length(seq)-k,k)]))}
word =
     [8x3 char]
 >> word=cellstr(seq(cumsum([[1:k];ones(length(seq)-k,k)])))
word =
     'atc'
     'tcg'
     'cgg'
     'ggt'
     'gtt'
     'tta'
     'taa'
     'aat'
 >> whos word
   Name Size Bytes Class Attributes

   word 8x1 528 cell

For a row vector, just transpose...

 >> word=cellstr(seq(cumsum([[1:k];ones(length(seq)-k,k)])))'
word =
     'atc' 'tcg' 'cgg' 'ggt' 'gtt' 'tta' 'taa' 'aat'
 >>

Salt to suit...

--

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 6 Jul, 2013 11:08:07

Message: 9 of 18

dpb <none@non.net> wrote in message <kr8ogr$jc1$1@speranza.aioe.org>...
> On 7/6/2013 2:37 AM, Fogato Abbestia wrote:
> > thx it worked ! :)
> >
> > i needed a row but i can handle even a column :)
> >
> ...
>
> Whatever form you wish...the above is a cell containing the 8 char
> strings...if you prefer/need a cell array of a string/cell, then use
> cellstr()...
>
> >> word={seq(cumsum([[1:k];ones(length(seq)-k,k)]))}
> word =
> [8x3 char]
> >> word=cellstr(seq(cumsum([[1:k];ones(length(seq)-k,k)])))
> word =
> 'atc'
> 'tcg'
> 'cgg'
> 'ggt'
> 'gtt'
> 'tta'
> 'taa'
> 'aat'
> >> whos word
> Name Size Bytes Class Attributes
>
> word 8x1 528 cell
>
> For a row vector, just transpose...
>
> >> word=cellstr(seq(cumsum([[1:k];ones(length(seq)-k,k)])))'
> word =
> 'atc' 'tcg' 'cgg' 'ggt' 'gtt' 'tta' 'taa' 'aat'
> >>
>
> Salt to suit...
>
> --

I tried to transpose but it doesn't work; it still returns the column vector...

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 6 Jul, 2013 11:21:10

Message: 10 of 18

dpb <none@non.net> wrote in message <kr8ogr$jc1$1@speranza.aioe.org>...
> On 7/6/2013 2:37 AM, Fogato Abbestia wrote:
> > thx it worked ! :)
> >
> > i needed a row but i can handle even a column :)
> >
> ...
>
> Whatever form you wish...the above is a cell containing the 8 char
> strings...if you prefer/need a cell array of a string/cell, then use
> cellstr()...
>
> >> word={seq(cumsum([[1:k];ones(length(seq)-k,k)]))}
> word =
> [8x3 char]
> >> word=cellstr(seq(cumsum([[1:k];ones(length(seq)-k,k)])))
> word =
> 'atc'
> 'tcg'
> 'cgg'
> 'ggt'
> 'gtt'
> 'tta'
> 'taa'
> 'aat'
> >> whos word
> Name Size Bytes Class Attributes
>
> word 8x1 528 cell
>
> For a row vector, just transpose...
>
> >> word=cellstr(seq(cumsum([[1:k];ones(length(seq)-k,k)])))'
> word =
> 'atc' 'tcg' 'cgg' 'ggt' 'gtt' 'tta' 'taa' 'aat'
> >>
>
> Salt to suit...
>
> --

No wait, now it works! Man, I love you xD The time for generating the matrix was 75 seconds and now it's 25 sec!!! xD

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 6 Jul, 2013 11:30:13

Message: 11 of 18


> I tried to transpose but it doesn't work; it still returns the column vector...

Wait, now it works! Thx a lot for the help; generating the matrix took 74-76 sec and now takes 25 sec.

Can I ask one last thing?

I'm trying to search for common subsequences between 2 cell arrays (the ones generated with the code you made).

What I do is use intersect, but it takes a lot of time because it uses sort(), which I don't need.

I tried ismember() and then cellarr(ismember()) to get the elements faster, and I think it is faster, but I need to remove duplicates, and for this I used unique(cellarr), but it also uses that damn sort().

How can I do that?

thx

Subject: multiple substring vectorized

From: Marc

Date: 6 Jul, 2013 12:52:22

Message: 12 of 18

"Fogato Abbestia" wrote in message <kr8v45$ni4$1@newscl01ah.mathworks.com>...
>
> > I tried to transpose but it doesn't work; it still returns the column vector...
>
> Wait, now it works! Thx a lot for the help; generating the matrix took 74-76 sec and now takes 25 sec.
>
> Can I ask one last thing?
>
> I'm trying to search for common subsequences between 2 cell arrays (the ones generated with the code you made).
>
> What I do is use intersect, but it takes a lot of time because it uses sort(), which I don't need.
>
> I tried ismember() and then cellarr(ismember()) to get the elements faster, and I think it is faster, but I need to remove duplicates, and for this I used unique(cellarr), but it also uses that damn sort().
>
> How can I do that?
>
> thx

Have you looked at regexp() and cellfun()?

These may help you. I don't have time today to play but take a look at these to see if they help.

Subject: multiple substring vectorized

From: dpb

Date: 6 Jul, 2013 13:18:57

Message: 13 of 18

On 7/6/2013 6:08 AM, Fogato Abbestia wrote:
...

> I tried to transpose but it doesn't work; it still returns the column vector...


Show your work...can't tell why/what w/o knowing what you actually did.

As shown in the earlier post, it works either way given the input; that was
pasted from the command window.

 >> word=cellstr(seq(cumsum([[1:k];ones(length(seq)-k,k)])))'
word =
     'atc' 'tcg' 'cgg' 'ggt' 'gtt' 'tta' 'taa' 'aat'
 >>

--

Subject: multiple substring vectorized

From: james bejon

Date: 6 Jul, 2013 16:56:09

Message: 14 of 18

"Fogato Abbestia" wrote in message <kr8v45$ni4$1@newscl01ah.mathworks.com>...
>
> > I tried to transpose but it doesn't work; it still returns the column vector...
>
> Wait, now it works! Thx a lot for the help; generating the matrix took 74-76 sec and now takes 25 sec.
>
> Can I ask one last thing?
>
> I'm trying to search for common subsequences between 2 cell arrays (the ones generated with the code you made).
>
> What I do is use intersect, but it takes a lot of time because it uses sort(), which I don't need.
>
> I tried ismember() and then cellarr(ismember()) to get the elements faster, and I think it is faster, but I need to remove duplicates, and for this I used unique(cellarr), but it also uses that damn sort().
>
> How can I do that?
>
> thx

Do you have a maximum value of k in mind?

Also, what's the nature of your entire task, from start to finish?

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 8 Jul, 2013 10:37:10

Message: 15 of 18

I have this cell array named set; each element of set is a cell array. This inner cell array contains the name of the gene and all its subsequences, like this

set{1}

ans =

  Columns 1 through 3

    'aadK' 'atgcgaagtgagcagg' 'tgcgaagtgagcagga'

  Columns 4 through 6

    'gcgaagtgagcaggaa' 'cgaagtgagcaggaaa' 'gaagtgagcaggaaat'

  Columns 7 through 9

    'aagtgagcaggaaatg' 'agtgagcaggaaatga' 'gtgagcaggaaatgat'

The element set{1} has dimensions [1x1675], so I won't post it all. This example is for k=16, so each subsequence is 16 characters long.

The second element has the same layout, and so on up to the last of the 4106 arrays/genes.

Now I need to find, for each gene/array, whether it has subsequences in common with each of the other genes. So I match 2 arrays at a time, keeping the first fixed and changing the second, and do this for all the n-1 other genes; on the next iteration I'm on the second gene and need to match it only with the remaining n-2, because I already matched it with the first.


I match 2 arrays (without the gene names) with intersect, but this function calls unique internally, which uses sort().
I don't actually need the elements sorted, and I need a quicker program, because I have 4106 genes and each one takes from 25 to 68 sec.

Of course the time gets shorter each time I move to the next gene, because there are fewer genes left to match, but it is still a long time, and for k<16 it will take longer because there are more subsequences. I need to run the program for k=6:24.

Subject: multiple substring vectorized

From: james bejon

Date: 8 Jul, 2013 12:27:11

Message: 16 of 18

I'm not 100% sure I'm with you. You've basically got a cell array of cell arrays. And you need to work out if the strings in the first element can be matched with a subsequence where exactly?

Suppose, for instance, your cell array is as follows:

{{'ac', 'tg'}, {'actgac', 'accaac'}, {'actgac', 'cacaca'}, {'ggggtg', 'tgtgtg'}}

What should the answer be? Or what should the cell array look like if this isn't a good example of a starting-point?

Subject: multiple substring vectorized

From: Fogato Abbestia

Date: 8 Jul, 2013 13:17:29

Message: 17 of 18

"james bejon" wrote in message <kreb6u$ne6$1@newscl01ah.mathworks.com>...
> I'm not 100% sure I'm with you. You've basically got a cell array of cell arrays. And you need to work out if the strings in the first element can be matched with a subsequence where exactly?
>
> Suppose, for instance, your cell array is as follows:
>
> {{'ac', 'tg'}, {'actgac', 'accaac'}, {'actgac', 'cacaca'}, {'ggggtg', 'tgtgtg'}}
>
> What should the answer be? Or what should the cell array look like if this isn't a good example of a starting-point?

You missed the point that all the sequences in all the arrays have the same length, which is k.

So I will have a cell array of cell arrays, as you understood.

With k=5, for example:

{{'acaag' 'gttca'} {'aaaaa' 'tttgg' 'acact' 'acaag'} {'gttca' 'aaaaa' 'tttgg' 'tatat'}}

What I want are the common elements between the 1st and the 2nd, and between the 1st and the 3rd.
After that I shift the starting position to the 2nd and compare the 2nd with the 3rd.

The result will be saved as a cell array

{{namegene1 namegene2 common_elements} {namegene1 namegene3 common_elements} {namegene2 namegene3 common_elements}}

because after that I use cat(1,cellarray of cell arrays{:})

for the names i have no problem , the match will find common_elements


The results I want for common_elements are

{'acaag'} 1st and 2nd

so {name1 name2 'acaag'}

1st and 3rd {name1 name3 'gttca'}

2nd and 3rd {name2 name3 'aaaaa' 'tttgg'}

which are saved like this
cellarray={{name1 name2 'acaag'} {name1 name3 'gttca'} {name2 name3 'aaaaa' 'tttgg'}}

I actually get this right, but intersect is really time consuming, so I need a quicker way to do it.
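
One way to express the pairwise matching described above is with ismember, which accepts cell arrays of strings. This is a minimal sketch on the k=5 example, not a tuned solution: the names list and the variable names are placeholders, and whether it beats intersect on the real data would need to be timed.

```matlab
% Example data from the post; names holds placeholder gene names.
genes = {{'acaag' 'gttca'}, ...
         {'aaaaa' 'tttgg' 'acact' 'acaag'}, ...
         {'gttca' 'aaaaa' 'tttgg' 'tatat'}};
names = {'name1', 'name2', 'name3'};

n = numel(genes);
result = cell(1, n*(n-1)/2);   % one slot per unordered pair, preallocated
r = 0;
for a = 1:n-1
    for b = a+1:n
        % Keep the words of gene a that also occur in gene b,
        % in the order they appear in gene a (no sorting of the output).
        common = genes{a}(ismember(genes{a}, genes{b}));
        r = r + 1;
        result{r} = [names(a), names(b), common];
    end
end
```

If a gene contains repeated subsequences, common can contain repeats too; unique(common,'stable') removes them while keeping first-occurrence order, in releases that support the 'stable' option.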

Subject: multiple substring vectorized

From: james bejon

Date: 8 Jul, 2013 14:47:10

Message: 18 of 18


% OK. For a start, I'd recommend storing your strings as 2-d character arrays rather than cells. (Processing them will be much quicker.)

% So if you start with:
c = {{'acaag' 'gttca'} {'aaaaa' 'tttgg' 'acact' 'acaag'}};

% Then you can convert them as follows
c = cellfun(@char, c, 'UniformOutput', 0);

% (Or, better still, read them in as character arrays in the first place)

% You can then match them as follows

[ynflag, subs] = ismember(c{1}, c{2}, 'rows');

% And recover like this:

c{2}(subs(ynflag), :)




% A more radical move would be to convert your data into a more compact form. There are only four bases, right? So…


% Starting data
s = {{'acaag' 'gttca'} {'aaaaa' 'tttgg' 'acact' 'acaag'}};


% Copy it
c = {{'acaag' 'gttca'} {'aaaaa' 'tttgg' 'acact' 'acaag'}};


% Convert "c" to integers
for i = 1:2

  c{i} = double(char(c{i}));
  c{i}(c{i} == 'a') = 0;
  c{i}(c{i} == 'c') = 1;
  c{i}(c{i} == 't') = 2;
  c{i}(c{i} == 'g') = 3;
  v = 4.^(0:size(c{i}, 2)-1).';
  c{i} = c{i} * v;

end


% Look up as before (but with no need for "rows" flag)
[ynflag, subs] = ismember(c{1}, c{2});


% Recover answers
s{2}(subs(ynflag))



% Though you might have to do a bit of extra work to handle long sequences
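
On the closing remark about long sequences: the base-4 code for a word of length k can be as large as 4^k - 1, and a double represents every integer exactly only up to 2^53, so the encoding above is safe for k up to 26 (4^26 = 2^52), which covers the k=6:24 range mentioned earlier in the thread. The sketch below, with illustrative names (lut, digits, idx, codes) that are not from the original posts, builds the codes straight from the sequence by combining a lookup table with the cumsum indexing from earlier in the thread:

```matlab
seq = 'atcggttaat';
k = 3;
assert(k <= 26, 'base-4 codes are only exact in a double for k <= 26');

% Lookup table indexed by character code, using the same digit
% assignment as above: a=0, c=1, t=2, g=3.
lut = zeros(1, 256);
lut(double('actg')) = 0:3;
digits = lut(double(seq));          % one base-4 digit per character of seq

% Row i of idx is i:i+k-1 (explicit dimension so a single-row case still works).
idx = cumsum([1:k; ones(numel(seq)-k, k)], 1);

% One integer per length-k word, first character least significant.
codes = digits(idx) * 4.^(0:k-1).';
```

Two such code vectors can then be compared with ismember exactly as in the lookup step above, without building the cell array of strings first.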
