How can I save the beginning and end positions of each sequence in a cell array?

Question

0 votes

So I am looping through codons and recording them on a .txt file. The script works, but I need the sequence to begin at the starting codon position, stop at the end codon then continue through the cell array while recording all of the following start and end codon sequences. I would just like to know the best option I can use to tweak my code here. Thanks in advance!

fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%3s');
x = C{1}
fclose(fid);
%Start sequence
ss = 1;
% end sequence
es = 183479;
seq_id = long_codon(x(ss:es));

function seq = long_codon(v)
  seq = (v);
  for pos = 1:length(seq)
    if strcmp(seq{pos},'TAC')
        index = find(strcmp(v,seq{pos}));
        StartPos = index;
    elseif (strcmp(seq{pos},'ACT') || strcmp(seq{pos},'ATT') || strcmp(seq{pos},'ATC'))
        index = find(strcmp(v,seq{pos}));
        EndPos = index;   
  end
  end
  fid2 = fopen('report_long.txt','w+');
  fprintf(fid2,'Name: OP \n'); 
  fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
  fprintf(fid2,'Start Position of Gene is: %d \n',StartPos); 
  fprintf(fid2, 'End Position of Gene is: %d \n',EndPos);
  fclose(fid2);  
end

14 Comments
Show 12 older comments Hide 12 older comments

dpb on 27 Nov 2020

Open in MATLAB Online

The above code can't work -- neither if condition can ever be satisfied because you're trying to compare a string of length 1 (seq{pos}) to one of length 3 ('TAC') for the start condtion. Similarly for the subsequent ending conditions.

"strcmp(s1,s2) compares s1 and s2 and returns 1 (true) if the two are identical and 0 (false) otherwise. Text is considered identical if the size and content of each are the same. "

Also, it will fail because it doesn't require a START token be found first; it just goes on to the next character whether has found a beginning or not.

You need to stay in the first conditional loop until find a start; then start looking for an end.

It seems like strfind would be a lot more useful here.

There's a couple of interesting "features" in the string -- I dunno enuf about the subject to know if this is normal or not but--

seq=importdata('sequence_long2.txt');
ss = 1;
es = 183479;
seq=seq{:}(ss:es);
iStart=strfind(seq,'TAC');
>> whos iStart
  Name        Size              Bytes  Class     Attributes
  iStart      1x2405            19240  double              
>> iEnd=cell2mat(regexp(seq,{'ACT|ATT|ATC'}));
>> whos iStart, iEnd
  Name      Size              Bytes  Class     Attributes
  iEnd      1x7493            59944  double              
  iStart    1x2405            19240  double
>>

Shows there are some 3X the number of End than Start. This expected???

>> iStart(1:20)
ans =
  Columns 1 through 17
          74          89         101         279         399         461         711         797         846         864         890         965        1013        1044        1289        1382        1404
  Columns 18 through 20
        1496        1614        1622
>> iEnd(1:20)
ans =
    72    75   131   154   213   242   267   270   282   294   300   317   341   391   451   462   491   549   575   580
>> seq(iStart(1):iEnd(3))
ans =
    'TACTCTGTTGCAACCTACGATGTCCCCTACATGCCAGAAAGTGGTGTAATAGTCGACA'
>> 

Shows why the one-at-a-time stepping through the sequence won't work -- there's what looks like an 'ACT' stop code just one character after the first character of the first start code.

"TACT" contains "TAC" as 1:3 but "ACT" as 2:4.

If 'TAC' universally means a start codon (there can't be something embedded like the above that is the right sequence but not a possible valid start position), then I'd use the iStart array above and step through it looking for the end sequence.

It may be owing to the above kind of thing can't just compare that the next iEnd index is >3 from the iStart; I don't know if other permutations are allowable or not.

Austin Shipley on 27 Nov 2020

Edited: Austin Shipley on 27 Nov 2020

First, I would like to thank you for your thoughtful response here.

Second, after tinkering around with this code I have come to the conclusion that you are indeed correct, and this code can never work in its present format. I'm learning that a series of while loops will likely be the best solution to this problem; where I would define 'start_codon' and 'end_codon' variable vectors, along with corresponding counters that step through the array when specific characters are detected. Additionally, I'll see if I can make use of strfind here as well.

I don't have it functional yet, but I'll post it when more of it is completed.

"Shows there are some 3X the number of End than Start. This expected???"

I would say yes, because there are three different permutations that qualify as 'end codons'. Any one of those triplets (codons) should be recorded as an end codon if preceeded by 'TAC'. There just happens to be more end codons than start.

"TACT" contains "TAC" as 1:3 but "ACT" as 2:4.

This is a valid point, I certainly can't have my end condons being counted if they are merged with the start codon. Thank you for pointing that out, I will keep that in mind as I am restructuring the code.

Also, it was recommended to me that I should define a 'flag' variable that would take on one value as I am in search of beginning and end codons. Not entirely sure just yet how that is supposed to help me, but I'll keep messing around with it and hopefully see what works and what doesn't.

Austin Shipley on 27 Nov 2020

Edited: Austin Shipley on 27 Nov 2020

Open in MATLAB Online

So here is what I have so far:

fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%s');
x = C{1};
fclose(fid);
x_conv = char(x);
Start_loc = [];
End_loc = [];
i = 1;
while i < length(x_conv)
    
    if x_conv(i) == 'T' && x_conv(i+1) == 'A' && x_conv(i+2) == 'C'
        Start_loc = [Start_loc i];
        i = i + 3;
    elseif (x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'C') || (x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'T') || (x_conv(i) == 'A' && x_conv(i+1) == 'C' && x_conv(i+2) == 'T')
        End_loc = [End_loc i];
        i = i + 3;
        
    else
        i = i + 1;
    end
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: OP \n'); 
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d           End Position of Gene is: %d\n ',Start_loc,End_loc); 
fclose(fid2);  

I converted the cell array to a character array so I could index through individual characters with logical operators. I'm sure there is a more concise route I could have taken, but this started actually giving me some results, so I stuck with it.

So, I am getting a . txt file recording my start and end codons, but there is an issue I still need to resolve. Here I am indexing between the start and stop positions that appear on my .txt file:

x_conv(74:131)
ans = 'TACTCTGTTGCAACCTACGATGTCCCCTACATGCCAGAAAGTGGTGTAATAGTCGACA'

As you can see here, there is a start codon beginning at 74 through 76, 'TAC', but the .txt file records it at position 75, which would be 'ACT', and that should not be a start codon.

Any ideas how to prevent that from happening?

EDIT

Just realized that I had the wrong call variable in my fprintf statement for the 'Start' position. The same issue seems to arise still.

Austin Shipley on 27 Nov 2020

Open in MATLAB Online

UPDATE:

I added a 'flag' variable that would track the emergence of a start or stop codon. This seems to give me a better result, with the start codon being indexed correctly. However, the end positions do not appear to be any of the desired permutations.

fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%s');
x = C{1};
fclose(fid);
x_conv = char(x);
Start_loc = [];
End_loc = [];
flag = 0;
i = 1;
while i < length(x_conv)
    
    
    if (x_conv(i) == 'T' && x_conv(i+1) == 'A' && x_conv(i+2) == 'C') && flag == 0
        
        Start_loc = [Start_loc i];
        
        i = i + 3;
        
        flag = flag + 1;
        
    elseif ((x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'C') || (x_conv(i) == 'A' && x_conv(i+1) == 'C' && x_conv(i+2) == 'T') || (x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'T')) && flag == 1  
        End_loc = [End_loc i];
        i = i + 3;
        flag = flag - 1;
      
        
        
        
        
    else
        i = i + 1;
      
        
    end
    
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: OP \n'); 
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d           End Position of Gene is: %d\n ',Start_loc,End_loc); 
fclose(fid2);  

Rik on 28 Nov 2020

I would urge you to change to strfind first. Then you can loop through all start codons, removing later start codons if they are inside the gene being read.

Austin Shipley on 28 Nov 2020

Edited: Austin Shipley on 28 Nov 2020

Open in MATLAB Online

So I have been trying to use strfind, but I am still having this issue where my end codon positions are not being recorded correctly. Do I need to nest another while loop or am I just not using strfind properly?

fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%s');
x = C{1};
fclose(fid);
x_conv = char(x);
Start_loc = [];
End_loc = [];
flag = 0;
i = 1;
while i<(numel(x_conv)-2)
        
    
    if (strcmp(x_conv(i+[0 1 2]),'TAC')) && flag == 0 
        
        Start_loc = strfind(x_conv,'TAC');
        i = i + 3;
        flag = flag + 1;
       
        
    elseif ismember(x_conv(i+[0 1 2]),{'ATC','ACT','ATT'}) && flag == 1
        End_loc = [End_loc i];
        i = i + 3;
        flag = flag - 1;
    else
        i = i+1;     
        
    end
    
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: Austin  \n'); 
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d           End Position of Gene is: %d\n ',Start_loc,End_loc); 
fclose(fid2);  

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Rik on 29 Nov 2020

Open in MATLAB Online

0 votes

%Since your code is working fine you can keep it as is.
%I just used my own function to use your data.
x_conv=readfile('https://www.mathworks.com/matlabcentral/answers/uploaded_files/430218/sequence_long2.txt');
x_conv=x_conv{1};
%find all possible start codons and stop codons
Start_loc = strfind(x_conv,'TAC');
End_loc = cellfun(@(stopcodon)strfind(x_conv,stopcodon),{'ATC','ACT','ATT'},'UniformOutput',false);
End_loc = horzcat(End_loc{:});
n=0;
while n<numel(Start_loc)
    n=n+1;
    this_start=Start_loc(n);
    
    %select all possible end codons
    this_end=End_loc(End_loc>this_start);
    
    %figure out which is the first end codon with an offset of 3
    this_end=this_end(mod(this_end-this_start,3)==0);
    this_end=this_end(1);
    
    %now we need to remove elements in Start_loc that in the current gene
    Start_loc(Start_loc>this_start & Start_loc<this_end)=[];
    
    %store the end as well
    End_loc(n)=this_end;
end
%remove extra values in End_loc
End_loc((n+1):end)=[];
genes=cell(size(End_loc));
for n=1:numel(End_loc)
    genes{n}=x_conv(Start_loc(n):End_loc(n));
end

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

How can I save the beginning and end positions of each sequence in a cell array?

14 Comments
Show 12 older comments Hide 12 older comments

Accepted Answer

0 Comments
Show -2 older comments Hide -2 older comments

More Answers (0)

Categories

Tags

Community Treasure Hunt

How can I save the beginning and end positions of each sequence in a cell array?

14 Comments Show 12 older comments Hide 12 older comments

Accepted Answer

0 Comments Show -2 older comments Hide -2 older comments

More Answers (0)

Categories

Tags

See Also

Community Treasure Hunt

14 Comments
Show 12 older comments Hide 12 older comments

0 Comments
Show -2 older comments Hide -2 older comments