How can I save the beginning and end positions of each sequence in a cell array?

So I am looping through codons and recording them on a .txt file. The script works, but I need the sequence to begin at the starting codon position, stop at the end codon then continue through the cell array while recording all of the following start and end codon sequences. I would just like to know the best option I can use to tweak my code here. Thanks in advance!
fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%3s');
x = C{1}
fclose(fid);
%Start sequence
ss = 1;
% end sequence
es = 183479;
seq_id = long_codon(x(ss:es));
function seq = long_codon(v)
seq = (v);
for pos = 1:length(seq)
if strcmp(seq{pos},'TAC')
index = find(strcmp(v,seq{pos}));
StartPos = index;
elseif (strcmp(seq{pos},'ACT') || strcmp(seq{pos},'ATT') || strcmp(seq{pos},'ATC'))
index = find(strcmp(v,seq{pos}));
EndPos = index;
end
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: OP \n');
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d \n',StartPos);
fprintf(fid2, 'End Position of Gene is: %d \n',EndPos);
fclose(fid2);
end

14 Comments

Need a more complete description of what input is and what is wanted...probably somebody knowledgeable in the field knows that, but not all of us are...
If there's a beginning sequence string like 'TAC' and ending alternate strings, why not just use strfind to locate each instead of character-by-character looping?
The idea is to loop through the genome file, 'sequence_long2.txt' and once the element 'TAC' is identified, record that index value into the text file 'report_long.txt', stop searching for that start codon, and begin searching for the end codons, 'ACT', 'ATC', or 'ATT'. Once any of the end codons are identified, record the end position index then proceed to switch tasks and search for another gene (i.e. start and end positions of the same codon elements mentioned above) until the entire file has been scanned.
The above code can't work -- neither if condition can ever be satisfied because you're trying to compare a string of length 1 (seq{pos}) to one of length 3 ('TAC') for the start condtion. Similarly for the subsequent ending conditions.
"strcmp(s1,s2) compares s1 and s2 and returns 1 (true) if the two are identical and 0 (false) otherwise. Text is considered identical if the size and content of each are the same. "
Also, it will fail because it doesn't require a START token be found first; it just goes on to the next character whether has found a beginning or not.
You need to stay in the first conditional loop until find a start; then start looking for an end.
It seems like strfind would be a lot more useful here.
There's a couple of interesting "features" in the string -- I dunno enuf about the subject to know if this is normal or not but--
seq=importdata('sequence_long2.txt');
ss = 1;
es = 183479;
seq=seq{:}(ss:es);
iStart=strfind(seq,'TAC');
>> whos iStart
Name Size Bytes Class Attributes
iStart 1x2405 19240 double
>> iEnd=cell2mat(regexp(seq,{'ACT|ATT|ATC'}));
>> whos iStart, iEnd
Name Size Bytes Class Attributes
iEnd 1x7493 59944 double
iStart 1x2405 19240 double
>>
Shows there are some 3X the number of End than Start. This expected???
>> iStart(1:20)
ans =
Columns 1 through 17
74 89 101 279 399 461 711 797 846 864 890 965 1013 1044 1289 1382 1404
Columns 18 through 20
1496 1614 1622
>> iEnd(1:20)
ans =
72 75 131 154 213 242 267 270 282 294 300 317 341 391 451 462 491 549 575 580
>> seq(iStart(1):iEnd(3))
ans =
'TACTCTGTTGCAACCTACGATGTCCCCTACATGCCAGAAAGTGGTGTAATAGTCGACA'
>>
Shows why the one-at-a-time stepping through the sequence won't work -- there's what looks like an 'ACT' stop code just one character after the first character of the first start code.
"TACT" contains "TAC" as 1:3 but "ACT" as 2:4.
If 'TAC' universally means a start codon (there can't be something embedded like the above that is the right sequence but not a possible valid start position), then I'd use the iStart array above and step through it looking for the end sequence.
It may be owing to the above kind of thing can't just compare that the next iEnd index is >3 from the iStart; I don't know if other permutations are allowable or not.
First, I would like to thank you for your thoughtful response here.
Second, after tinkering around with this code I have come to the conclusion that you are indeed correct, and this code can never work in its present format. I'm learning that a series of while loops will likely be the best solution to this problem; where I would define 'start_codon' and 'end_codon' variable vectors, along with corresponding counters that step through the array when specific characters are detected. Additionally, I'll see if I can make use of strfind here as well.
I don't have it functional yet, but I'll post it when more of it is completed.
"Shows there are some 3X the number of End than Start. This expected???"
  • I would say yes, because there are three different permutations that qualify as 'end codons'. Any one of those triplets (codons) should be recorded as an end codon if preceeded by 'TAC'. There just happens to be more end codons than start.
"TACT" contains "TAC" as 1:3 but "ACT" as 2:4.
  • This is a valid point, I certainly can't have my end condons being counted if they are merged with the start codon. Thank you for pointing that out, I will keep that in mind as I am restructuring the code.
Also, it was recommended to me that I should define a 'flag' variable that would take on one value as I am in search of beginning and end codons. Not entirely sure just yet how that is supposed to help me, but I'll keep messing around with it and hopefully see what works and what doesn't.
Because you need to read in triplets, can't you filter away all end codons whose location is not a multiple of 3 away from the previous start codon? That would mean you need 1 call to strfind to find start codons, and a while loop to go through them. (not a for loop, since some start codons might occur in a protein with an offset, so not every location will be a start codon)
"Because you need to read in triplets,..."
That was my first idea was to reshape() by [],3 but then I discover that there are start codons that aren't at a mutliple of three into the string...
As noted, I don't know enough about the actual gene structure sequence to know what is/isn't valid so what simplifying assumptions can be made.
What I know about genetics is that from the start codon onwards the gene is read in triplets (ignoring introns and exons for now...), but from an end codon to the next start does not have to be a triplet.
So here is what I have so far:
fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%s');
x = C{1};
fclose(fid);
x_conv = char(x);
Start_loc = [];
End_loc = [];
i = 1;
while i < length(x_conv)
if x_conv(i) == 'T' && x_conv(i+1) == 'A' && x_conv(i+2) == 'C'
Start_loc = [Start_loc i];
i = i + 3;
elseif (x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'C') || (x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'T') || (x_conv(i) == 'A' && x_conv(i+1) == 'C' && x_conv(i+2) == 'T')
End_loc = [End_loc i];
i = i + 3;
else
i = i + 1;
end
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: OP \n');
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d End Position of Gene is: %d\n ',Start_loc,End_loc);
fclose(fid2);
I converted the cell array to a character array so I could index through individual characters with logical operators. I'm sure there is a more concise route I could have taken, but this started actually giving me some results, so I stuck with it.
So, I am getting a . txt file recording my start and end codons, but there is an issue I still need to resolve. Here I am indexing between the start and stop positions that appear on my .txt file:
x_conv(74:131)
ans = 'TACTCTGTTGCAACCTACGATGTCCCCTACATGCCAGAAAGTGGTGTAATAGTCGACA'
As you can see here, there is a start codon beginning at 74 through 76, 'TAC', but the .txt file records it at position 75, which would be 'ACT', and that should not be a start codon.
Any ideas how to prevent that from happening?
EDIT
Just realized that I had the wrong call variable in my fprintf statement for the 'Start' position. The same issue seems to arise still.
UPDATE:
I added a 'flag' variable that would track the emergence of a start or stop codon. This seems to give me a better result, with the start codon being indexed correctly. However, the end positions do not appear to be any of the desired permutations.
fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%s');
x = C{1};
fclose(fid);
x_conv = char(x);
Start_loc = [];
End_loc = [];
flag = 0;
i = 1;
while i < length(x_conv)
if (x_conv(i) == 'T' && x_conv(i+1) == 'A' && x_conv(i+2) == 'C') && flag == 0
Start_loc = [Start_loc i];
i = i + 3;
flag = flag + 1;
elseif ((x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'C') || (x_conv(i) == 'A' && x_conv(i+1) == 'C' && x_conv(i+2) == 'T') || (x_conv(i) == 'A' && x_conv(i+1) == 'T' && x_conv(i+2) == 'T')) && flag == 1
End_loc = [End_loc i];
i = i + 3;
flag = flag - 1;
else
i = i + 1;
end
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: OP \n');
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d End Position of Gene is: %d\n ',Start_loc,End_loc);
fclose(fid2);
Why not use strfind(x_con,'TAC')?
At least using this will clean up your syntax a tiny bit:
% Replace this
(x_conv(i) == 'T' && x_conv(i+1) == 'A' && x_conv(i+2) == 'C')
% With this
strcmp(x_conv(i+[0 1 2]),'TAC')
Also, you should probably not let your while loop run until the end of your vector:
while i<(numel(x_conv)-2)
I suppose I don't really have a good reason as to why I didn't use strfind. I was just manipulating the code with some more familiar prompts, and this just started giving me results.
All valid points. I'll clean this up a bit. Thank you.
My primary issue right now is that the end codon it records is the beginning of where a start codon should begin 'TAC'. For example:
x_conv(74:279)
ans = 'TACTCTGTTGCAAC...............AAAGACTATTCTCTGCT' %the end element here is 'T'
If I index up to 281, you would see that the codon it is indexing the position of is 'TAC', which is obviously wrong.
It doesn't seem to be notice the elseif statement at all. I would need a way of stopping the loop after it locates 'TAC' then pass control to the second condition. I thought that maybe a break or continue in the loop would give me what I need, but that doesn't seem to be working. Presumably, I might need to nest another while loop with a different condition to be fulfilled, but I'm not completely sure how to implement that just yet.
Advice is greatly appreciated.
I would urge you to change to strfind first. Then you can loop through all start codons, removing later start codons if they are inside the gene being read.
So I have been trying to use strfind, but I am still having this issue where my end codon positions are not being recorded correctly. Do I need to nest another while loop or am I just not using strfind properly?
fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%s');
x = C{1};
fclose(fid);
x_conv = char(x);
Start_loc = [];
End_loc = [];
flag = 0;
i = 1;
while i<(numel(x_conv)-2)
if (strcmp(x_conv(i+[0 1 2]),'TAC')) && flag == 0
Start_loc = strfind(x_conv,'TAC');
i = i + 3;
flag = flag + 1;
elseif ismember(x_conv(i+[0 1 2]),{'ATC','ACT','ATT'}) && flag == 1
End_loc = [End_loc i];
i = i + 3;
flag = flag - 1;
else
i = i+1;
end
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: Austin \n');
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d End Position of Gene is: %d\n ',Start_loc,End_loc);
fclose(fid2);

Sign in to comment.

 Accepted Answer

%Since your code is working fine you can keep it as is.
%I just used my own function to use your data.
x_conv=readfile('https://www.mathworks.com/matlabcentral/answers/uploaded_files/430218/sequence_long2.txt');
x_conv=x_conv{1};
%find all possible start codons and stop codons
Start_loc = strfind(x_conv,'TAC');
End_loc = cellfun(@(stopcodon)strfind(x_conv,stopcodon),{'ATC','ACT','ATT'},'UniformOutput',false);
End_loc = horzcat(End_loc{:});
n=0;
while n<numel(Start_loc)
n=n+1;
this_start=Start_loc(n);
%select all possible end codons
this_end=End_loc(End_loc>this_start);
%figure out which is the first end codon with an offset of 3
this_end=this_end(mod(this_end-this_start,3)==0);
this_end=this_end(1);
%now we need to remove elements in Start_loc that in the current gene
Start_loc(Start_loc>this_start & Start_loc<this_end)=[];
%store the end as well
End_loc(n)=this_end;
end
%remove extra values in End_loc
End_loc((n+1):end)=[];
genes=cell(size(End_loc));
for n=1:numel(End_loc)
genes{n}=x_conv(Start_loc(n):End_loc(n));
end

More Answers (0)

Categories

Find more on Genomics and Next Generation Sequencing in Help Center and File Exchange

Asked:

on 26 Nov 2020

Answered:

Rik
on 29 Nov 2020

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!