How can I save the beginning and end positions of each sequence in a cell array?
Show older comments
So I am looping through codons and recording them on a .txt file. The script works, but I need the sequence to begin at the starting codon position, stop at the end codon then continue through the cell array while recording all of the following start and end codon sequences. I would just like to know the best option I can use to tweak my code here. Thanks in advance!
fid = fopen("sequence_long2.txt",'r');
C = textscan(fid,'%3s');
x = C{1}
fclose(fid);
%Start sequence
ss = 1;
% end sequence
es = 183479;
seq_id = long_codon(x(ss:es));
function seq = long_codon(v)
seq = (v);
for pos = 1:length(seq)
if strcmp(seq{pos},'TAC')
index = find(strcmp(v,seq{pos}));
StartPos = index;
elseif (strcmp(seq{pos},'ACT') || strcmp(seq{pos},'ATT') || strcmp(seq{pos},'ATC'))
index = find(strcmp(v,seq{pos}));
EndPos = index;
end
end
fid2 = fopen('report_long.txt','w+');
fprintf(fid2,'Name: OP \n');
fprintf(fid2,'Lab 13: DNA Pattern Matching\n \n');
fprintf(fid2,'Start Position of Gene is: %d \n',StartPos);
fprintf(fid2, 'End Position of Gene is: %d \n',EndPos);
fclose(fid2);
end
14 Comments
dpb
on 26 Nov 2020
Need a more complete description of what input is and what is wanted...probably somebody knowledgeable in the field knows that, but not all of us are...
If there's a beginning sequence string like 'TAC' and ending alternate strings, why not just use strfind to locate each instead of character-by-character looping?
Austin Shipley
on 26 Nov 2020
Edited: Austin Shipley
on 26 Nov 2020
dpb
on 27 Nov 2020
The above code can't work -- neither if condition can ever be satisfied because you're trying to compare a string of length 1 (seq{pos}) to one of length 3 ('TAC') for the start condtion. Similarly for the subsequent ending conditions.
"strcmp(s1,s2) compares s1 and s2 and returns 1 (true) if the two are identical and 0 (false) otherwise. Text is considered identical if the size and content of each are the same. "
Also, it will fail because it doesn't require a START token be found first; it just goes on to the next character whether has found a beginning or not.
You need to stay in the first conditional loop until find a start; then start looking for an end.
It seems like strfind would be a lot more useful here.
There's a couple of interesting "features" in the string -- I dunno enuf about the subject to know if this is normal or not but--
seq=importdata('sequence_long2.txt');
ss = 1;
es = 183479;
seq=seq{:}(ss:es);
iStart=strfind(seq,'TAC');
>> whos iStart
Name Size Bytes Class Attributes
iStart 1x2405 19240 double
>> iEnd=cell2mat(regexp(seq,{'ACT|ATT|ATC'}));
>> whos iStart, iEnd
Name Size Bytes Class Attributes
iEnd 1x7493 59944 double
iStart 1x2405 19240 double
>>
Shows there are some 3X the number of End than Start. This expected???
>> iStart(1:20)
ans =
Columns 1 through 17
74 89 101 279 399 461 711 797 846 864 890 965 1013 1044 1289 1382 1404
Columns 18 through 20
1496 1614 1622
>> iEnd(1:20)
ans =
72 75 131 154 213 242 267 270 282 294 300 317 341 391 451 462 491 549 575 580
>> seq(iStart(1):iEnd(3))
ans =
'TACTCTGTTGCAACCTACGATGTCCCCTACATGCCAGAAAGTGGTGTAATAGTCGACA'
>>
Shows why the one-at-a-time stepping through the sequence won't work -- there's what looks like an 'ACT' stop code just one character after the first character of the first start code.
"TACT" contains "TAC" as 1:3 but "ACT" as 2:4.
If 'TAC' universally means a start codon (there can't be something embedded like the above that is the right sequence but not a possible valid start position), then I'd use the iStart array above and step through it looking for the end sequence.
It may be owing to the above kind of thing can't just compare that the next iEnd index is >3 from the iStart; I don't know if other permutations are allowable or not.
Austin Shipley
on 27 Nov 2020
Edited: Austin Shipley
on 27 Nov 2020
Rik
on 27 Nov 2020
Because you need to read in triplets, can't you filter away all end codons whose location is not a multiple of 3 away from the previous start codon? That would mean you need 1 call to strfind to find start codons, and a while loop to go through them. (not a for loop, since some start codons might occur in a protein with an offset, so not every location will be a start codon)
dpb
on 27 Nov 2020
"Because you need to read in triplets,..."
That was my first idea was to reshape() by [],3 but then I discover that there are start codons that aren't at a mutliple of three into the string...
As noted, I don't know enough about the actual gene structure sequence to know what is/isn't valid so what simplifying assumptions can be made.
Rik
on 27 Nov 2020
What I know about genetics is that from the start codon onwards the gene is read in triplets (ignoring introns and exons for now...), but from an end codon to the next start does not have to be a triplet.
Austin Shipley
on 27 Nov 2020
Edited: Austin Shipley
on 27 Nov 2020
Austin Shipley
on 27 Nov 2020
Rik
on 27 Nov 2020
Why not use strfind(x_con,'TAC')?
At least using this will clean up your syntax a tiny bit:
% Replace this
(x_conv(i) == 'T' && x_conv(i+1) == 'A' && x_conv(i+2) == 'C')
% With this
strcmp(x_conv(i+[0 1 2]),'TAC')
Also, you should probably not let your while loop run until the end of your vector:
while i<(numel(x_conv)-2)
Austin Shipley
on 27 Nov 2020
Austin Shipley
on 28 Nov 2020
Rik
on 28 Nov 2020
I would urge you to change to strfind first. Then you can loop through all start codons, removing later start codons if they are inside the gene being read.
Austin Shipley
on 28 Nov 2020
Edited: Austin Shipley
on 28 Nov 2020
Accepted Answer
More Answers (0)
Categories
Find more on Genomics and Next Generation Sequencing in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!