Calculating CG content between delimiters

Rokas on 14 Dec 2014
Commented: Star Strider on 14 Dec 2014
Hi, I'm new to MATLAB and I'm having this problem. For example: I have DNA sequences in one string, separated by the delimiter 'y'.
A= CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG y CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC y CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
I need a loop that counts the nucleotides (C, G) between the delimiters 'y' separately and puts the answers into a cell array.
b=0;
for i=1:length(A);
if A(i)~='y'; % if A(i) is not the delimiter
b=b+1; % at first I'm just trying to count all nucleotides
% I don't know how to put the answer into a cell
elseif A(i)=='y';
% if A(i) equals the delimiter 'y', count nucleotides afresh until the next 'y' and put the answer into a cell
end
end
Any bright ideas?

Answers (1)

Star Strider on 14 Dec 2014
If you have the strsplit function, this is straightforward:
A = 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG y CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC y CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT';
Aseg = strsplit(A);
CGpos = strfind(Aseg(1:2:end), 'CG');
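The reason for the 1:2:end indexing may not be obvious: with its default whitespace delimiter, strsplit returns the 'y' markers as cells of their own, so the sequences sit in the odd-numbered cells. A minimal sketch on a short made-up string (the string and the name ‘parts’ are assumptions for illustration only):

```matlab
A = 'CCTGA y GGCAT y ATTAC';   % short hypothetical input, same layout as the question
parts = strsplit(A);           % splits on whitespace: sequences and 'y' markers alternate
Aseg = parts(1:2:end);         % keep the odd-numbered cells, i.e. the sequences only
disp(parts)
disp(Aseg)
```

Here ‘parts’ has five cells (three sequences and two 'y' markers) and ‘Aseg’ has three.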
Without strsplit, it requires a loop:
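yidx = [1 strfind(A, ' y ') length(A)]; % Find Delimiters
Aseg = cell(1, length(yidx)-1); % Preallocate (also clears any earlier ‘Aseg’)
for k1 = 1:length(yidx)-1
Aseg{k1} = A(yidx(k1):yidx(k1+1));
if k1 > 1
Aseg{k1} = Aseg{k1}(3:end); % Drop the leading ' y'
end
Aseg{k1} = strtrim(Aseg{k1}); % Remove the spaces left over from the delimiter
end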
CGpos = strfind(Aseg,'CG');
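For comparison, the character-by-character loop from the question can also be completed directly. This is only a sketch under the assumption that 'y' never occurs inside a sequence; the short input string and the name ‘counts’ are made up for illustration:

```matlab
A = 'CCTGA y GGCAT y ATTAC';   % short hypothetical input; the real string works the same way
counts = {};                   % one cell per sequence
b = 0;                         % running C/G count for the current sequence
for i = 1:length(A)
    if A(i) == 'y'                      % delimiter: store the count and start over
        counts{end+1} = b;
        b = 0;
    elseif A(i) == 'C' || A(i) == 'G'   % count only cytosine and guanine
        b = b + 1;
    end
end
counts{end+1} = b;             % the last sequence has no trailing delimiter
disp(counts)
```

For this input the three sequences contain 3, 3, and 1 C/G nucleotides respectively.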
Note that this finds occurrences of the two-letter pattern 'CG' (a C immediately followed by a G). If you want individual occurrences of cytosine and guanine (and you’ve used strsplit), search for them individually:
Cpos = strfind(Aseg(1:2:end), 'C');
Gpos = strfind(Aseg(1:2:end), 'G');
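The difference between the two searches shows up clearly on a tiny example. The two sequences below are made up for illustration and are not from the question:

```matlab
seg = {'ACGCGT', 'GGCC'};        % hypothetical short sequences
CGpos = strfind(seg, 'CG');      % start positions of the pattern 'CG'
Cpos = strfind(seg, 'C');        % positions of every C
Gpos = strfind(seg, 'G');        % positions of every G
nCG = cellfun(@length, CGpos);   % pattern count per sequence
nC = cellfun(@length, Cpos);     % C count per sequence
nG = cellfun(@length, Gpos);     % G count per sequence
disp([nCG; nC; nG])
```

The second sequence contains two C and two G, yet no 'CG' pattern at all, because its only adjacent pair of the two letters is 'GC'.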
To find the numbers of occurrences in each sequence (regardless of the way you calculated ‘CGpos’):
CGsum = cellfun(@length, CGpos);
So, use strsplit if you have it, to avoid the loop. Note that the first argument to strfind when calculating CGpos depends on the approach: Aseg(1:2:end) if you used strsplit, or Aseg if you used the loop.
  3 Comments
Rokas on 14 Dec 2014
CG=[Csum; Gsum]
CGsum=sum(CGsum);
Oh my, it was simple.
Star Strider on 14 Dec 2014
My pleasure!
(The most sincere expression of thanks here on MATLAB Answers is to Accept the Answer that most closely solves your problem.)
—————
Note that the rest of this assumes the variables are defined as follows:
% With ‘strsplit’:
A = {A}; % ‘A’ As Cell Array Here
Qseg = strsplit(A{:});
Aseg = Qseg(1:2:end);
That said . . .
Don’t add ‘CGpos’, ‘Cpos’, or ‘Gpos’ to ‘CGsum’. They’re different variable types. ‘CGpos’ (and ‘Cpos’ and ‘Gpos’) are cell arrays holding the match positions, while ‘CGsum’ (and ‘Csum’ and ‘Gsum’, respectively) are double arrays in which each element is the number of occurrences in one sequence.
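A quick way to see the type difference is to inspect the classes. A minimal sketch on two made-up sequences (not the data from the question):

```matlab
Aseg = {'CCTGA', 'GGCAT'};       % hypothetical short sequences
Cpos = strfind(Aseg, 'C');       % cell array: one vector of positions per sequence
Csum = cellfun(@length, Cpos);   % double array: one count per sequence
disp(class(Cpos))                % shows the class of the positions variable
disp(class(Csum))                % shows the class of the counts variable
```

Because ‘Cpos’ is a cell array, arithmetic such as Cpos + Csum is not defined; only the count arrays can be added or summed.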
If you want to sum the ‘C’ and ‘G’ occurrences separately, I would keep the sums separate:
Cpos = strfind(Aseg, 'C');
Gpos = strfind(Aseg, 'G');
Csum = cellfun(@length, Cpos);
Gsum = cellfun(@length, Gpos);
then sum them individually if you want. In my code, ‘CGsum’, ‘Csum’, and ‘Gsum’ are entirely different. As written here, they give you the numbers for ‘CG’ (or ‘C’ and ‘G’ individually) in each sequence. If you want the total numbers in all sequences, just sum them:
SCsum = sum(Csum);
SGsum = sum(Gsum);
The same applies to the sum of the ‘CG’ sequences:
CGsum = cellfun(@length, CGpos);
SCGsum = sum(CGsum);
That should work without problems.
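If the goal is the CG (GC) content of each sequence rather than raw counts, the per-sequence counts can be divided by the sequence lengths. This is a sketch under assumptions: the short input string is made up, regexp with 'split' is swapped in for strsplit so that no 'y' cells need to be skipped, and the names ‘seqLen’, ‘GCfrac’, and ‘GCcell’ are hypothetical:

```matlab
A = 'CCTGA y GGCAT y ATTAC';                   % short hypothetical input
Aseg = regexp(A, '\s*y\s*', 'split');          % sequences only; assumes 'y' is never a nucleotide
Csum = cellfun(@length, strfind(Aseg, 'C'));   % C count per sequence
Gsum = cellfun(@length, strfind(Aseg, 'G'));   % G count per sequence
seqLen = cellfun(@length, Aseg);               % length of each sequence
GCfrac = (Csum + Gsum) ./ seqLen;              % fraction of C and G in each sequence
GCcell = num2cell(Csum + Gsum);                % the combined counts as a cell array
disp(GCfrac)
```

num2cell is there only because the original question asked for the answers in a cell array; the double arrays are usually easier to work with.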
