# problem calculating nucleotide percentages

7 views (last 30 days)
John Jamison on 22 Apr 2017
Edited: dpb on 24 Apr 2017
Hi all,
I am writing a program to do several things, first off to calculate the percentage of 4 nucleotides of RNA (A, U, C, G), given a DNA input(A, C, G, T), and output it to a new file, showing the RNA sequence and nucleotide percentages (ie. 50% U). I have the following code, but I cannot get it to correctly calculate the frequency—I don't know how to get the total number of nucleotides, since I can't get it work with length(filename)....how Can I fix this? See code below.
Thanks
%DNA Transcription
%Goal: read in a file to convert from DNA to RNA:
% C -> G
% G -> C
% T -> A
% A -> U
dnaFile = 0;
while dnaFile < 1
filename = input('Open file: ', 's');
[dnaFile, message] = fopen(filename, 'r');
if dnaFile == -1
disp(message)
end
end
%if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
outputFile = [filename(1:end-4) '.rna']
else
outputFile = [filename '.rna']
end
[rnaFile, message] = fopen(outputFile, 'w');
[base, num] = fread(dnaFile, 1, 'char');
countG = 0;
countC = 0;
countA = 0;
countU = 0;
while num > 0
if base == 'C'
out = 'G';
countG = countG + 1;
elseif base == 'G'
out = 'C';
countC = countC + 1;
elseif base == 'T'
out = 'A';
countA = countA + 1;
elseif base == 'A'
out = 'U';
countU = countU + 1;
else
out = '_';
end
freqG = countG ./ length(dnaFile);
freqC = countC ./ length(dnaFile);
freqA = countA ./ length(dnaFile);
freqU = countU ./ length(dnaFile);
%fprintf('percentG: %.5f \n percentC: %.5f \n percentA: %.5f \n percentU: %.5f \n', freqG, freqC, freqA, freqU);
fwrite(rnaFile, out);
[base, num] = fread(dnaFile, 1, 'char');
end
fclose(dnaFile);
fclose(rnaFile);
dpb on 23 Apr 2017
"...I can't get it work with length(filename)..."
...
freqG = countG ./ length(dnaFile);
...
dnaFile is a file handle which is an integer stored in default datatype of double (see doc for fopen). Hence, length()--> 1
The length is either sum of all the characters you found if the percentage is to be based on the total string length including the extraneous or the sum of the four sequence counts if not. As shown, however, you can get these more efficiently than by explicit looping.

dpb on 22 Apr 2017
Edited: dpb on 23 Apr 2017
[rnaFile, message] = fopen(outputFile, 'w');
While not your problem, should check for the output file opening successfully as well as input...
[base, num] = fread(dnaFile, 1, 'char');
The above reads one character, but returns it as a double, not a character.
Use
to read whole file into a character array.
while num > 0
if base == 'C'
...
The above starts an infinite loop as num=1 and there's nothing inside the loop that ever changes num so it stays there forever...oh, although it will eventually error out on EOF on the read...scanned too quickly first.
You can loop, but Matlab is vectorized; may as well make use of it.
% rewrite the rules for convenience C -> G; G -> C; T -> A; A -> U
out=repmat('_',size(base)); % this is else case...we'll overwrite everything besides
out(base=='C')='G'; % use logical addressing to locate each and write output
out(base=='G')='C';
out(base=='T')='A';
out(base=='A')='U';
freqG=sum(out=='G');
freqC=sum(out=='C');
freqA=sum(out=='A');
freqU=sum(out=='U');
totalNum=sum([freqG freqC freqA freqU]);
freqG=sum(out=='G')/totalNum;
freqC=sum(out=='C')/totalNum;
freqA=sum(out=='A')/totalNum;
freqU=sum(out=='U')/totalNum;
Couldn't see an issue otomh so did a trial with just made up sequence...
>> dna=repmat(dna,1,10); % make longer sequence from them
>> dna=dna(randperm(length(dna))); % and then scramble 'em up
>> dna(randperm(length(dna),3))='_'; % a few other characters for spice
dna =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
The algorithm above starts here:
>> rna=repmat('_',size(base)); % replace the non-wanted letters
>> rna(dna=='C')='G';
>> rna(dna=='G')='C';
>> rna(dna=='T')='A';
>> rna(dna=='A')='U';
>> [dna;rna] % see what we got...
ans =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG
Looks like what problem statement asked for...compute frequency
Did this in "more Matlab-y" way; order is alphabetic to satisfy histc
>> RNA='ACGU'; % the letters to use as bin centers must be increasing order
>> freq=histc(rna,RNA)
freq =
9 9 10 9
>> freq=freq/sum(freq)
freq =
0.2432 0.2432 0.2703 0.2432
>>
>> [RNA.' num2str(freq.','%.4f')] % display results tabulated
ans =
A 0.2432
C 0.2432
G 0.2703
U 0.2432
>>
If order is important revert to previous or a "cute" way would be to use the categorical datatype--
>> rnac=categorical(cellstr(rna),{'G';'C';'A';'U';'_'});
>> summary(rnac)
G 10
C 9
A 9
U 9
_ 3
>>
And, the yet more Matlab-y way vectorizes the translation via lookup...
>> DNA=['C','G','T','A','_']; % base characters in sequence 1
>> RNA=['G','C','A','U','_']; % corresponding characters in sequence 2
>> RNA(arrayfun(@(c) find(c==DNA),dna)) % translate one to other...
ans =
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG
dpb on 24 Apr 2017
Edited: dpb on 24 Apr 2017
So, what was the problem?
"Enquiring minds" and all that... :)