| Products & Services | Solutions | Academia | Support | User Community | Company |
| Download Product Updates | | | Get Pricing | | | Trial Software |
| Documentation → Bioinformatics Toolbox |
| Contents | Index |
| Learn more about Bioinformatics Toolbox |
| On this page… |
|---|
Determining Nucleotide Content |
After sequencing a piece of DNA, one of the first tasks is to investigate the nucleotide content in the sequence. Starting with a DNA sequence, this example uses sequence statistics functions to determine mono-, di-, and trinucleotide content, and to locate open reading frames.
The following procedure illustrates how to use the MATLAB Help browser to search the Web for information. In this example you are interested in studying the human mitochondrial genome. While many genes that code for mitochondrial proteins are found in the cell nucleus, the mitochondrial has genes that code for proteins used to produce energy.
First research information about the human mitochondria and find the nucleotide sequence for the genome. Next, look at the nucleotide content for the entire sequence. And finally, determine open reading frames and extract specific gene sequences.
Use the MATLAB Help browser to explore the Web. In the MATLAB Command Window, type
web('http://www.ncbi.nlm.nih.gov/')
A separate browser window opens with the home page for the NCBI Web site.
Search the NCBI Web site for information. For example, to search for the human mitochondrion genome, from the Search list, select Genome, and in the for box, enter mitochondrion homo sapiens.

The NCBI Web search returns a list of links to relevant pages.

Select a result page. For example, click the link labeled NC_001807.
The MATLAB Help browser displays the NCBI page for the human mitochondrial genome.

The following procedure illustrates how to find a nucleotide sequence in a public database and read the sequence information into the MATLAB environment. Many public databases for nucleotide sequences are accessible from the Web. The MATLAB Command Window provides an integrated environment for bringing sequence information into the MATLAB environment.
The consensus sequence for the human mitochondrial genome has the GenBank accession number NC_001807. Since the whole GenBank entry is quite large and you might only be interested in the sequence, you can get just the sequence information.
Get sequence information from a Web database. For example, to retrieve sequence information for the human mitochondrial genome, in the MATLAB Command Window, type
mitochondria = getgenbank('NC_001807','SequenceOnly',true);
The getgenbank function retrieves the nucleotide sequence from the GenBank database and creates a character array.
mitochondria = gatcacaggtctatcaccctattaaccactcacgggagctctccatgcat ttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctg gagccggagcaccctatgtcgcagtatctgtctttgattcctgcctcatt ctattatttatcgcacctacgttcaatattacaggcgaacatacctacta aagt . . .
If you don't have a Web connection, you can load the data from a MAT file included with the Bioinformatics Toolbox software, using the command
load mitochondria
The load function loads the sequence mitochondria into the MATLAB Workspace.
Get information about the sequence. Type
whos mitochondria
Information about the size of the sequence displays in the MATLAB Command Window.
Name Size Bytes Class mitochondria 1x16571 33142 char array Grand total is 16571 elements using 33142 bytes
The following procedure illustrates how to determine the monomers and dimers, and then visualize data in graphs and bar plots. Sections of a DNA sequence with a high percent of A+T nucleotides usually indicate intergenic parts of the sequence, while low A+T and higher G+C nucleotide percentages indicate possible genes. Many times high CG dinucleotide content is located before a gene.
After you read a sequence into the MATLAB environment, you can use the sequence statistics functions to determine if your sequence has the characteristics of a protein-coding region. This procedure uses the human mitochondrial genome as an example. See Reading Sequence Information.
Plot monomer densities and combined monomer densities in a graph. In the MATLAB Command Window, type
ntdensity(mitochondria)
This graph shows that the genome is A+T rich.

Count the nucleotides using the function basecount.
basecount(mitochondria)
A list of nucleotide counts is shown for the 5'-3' strand.
ans =
A: 5113
C: 5192
G: 2180
T: 4086
Count the nucleotides in the reverse complement of a sequence using the function seqrcomplement.
basecount(seqrcomplement(mitochondria))
As expected, the nucleotide counts on the reverse complement strand are complementary to the 5'-3' strand.
ans =
A: 4086
C: 2180
G: 5192
T: 5113
Use the function basecount with the chart option to visualize the nucleotide distribution.
basecount(mitochondria,'chart','pie');
A pie chart displays in the MATLAB Figure window.

Count the dimers in a sequence and display the information in a bar chart.
dimercount(mitochondria,'chart','bar')

The following procedure illustrates how to look at codons for the six reading frames. Trinucleotides (codon) code for an amino acid, and there are 64 possible codons in a nucleotide sequence. Knowing the percent of codons in your sequence can be helpful when you are comparing with tables for expected codon usage.
After you read a sequence into the MATLAB environment, you can analyze the sequence for codon composition. This procedure uses the human mitochondria genome as an example. See Reading Sequence Information.
Count codons in a nucleotide sequence. In the MATLAB Command Window, type
codoncount(mitochondria)
The codon counts for the first reading frame displays.
AAA-172 AAC-157 AAG-67 AAT-123 ACA-153 ACC-163 ACG-42 ACT-130 AGA-58 AGC-90 AGG-50 AGT-43 ATA-132 ATC-103 ATG-57 ATT-96 CAA-166 CAC-167 CAG-68 CAT-135 CCA-146 CCC-215 CCG-50 CCT-182 CGA-33 CGC-60 CGG-18 CGT-20 CTA-187 CTC-126 CTG-52 CTT-98 GAA-68 GAC-62 GAG-47 GAT-39 GCA-67 GCC-87 GCG-23 GCT-61 GGA-53 GGC-61 GGG-23 GGT-25 GTA-61 GTC-49 GTG-26 GTT-36 TAA-136 TAC-127 TAG-82 TAT-107 TCA-143 TCC-126 TCG-37 TCT-103 TGA-64 TGC-35 TGG-27 TGT-25 TTA-115 TTC-113 TTG-37 TTT-99
Count the codons in all six reading frames and plot the results in heat maps.
for frame = 1:3
figure('color',[1 1 1])
subplot(2,1,1);
codoncount(mitochondria,'frame',frame,'figure',true);
title(sprintf('Codons for frame %d',frame));
subplot(2,1,2);
codoncount(mitochondria,'reverse',true,...
'frame',frame,...
'figure',true);
title(sprintf('Codons for reverse frame %d',frame));
end
Heat maps display all 64 codons in the 6 reading frames.



The following procedure illustrates how to locate the open reading frames using a specific genetic code. Determining the protein-coding sequence for a eukaryotic gene can be a difficult task because introns (noncoding sections) are mixed with exons. However, prokaryotic genes generally do not have introns and mRNA sequences have the introns removed. Identifying the start and stop codons for translation determines the protein-coding section, or open reading frame (ORF), in a sequence. Once you know the ORF for a gene or mRNA, you can translate a nucleotide sequence to its corresponding amino acid sequence.
After you read a sequence into the MATLAB environment, you can analyze the sequence for open reading frames. This procedure uses the human mitochondria genome as an example. See Reading Sequence Information.
Display open reading frames (ORFs) in a nucleotide sequence. In the MATLAB Command Window, type:
seqshoworfs(mitochondria);
If you compare this output to the genes shown on the NCBI page for NC_001807, there are fewer genes than expected. This is because vertebrate mitochondria use a genetic code slightly different from the standard genetic code. For a list of genetic codes, see the Genetic Code table in the aa2nt reference page in the Bioinformatics Toolbox Reference.
Display ORFs using the Vertebrate Mitochondrial code.
orfs= seqshoworfs(mitochondria,...
'GeneticCode','Vertebrate Mitochondrial',...
'alternativestart',true);
Notice that there are now two large ORFs on the first reading frame. One starts at position 4471 and the other starts at 5905. These correspond to the genes ND2 (NADH dehydrogenase subunit 2 [Homo sapiens] ) and COX1 (cytochrome c oxidase subunit I) genes.
Find the corresponding stop codon. The start and stop positions for ORFs have the same indices as the start positions in the fields Start and Stop.
ND2Start = 4471; StartIndex = find(orfs(1).Start == ND2Start) ND2Stop = orfs(1).Stop(StartIndex)
The stop position displays.
ND2Stop =
5512
Using the sequence indices for the start and stop of the gene, extract the subsequence from the sequence.
ND2Seq = mitochondria(ND2Start:ND2Stop); codoncount (ND2Seq)
The subsequence (protein-coding region) is stored in ND2Seq and displayed on the screen.
attaatcccctggcccaacccgtcatctactctaccatctttgcaggcac actcatcacagcgctaagctcgcactgattttttacctgagtaggcctag aaataaacatgctagcttttattccagttctaaccaaaaaaataaaccct cgttccacagaagctgccatcaagtatttcctcacgcaagcaaccgcatc cataatccttc . . .
Determine the codon distribution.
codoncount (ND2Seq)
The codon count shows a high amount of ACC, ATA, CTA, and ATC.
AAA-10 AAC-14 AAG-2 AAT-6 ACA-11 ACC-24 ACG-3 ACT-5 AGA-0 AGC-4 AGG-0 AGT-1 ATA-22 ATC-24 ATG-2 ATT-8 CAA-8 CAC-3 CAG-2 CAT-1 CCA-4 CCC-12 CCG-2 CCT-5 CGA-0 CGC-3 CGG-0 CGT-1 CTA-26 CTC-18 CTG-4 CTT-7 GAA-5 GAC-0 GAG-1 GAT-0 GCA-8 GCC-7 GCG-1 GCT-4 GGA-5 GGC-7 GGG-0 GGT-1 GTA-3 GTC-2 GTG-0 GTT-3 TAA-0 TAC-8 TAG-0 TAT-2 TCA-7 TCC-11 TCG-1 TCT-4 TGA-10 TGC-0 TGG-1 TGT-0 TTA-8 TTC-7 TTG-1 TTT-8
Look up the amino acids for codons ATA, CTA, ACC, and ATC.
aminolookup('code',nt2aa('ATA'))
aminolookup('code',nt2aa('CTA'))
aminolookup('code',nt2aa('ACC'))
aminolookup('code',nt2aa('ATC'))
The following displays:
Ile isoleucine Leu leucine Thr threonine Ile isoleucine
The following procedure illustrates how to extract the protein-coding sequence from a gene sequence and convert it to the amino acid sequence for the protein. Determining the relative amino acid composition of a protein will give you a characteristic profile for the protein. Often, this profile is enough information to identify a protein. Using the amino acid composition, atomic composition, and molecular weight, you can also search public databases for similar proteins.
After you locate an open reading frame (ORF) in a gene, you can convert it to an amino sequence and determine its amino acid composition. This procedure uses the human mitochondria genome as an example. See Open Reading Frames.
Convert a nucleotide sequence to an amino acid sequence. In this example, only the protein-coding sequence between the start and stop codons is converted.
ND2AASeq = nt2aa(ND2Seq,'geneticcode',...
'Vertebrate Mitochondrial');
The sequence is converted using the Vertebrate Mitochondrial genetic code. Because the property AlternativeStartCodons is set to 'true' by default, the first codon att is converted to M instead of I.
MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVLTKKMNP RSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMM AMAMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLN VSLLLTLSILSIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNM TILNLTIYIILTTTAFLLLNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLS LGGLPPLTGFLPKWAIIEEFTKNNSLIIPTIMATITLLNLYFYLRLIYST SITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTTLLLPISPFMLMIL
Compare your conversion with the published conversion in the GenPept database.
ND2protein = getgenpept('NP_536844','sequenceonly',true)
The getgenpept function retrieves the published conversion from the NCBI database and reads it into the MATLAB Workspace.
Count the amino acids in the protein sequence.
aacount(ND2AASeq, 'chart','bar')
A bar graph displays. Notice the high content for leucine, threonine and isoleucine, and also notice the lack of cysteine and aspartic acid.

Determine the atomic composition and molecular weight of the protein.
atomiccomp(ND2AASeq) molweight (ND2AASeq)
The following displays in the MATLAB Workspace:
ans =
C: 1818
H: 3574
N: 420
O: 817
S: 25ans = 3.8960e+004
If this sequence was unknown, you could use this information to identify the protein by comparing it with the atomic composition of other proteins in a database.
![]() | Sequence Analysis | Example: Sequence Alignment | ![]() |

Includes the most popular MATLAB recorded presentations with Q&A sessions led by MATLAB experts.
| © 1984-2009- The MathWorks, Inc. - Site Help - Patents - Trademarks - Privacy Policy - Preventing Piracy - RSS |