Skip to Main Content Skip to Search
Home |   Select Country  Choose Country  |  Contact Us  |  Cart Store 
Create Account | Log In
Products & Services Solutions Academia Support User Community Company
spacer spacer spacer spacer spacer spacer

 

Bioinformatics Toolbox 3.4

Calculating and Visualizing Sequence Statistics

This demonstration shows basic sequence manipulation techniques and computes some useful sequence statistics. It also illustrates how to look for coding regions (such as proteins) and pursue further analysis of them.

Contents

The Human Mitochondrial Genome

In this demonstration you will explore the DNA sequence of the human mitochondria. Mitochondria are structures, called organelles, that are found in the cytoplasm of the cell in hundreds to thousands for each cell. Mitochondria are generally the major energy production center in eukaryotes, they help to degrade fats and sugars. The Genome repository at the NCBI contains more interesting information about the human mitochondrial genome.

The consensus sequence of the human mitochondria genome has accession number NC_001807. The whole GenBank® entry is large and this example only uses the nucleotide sequence, so you can use the getgenbank function with the 'SequenceOnly' flag to read just the sequence information into the MATLAB® workspace.

mitochondria = getgenbank('NC_001807','SequenceOnly',true);

If you don't have a live web connection, you can load the data from a MAT-file using the command

% load mitochondria     % <== Uncomment this if no internet connection

The MATLAB whos command gives information about the size of the sequence.

whos mitochondria
  Name              Size               Bytes  Class    Attributes

  mitochondria      1x16571            33142  char

You can look at the composition of the nucleotides with the ntdensity function.

ntdensity(mitochondria)

This shows that the mitochondria genome is A-T rich. The GC-content is sometimes used to classify organisms in taxonomy, it may vary between different species from ~30% up to ~70%. Measuring GC content is also useful for identifying genes and for estimating the annealing temperature of DNA sequence.

Calculating Sequence Statistics

Now, you will use some of the sequence statistics functions in the Bioinformatics Toolbox™ to look at various properties of the human mitochondrial genome. You can count the number of bases of the whole sequence using the basecount function.

bases = basecount(mitochondria)
bases =

    A: 5113
    C: 5192
    G: 2180
    T: 4086

These are on the 5'-3' strand. You can look at the reverse complement case using the seqrcomplement function.

compBases = basecount(seqrcomplement(mitochondria))
compBases =

    A: 4086
    C: 2180
    G: 5192
    T: 5113

As expected, the base counts on the reverse complement strand are complementary to the counts on the 5'-3' strand.

You can use the chart option to basecount to display a pie chart of the distribution of the bases.

figure
basecount(mitochondria,'chart','pie');
title('Distribution of Nucleotide Bases for Human Mitochondrial Genome');

Now look at the dimers in the sequence and display the information in a bar chart using dimercount.

figure
dimers = dimercount(mitochondria,'chart','bar')
title('Mitochondrial Genome Dimer Histogram');
dimers =

    AA: 1594
    AC: 1495
    AG: 801
    AT: 1223
    CA: 1536
    CC: 1779
    CG: 439
    CT: 1438
    GA: 615
    GC: 716
    GG: 427
    GT: 421
    TA: 1368
    TC: 1202
    TG: 512
    TT: 1004

You can look at codons using codoncount. The function dimercount simply counts all adjacent nucleotides; however codoncount counts the codons on a particular reading frame. With no options, the function shows the codon counts on the first reading frame.

mitochondriaCodons = codoncount(mitochondria)
mitochondriaCodons =

    AAA: 172
    AAC: 157
    AAG: 67
    AAT: 123
    ACA: 153
    ACC: 163
    ACG: 42
    ACT: 130
    AGA: 58
    AGC: 90
    AGG: 50
    AGT: 43
    ATA: 132
    ATC: 103
    ATG: 57
    ATT: 96
    CAA: 166
    CAC: 167
    CAG: 68
    CAT: 135
    CCA: 146
    CCC: 215
    CCG: 50
    CCT: 182
    CGA: 33
    CGC: 60
    CGG: 18
    CGT: 20
    CTA: 187
    CTC: 126
    CTG: 52
    CTT: 98
    GAA: 68
    GAC: 62
    GAG: 47
    GAT: 39
    GCA: 67
    GCC: 87
    GCG: 23
    GCT: 61
    GGA: 53
    GGC: 61
    GGG: 23
    GGT: 25
    GTA: 61
    GTC: 49
    GTG: 26
    GTT: 36
    TAA: 136
    TAC: 127
    TAG: 82
    TAT: 107
    TCA: 143
    TCC: 126
    TCG: 37
    TCT: 103
    TGA: 64
    TGC: 35
    TGG: 27
    TGT: 25
    TTA: 115
    TTC: 113
    TTG: 37
    TTT: 99

Using a loop you can also look at all the other reading frames:

codons = cell(2,3);
for frame = 1:3
    figure
    subplot(2,1,1);
    codons{1,frame} = codoncount(mitochondria,'frame',frame,'figure',tr
ue);
    title(sprintf('Codons for Frame %d of Human Mitochondria Genome',frame))
;
    subplot(2,1,2);
    codons{2,frame} = codoncount(mitochondria,'reverse',true,'frame',fr
ame,'figure',true);
    title(sprintf('Codons for Reverse Frame %d of Human Mitochondria Genome'
,frame));
end

Exploring the Open Reading Frames (ORFs)

In a nucleotide sequence an obvious thing to look for is if there are any open reading frames. An ORF is any sequence of DNA or RNA that can be potentially translated into a protein. The function seqshoworfs can be used to visualize ORFs in a sequence.

Note: In the HTML tutorial only the first page of the output is shown, however when running the demo you will be able to inspect the complete mitochondrial genome using the scrollbar on the figure.

seqshoworfs(mitochondria);

If you compare this output to the genes shown on the NCBI page there seem to be slightly fewer ORFs, and hence fewer genes, than expected. Vertebrate mitochondria do not use the Standard genetic code so some codons have different meaning in mitochondrial genomes. For more information about using different genetic codes in MATLAB see the help for the function geneticcode.

help geneticcode
 GENETICCODE returns a structure containing mappings for the genetic code.

    MAP = GENETICCODE returns a structure containing mapping for the Standar
d
    genetic code.

    GENETICCODE(ID) returns a structure of the mapping for alternate genetic
    codes, where ID is either the transl_table ID from the NCBI Genetics web
    page (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c)
    or one of the following supported names. NAME can be truncated to the
    first two characters of the name.

    ID  Name

      1    Standard
      2    Vertebrate Mitochondrial
      3    Yeast Mitochondrial
      4    Mold, Protozoan, and Coelenterate Mitochondrial and Mycoplasma/Sp
iroplasma
      5    Invertebrate Mitochondrial
      6    Ciliate, Dasycladacean and Hexamita Nuclear
      9    Echinoderm Mitochondrial
      10    Euplotid Nuclear
      11    Bacterial and Plant Plastid
      12    Alternative Yeast Nuclear
      13    Ascidian Mitochondrial
      14    Flatworm Mitochondrial
      15    Blepharisma Nuclear
      16    Chlorophycean Mitochondrial
      21    Trematode Mitochondrial
      22    Scenedesmus Obliquus Mitochondrial
      23    Thraustochytrium Mitochondrial

    Examples:

        moldcode = geneticcode(4)
        wormcode = geneticcode('Flatworm Mitochondrial')

    See also AA2NT, AMINOLOOKUP, BASELOOKUP, CODONBIAS, DNDS, DNDSML,
    NT2AA, REVGENETICCODE, SEQSHOWORFS, SEQSTATSDEMO, SEQTOOL.

    Reference page in Help browser
       doc geneticcode

The 'GeneticCode' option to the seqshoworfs function allows you to look at the ORFs again but this time with the vertebrate mitochondrial genetic code. Notice that there are now two much larger ORFs on the first reading frame: One starting at position 4471 and the other starting at 5905. These correspond to the ND2 (NADH dehydrogenase subunit 2) and COX1 (cytochrome c oxidase subunit I) genes.

orfs = seqshoworfs(mitochondria,'GeneticCode','Vertebrate Mitochondrial',...
        'AlternativeStartCodons',true)
orfs =

1x3 struct array with fields:
    Start
    Stop

Extracting and Analyzing the ND2 Protein

The ORF of interest starts at position 4471, the following commands can be used to find the corresponding stop codon:

ND2Start = 4471;
startIndex = find(orfs(1).Start == ND2Start)
ND2Stop = orfs(1).Stop(startIndex)
startIndex =

    24


ND2Stop =

        5512

Once the positions are known, MATLAB indexing can be used to extract the region of interest.

ND2Seq = mitochondria(ND2Start:ND2Stop);

If you look at the codoncount for this gene we see a lot of CTA and ATC codons.

codonND2 = codoncount(ND2Seq)
codonND2 =

    AAA: 10
    AAC: 14
    AAG: 2
    AAT: 6
    ACA: 11
    ACC: 24
    ACG: 3
    ACT: 5
    AGA: 0
    AGC: 4
    AGG: 0
    AGT: 1
    ATA: 22
    ATC: 24
    ATG: 2
    ATT: 8
    CAA: 8
    CAC: 3
    CAG: 2
    CAT: 1
    CCA: 4
    CCC: 12
    CCG: 2
    CCT: 5
    CGA: 0
    CGC: 3
    CGG: 0
    CGT: 1
    CTA: 26
    CTC: 18
    CTG: 4
    CTT: 7
    GAA: 5
    GAC: 0
    GAG: 1
    GAT: 0
    GCA: 8
    GCC: 7
    GCG: 1
    GCT: 4
    GGA: 5
    GGC: 7
    GGG: 0
    GGT: 1
    GTA: 3
    GTC: 2
    GTG: 0
    GTT: 3
    TAA: 0
    TAC: 8
    TAG: 0
    TAT: 2
    TCA: 7
    TCC: 11
    TCG: 1
    TCT: 4
    TGA: 10
    TGC: 0
    TGG: 1
    TGT: 0
    TTA: 8
    TTC: 7
    TTG: 1
    TTT: 8

For those of you who have not memorized the genetic code you can easily check what amino acids these codons get translated into using the nt2aa and aminolookup functions.

CTA_aa = aminolookup('code',nt2aa('CTA'))
ATC_aa = aminolookup('code',nt2aa('ATC'))
CTA_aa =

Leu    Leucine



ATC_aa =

Ile    Isoleucine


The nt2aa function converts the nucleotide sequence to the corresponding amino acid sequence. Again the 'GeneticCode' option must be used to specify the vertebrate mitochondrial genetic code.

ND2 = nt2aa(ND2Seq,'GeneticCode','Vertebrate Mitochondria');

You can get a more complete picture of the amino acid content with aacount.

figure
ND2aaCount = aacount(ND2,'chart','bar')
title('Histogram of Amino Acid Count for Human Mitochondrial Genome');
ND2aaCount =

    A: 20
    R: 4
    N: 20
    D: 0
    C: 0
    Q: 10
    E: 6
    G: 13
    H: 4
    I: 31
    L: 64
    K: 12
    M: 25
    F: 15
    P: 23
    S: 28
    T: 43
    W: 11
    Y: 10
    V: 8

Notice the high leucine, threonine and isoleucine content and also the lack of cysteine or aspartic acid.

You can use the atomiccomp and molweight functions to find out more about the ND2 protein.

ND2AtomicComp = atomiccomp(ND2)
ND2MolWeight = molweight(ND2)
ND2AtomicComp =

    C: 1818
    H: 2882
    N: 420
    O: 471
    S: 25


ND2MolWeight =

  3.8960e+004

For further investigation of the properties of the ND2 protein, try using proteinplot. This is a graphical user interface (GUI) that allows you to easily create plots of various properties, such as hydrophobicity, of a protein sequence. Click on the "Edit" menu to create new properties, to modify existing property values, or, to adjust the smoothing parameters. Click on the "Help" menu in the GUI for more information on how to use the tool.

proteinplot(ND2)

You can also create plots of various properties of the sequence using proteinpropplot.

proteinpropplot(ND2,'PropertyTitle','Parallel beta strand')

Inspecting Other Annotated Features

You can also look at all the features that have been annotated to the human mitochondrial genome. Download the complete GenBank entry 'NC_001807' and explore its features with the featuresparse function.

mitochondria_gbk = getgenbank('NC_001807');
features = featuresparse(mitochondria_gbk)
genes = {features.gene.gene}
ND2annotations = features.gene(10) % ND2 is in the 10th position
features =

    source: [1x1 struct]
    D_loop: [1x1 struct]
      gene: [1x37 struct]
      tRNA: [1x22 struct]
      rRNA: [1x2 struct]
       STS: [1x26 struct]
       CDS: [1x13 struct]


genes =

  Columns 1 through 7

    'TRNF'    'RNR1'    'TRNV'    'RNR2'    'TRNL1'    'ND1'    'TRNI'

  Columns 8 through 14

    'TRNQ'    'TRNM'    'ND2'    'TRNW'    'TRNA'    'TRNN'    'TRNC'

  Columns 15 through 21

    'TRNY'    'COX1'    'TRNS1'    'TRND'    'COX2'    'TRNK'    'ATP8'

  Columns 22 through 28

    'ATP6'    'COX3'    'TRNG'    'ND3'    'TRNR'    'ND4L'    'ND4'

  Columns 29 through 35

    'TRNH'    'TRNS2'    'TRNL2'    'ND5'    'ND6'    'TRNE'    'CYTB'

  Columns 36 through 37

    'TRNT'    'TRNP'


ND2annotations =

        Location: '4471..5512'
         Indices: [4471 5512]
            gene: 'ND2'
    gene_synonym: 'MTND2'
    nomenclature: [1x99 char]
         db_xref: {'GeneID:4536'  'HGNC:7456'}
            note: []

Create a map indicating all the features found in this GenBank entry using the featuresmap function.

[h,l] = featuresmap(mitochondria_gbk,{'CDS','tRNA','rRNA','D_loop'},...
                    [2 1 2 2 2],'Fontsize',9);
legend(h,l,'interpreter','none');
title('Homo sapiens mitochondrion, complete genome')

Suggest an enhancement for this demo.

Contact sales
Free technical kit
Trial software
E-mail this page

Get Pricing and
Licensing Options