image thumbnail
from Fractal visualisation of genomic sequences by Sam Roberts
Visualise n-mer nucleotide sequences using a fractal visualisation based on a Sierpinski gasket.

Fractal visualisation of genomic base counts

Fractal visualisation of genomic base counts

Sam Roberts

Contents

Introduction

Genomic researchers are often interested in counts of the nucleotide base in a genetic sequence. The Bioinformatics Toolbox contains functions for counting sequences of length one and two, ie individual bases and dimers.

We demonstrate these using a mitochondrial DNA sequence.

load mitochondria

mitochondria(1:10)

basecount(mitochondria)

dimercount(mitochondria)
ans =

GATCACAGGT


ans = 

    A: 5113
    C: 5192
    G: 2180
    T: 4086


ans = 

    AA: 1594
    AC: 1495
    AG: 801
    AT: 1223
    CA: 1536
    CC: 1779
    CG: 439
    CT: 1438
    GA: 615
    GC: 716
    GG: 427
    GT: 421
    TA: 1368
    TC: 1202
    TG: 512
    TT: 1004

Visualising the counts

These functions also have options for visualising the counts.

figure
subplot(1,2,1)
basecount(mitochondria,'Chart','pie');

subplot(1,2,2)
dimercount(mitochondria,'Chart','bar');

Longer sequences

If we want to count longer sequences, we can use nmercount:

counts = nmercount(mitochondria,3);
for i = 1:10
    disp([counts{i,1},': ',num2str(counts{i,2})]);
end
CCC: 632
CCT: 542
AAA: 522
CTA: 522
ACC: 518
AAC: 491
CAA: 464
CCA: 463
CAC: 454
ACA: 446

But there is currently no way of visualising these counts. We turn to some work by Carr [1] that advocates using a fractal visualisation, which is extensible (in theory) to arbitrarily long sequences. To visualise counts of length n, we construct a Sierpinski gasket of depth n.

What is a Sierpinski gasket?

A Sierpinski gasket is a simple fractal. A gasket of depth 0 is simply a tetrahedron. To construct a gasket of depth 1, the tetrahedron is replaced by four smaller tetrahedra, each of half the side length of the original, placed at each corner of the original. This leaves a hole in the center. To construct deeper gaskets, keep replacing each tetrahedron with four smaller ones.

How does this relate to genetic sequences?

There are four nucleotide bases: A, C, G and T - conveniently, the same number as the number of vertices of a tetrahedron. Base sequences of length n are mapped onto subgaskets as follows: The first base in the sequence maps onto one of the four tetrahedra of depth 1. This contains four sub-subgaskets. The second base in the sequence maps onto one of these sub-subgaskets. And so on. We color the gasket by the count of that sequence, and set all the gaskets to have a transparency of 0.1.

For length 3:

nmervis(mitochondria,3)

And for length 4:

nmervis(mitochondria,4)

Sequences of length 5 and 6 are fine; 7 might be a task for DCT.

This is nicely rotatable. The Sierpinski gasket has a nice property that when it is viewed from above, all the tetrahedra line up to form a square. You can achieve this by manually rotating the previous plot, or by calling as follows.

nmervis(mitochondria,3,'View','flat')

For longer sequences, it is useful to also scale the transparency of the tetrahedra by the count, to focus sttention on the common sequences.

nmervis(mitochondria,5,'ScaleAlpha','on')

Reference

Carr DB, 2002: 2D and 3D coordinates for m-mers and dynamic graphics. Computing Science and Statistics: Proceedings of the 34th Symposium on the Interface.

Contact us at files@mathworks.com