Fractal visualisation of genomic base counts
Sam Roberts
Contents
Introduction
Genomic researchers are often interested in counts of the nucleotide base in a genetic sequence. The Bioinformatics Toolbox contains functions for counting sequences of length one and two, ie individual bases and dimers.
We demonstrate these using a mitochondrial DNA sequence.
load mitochondria
mitochondria(1:10)
basecount(mitochondria)
dimercount(mitochondria)
ans =
GATCACAGGT
ans =
A: 5113
C: 5192
G: 2180
T: 4086
ans =
AA: 1594
AC: 1495
AG: 801
AT: 1223
CA: 1536
CC: 1779
CG: 439
CT: 1438
GA: 615
GC: 716
GG: 427
GT: 421
TA: 1368
TC: 1202
TG: 512
TT: 1004
Visualising the counts
These functions also have options for visualising the counts.
figure subplot(1,2,1) basecount(mitochondria,'Chart','pie'); subplot(1,2,2) dimercount(mitochondria,'Chart','bar');
Longer sequences
If we want to count longer sequences, we can use nmercount:
counts = nmercount(mitochondria,3); for i = 1:10 disp([counts{i,1},': ',num2str(counts{i,2})]); end
CCC: 632 CCT: 542 AAA: 522 CTA: 522 ACC: 518 AAC: 491 CAA: 464 CCA: 463 CAC: 454 ACA: 446
But there is currently no way of visualising these counts. We turn to some work by Carr [1] that advocates using a fractal visualisation, which is extensible (in theory) to arbitrarily long sequences. To visualise counts of length n, we construct a Sierpinski gasket of depth n.
What is a Sierpinski gasket?
A Sierpinski gasket is a simple fractal. A gasket of depth 0 is simply a tetrahedron. To construct a gasket of depth 1, the tetrahedron is replaced by four smaller tetrahedra, each of half the side length of the original, placed at each corner of the original. This leaves a hole in the center. To construct deeper gaskets, keep replacing each tetrahedron with four smaller ones.
How does this relate to genetic sequences?
There are four nucleotide bases: A, C, G and T - conveniently, the same number as the number of vertices of a tetrahedron. Base sequences of length n are mapped onto subgaskets as follows: The first base in the sequence maps onto one of the four tetrahedra of depth 1. This contains four sub-subgaskets. The second base in the sequence maps onto one of these sub-subgaskets. And so on. We color the gasket by the count of that sequence, and set all the gaskets to have a transparency of 0.1.
For length 3:
nmervis(mitochondria,3)
And for length 4:
nmervis(mitochondria,4)
Sequences of length 5 and 6 are fine; 7 might be a task for DCT.
This is nicely rotatable. The Sierpinski gasket has a nice property that when it is viewed from above, all the tetrahedra line up to form a square. You can achieve this by manually rotating the previous plot, or by calling as follows.
nmervis(mitochondria,3,'View','flat')
For longer sequences, it is useful to also scale the transparency of the tetrahedra by the count, to focus sttention on the common sequences.
nmervis(mitochondria,5,'ScaleAlpha','on')
Reference
Carr DB, 2002: 2D and 3D coordinates for m-mers and dynamic graphics. Computing Science and Statistics: Proceedings of the 34th Symposium on the Interface.
