Skip to Main Content Skip to Search
Home |   Select Country  Choose Country  |  Contact Us  |  Cart Store 
Create Account | Log In
Products & Services Industries Academia Support User Community Company
spacer spacer spacer spacer spacer spacer

 

Newsletters - MATLAB News & Notes

Studying Web-Based DNA Sequence Data Directly from MATLAB

by Rob Henson

March 2003 marks the 50th anniversary of perhaps the most important scientific discovery of the 20th Century: The discovery of the structure of DNA by James Watson and Francis Crick.

Since then, our understanding of life at the molecular level has
progressed spectacularly and we now have the full script of the human genome. In every human cell, there is DNA comprised of about 3 billion (3x109) nucleotides, or bases, called adenine, cytosine, guanine, and thymine (A,C,G, and T.) However, only some of these pieces of data are actually used by our cells; there are long stretches of genetic sequence that do not appear to be employed when genetic data is translated into a working protein. Although we have made much progress in understanding the human genome, we are still unable to predict exactly where genes start and the so-called "junk"DNA stops. This is an active area of research by computational biologists who use machine-learning techniques to try to understand the grammar of DNA.

The National Center for Biotechnology Information (NCBI) in the United States provides free access to the latest sequence information. The NCBI Web site (www.ncbi.nlm.nih.gov) is a fascinating place to learn about developments in genomics and computational biology.

The new Web connectivity functions and regular expression tools in MATLAB 6.5 make it easy to access the Web-based sequence data directly from within MATLAB. In this example, we will look at part of the sequence for the human X chromosome around the red opsin gene, which is responsible for red cone pigment in the eye. A defect in this gene leads to common red/green color blindness. First we build up the URL string to access the NCBI site. The string is a little messy but we can customize it by inserting the ID number of the gene.

ID = 1122283;
urlRoot = ['http://www.ncbi.nlm.nih.gov/entrez/',...
'query.fcgi?cmd=Retrieve&db=nucleotide',...
'&list_uids=%d&dopt=GenBank'];
% Use sprintf to insert the ID number into
% the URL string
url = sprintf(urlRoot,ID);

The MATLAB Web command opens the Help browser at a specified URL:

web(url)
figure 1
Trace data from a sequencing experiment. The different colors show the intensity of four fluorescent markers, corresponding to the four nucleotides, A, C, G, and T, along a sample of DNA. Click on image to see enlarged view.

You will see that a couple of hundred lines at the bottom of the page gives the sequence. Of this, only a small part is the actual gene. About half way down the file is a line that starts with CDS. This stands for CoDing Sequence and indicates the elements of the sequence that belong to the gene.

CDS join(5584..5695,11954..12250,14239..14407,
15875..16040,17595..17834,20117..20227)

Let's pull the text into MATLAB using urlread: text = urlread(url); Now we use the new regular expression functions, regexp and regexprep, to pull out the interesting data. Regular expressions are extremely powerful for searching a string but they can be a little arcane. Type doc regexp in MATLAB for a detailed introduction to using them.

First we look for the tag ORIGIN to indicate where the sequence begins and // to indicate where it ends. We then use regexprep to remove the unwanted characters.

% Use regexprep to extract everything after
% ORIGIN up until the terminating //.
seq = regexprep(text,'.*?ORIGIN(.*?//).*', '$1',... 'tokenize');
% Strip out numerals, whitespace and other
% unwanted characters
seq = regexprep(seq,'[0-9|\W]','');

seq is now a MATLAB char array of the whole sequence.

To look just at the sequence of the gene, we can use regexp to get the indices of the gene's elements from the CDS line.

geneChunk = regexprep(text,'.*?join\((.*?)\).*',...
'$1', 'tokenize')
geneChunk = regexprep(geneChunk,'\s','');
geneIndices = str2num(strrep(geneChunk,'..',':'));
gene = seq(geneIndices)

For reasons related to the chemical properties of cytosine and guanine, it is very uncommon for Cs to be followed by Gs in vertebrate DNA sequences. Such a combination is called a CpG dinucleotide. These are mostly found around the start of genes.

We can easily create a matrix of the dinucleotide frequencies in the following sequence: numseq = (seq=='a') + 2*(seq=='c') + ...
3*(seq=='g') + 4*(seq =='t');
% Use a trick with sparse matrices to
% count the number of each transition.
% See help sparse for more details.
transitions = full(sparse(numseq(1:end-1),...
numseq(2:end),1));
proportions = transitions/(length(seq)-1)

The (2,3) element of proportions (0.0226) corresponds to the proportion of CpG dinucleotides in the sequence. This is about a third of what would be expected if the dinucleotides were randomly distributed. If we repeat this calculation of the dinucleotide frequencies for just the gene numgene = numseq(geneIndices);
geneTransitions = full(sparse(numgene (1:end-1),...
numgene (2:end),1));
genelen = length(numgene)- 1;
geneProportions = geneTransitions /(genelen)
we see that the proportion of CpG dinucleotides (0.0393) is noticeably higher in the gene than in the general DNA.

Finally, if we look at the region of 70 nucleotides just before the gene starts

upstream = numseq(gene(1)-70:gene(1)-1);
utransitions = full(sparse(upstream(1:end-1),...
upstream(2:end),1))
utransitions(2,3)/length(upstream)

We see a much higher (0.0857) CpG concentration. Such areas of DNA with far higher than normal CpG concentration are known as CpG islands. These are frequently found close to the start of genes in a distinctly nonrandom distribution. A CpG island in a newly sequenced fragment of DNA is one of the strongest indicators that the fragment may contain the start of a gene. State-of-the-art tools, such as GENSCAN (http://genes.mit.edu/GENSCAN.html), use CpG islands and a myriad of other technologies to find genes in novel sequences with almost 90% accuracy.

figure 2
Red and Green DNA. The M-files used to create this image can be downloaded from MATLAB Central.


Shortly after this article was published, the GenBank database at the National Center for Biotechnology Information (NCBI) was updated to include the latest information from the Human Genome Project. As part of this update, the page referenced in this article was modified. As a consequence of this, the code in the article no longer works correctly. This problem can be overcome by changing the line ID = 1122283; to ID = 222856; Now, instead of accessing the information for the human rhodopsin gene, this new ID references the chicken (Gallus gallus) rhodopsin gene. The human rhodopsin gene and the chicken rhodopsin gene are very similar but not identical so the numerical values returned by the code in the article will be slightly different.



For more information visit:

Next Article

Contact sales
E-mail this page
Print this page
Subscribe to newsletters