Skip to Main Content Skip to Search
Accelerating the pace of engineering and science

 

MATLAB Digest - January 2004

Using the Bioinformatics Toolbox to Analyze Data from the Human Genome Project

by Rob Henson

The new Bioinformatics Toolbox provides access to genomic and proteomic data formats, analysis techniques, and specialized visualizations. It is designed for implementing genomic and proteomic sequence and microarray analysis techniques. The toolbox enables you to access many standard file formats for biological data, Web-based databases, and other online data sources. It provides a number of routines for reading gene and protein data from standard file formats, such as FASTA, PDB, and SCF and also tools to read standard microarray data, such as Affymetrix, GenePix® format files. You can also directly interface with major Web-based databases, such as GenBank, EMBL, PIR, and PDB.

This article shows an example of how to use the Bioinformatics Toolbox to extract sequences from GenBank, and then align the sequences using global and local alignment algorithms.

Viewing Genome Data

The National Center for Biotechnology Information (NCBI) in the United States, the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (BBDJ) provide free access to the latest genome sequence information, which is updated daily as new sequences are submitted from around the world. The NCBI Web site is an interesting place to learn about the human genome and developments in genomics and computational biology. The DNA sequence repository, GenBank, contains over 30,000,000 sequences from many thousands of different organisms.

The MATLAB Help Browser can be utilized to explore this information; for example, you can view the NCBI Web site from within MATLAB.

web('http://www.ncbi.nlm.nih.gov/')

One of the many useful parts of the NCBI Web site is the Genes and Diseases section, which provides a comprehensive introduction to medical genetics.

web(['http://www.ncbi.nlm.nih.gov/books/bv.fcgi? '...
'call=bv.View..ShowSection&rid=gnd'])

In this demonstration you will be looking at genes associated with Tay-Sachs disease.

web(['http://www.ncbi.nlm.nih.gov/books/bv.fcgi?',...
'call=bv.View..ShowSection&rid=gnd.section.220'])

Tay-Sachs is an autosomal recessive disease caused by mutations in both alleles of a gene (HEXA, which codes for the alpha subunit of hexosaminidase A) on chromosome 15. The NCBI reference sequence for HEXA has accession number NM_000520.

web(['http://www.ncbi.nlm.nih.gov/'...
'entrez/viewer.fcgi?val=13128865&db=Nucleotide&dopt=GenBank'])


Figure 1: The GenBank entry for the HEXA gene. Click on image to see enlarged view.

Accessing NCBI data from the MATLAB workspace

You can use the getgenbank function to read the sequence information directly from GenBank into MATLAB. This creates a structure containing all of the information in the GenBank record.

humanHEXA = getgenbank('NM_000520')
humanHEXA =    
     
LocusName:   'NM_000520'
LocusSequenceLength:   '2255'
LocusNumberofStrands:   ' '
LocusTopology:   'linear'
LocusMoleculeType:   'mRNA'
LocusGenBankDivision:   'PRI'
LocusModificationDate:   '04-OCT-2003'
Definition:   [1x63 char]
Accession:   'NM_000520'
Version:   'NM_000520.2'
GI:   '13128865'
Keywords:   []
Segment:   []
Source:   'Homo sapiens (human)'
SourceOrganism:   [3x65 char]
Reference:   {1x 8 cell}
Comment:   [15x67 char]
Features:   [104x74 char]
Sequence:   [1x2255 char]

If you search in the mouse genome for the mouse hexosaminidase A gene, you will find that the GenBank accession number is AK080777.

mouseHEXA = getgenbank('AK080777')
mouseHEXA =    
     
LocusName:   'AK080777'
LocusSequenceLength:   '1839'
LocusNumberofStrands:   ' '
LocusTopology:   'linear'
LocusMoleculeType:   'mRNA'
LocusGenBankDivision:   'HTC'
LocusModificationDate:   '20-SEP-2003'
Definition:   [1x50 char]
Accession:   'AK080777'
Version:   'AK080777.1'
GI:   '26348756'
Keywords:   'HTC; CAP trapper.'
Segment:   []
Source:   'Mus musculus (house mouse)'
SourceOrganism:   [3x66 char]
Reference:   {1x6 cell}
Comment:   [12x66 char]
Features:   [33x74 char]
Sequence:   [1x1839 char]

Exploring the Open Reading Frames (ORFs)

Open reading frames (ORFs) are the regions in the genome that are potentially transcribed and converted to proteins. You can use the function seqshoworfs to look for ORFs in the sequence for the human HEXA gene. The biological process that transcribes DNA reads the sequence in chunks of three nucleotides at a time. These triplets are called codons. ORFs can start at any place in the sequence so there are three possible reading frames for the codons. In this case the longest ORF is on the third reading frame. The output value in the variable 'humanORFs' is a structure giving the position of the start and stop codons for all the ORFs on each reading frame.

humanORFs = seqshoworfs(humanHEXA.Sequence,'Frames',3);

Figure 2: Open Reading Frames for the mouse gene. Click on image to see enlarged view.

Similarly, we can look at the ORFs in the mouse HEXA gene. In this case the ORF is on frame 1.

mouseORFs = seqshoworfs(mouseHEXA.Sequence)

Figure 3: Open Reading Frames for the human gene. Click on image to see enlarged view.

Comparing the human and mouse sequences

Just by looking at the letters in these sequences, it is very hard to tell whether or not these two sequences are similar. We know that the genes they are associated with have similar function in two mammals, but it is of great interest to biologists to be able to quantify this similarity and to find out if there are particular regions in sequences that are identical.

The first thing that we will do is to use a simple graphical method to look for similarities between these sequences. You could look at the alignment between the nucleotide sequences but it is generally more instructive to look at the alignment between the associated protein sequences. The first thing that you need to do is to convert the nucleotide sequences into the corresponding amino acid sequences. Use the nt2aa function to do this. This is an in silico version of the transcription process that happens inside cell nuclei where DNA is transcribed to RNA, which is then used as the template for the manufacture of proteins from chains of amino acids.

mouseProtein = nt2aa(mouseHEXA.Sequence)  

mouseProtein =
AAGRGAGRWAMAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYTLYPNNFQFRYHVSSAAQAGC
VVLDEAFRRYRNLLFGSGSWPRPSFSNKQQTLGKNILVVSVVTAECNEFPNLESVENYTLTINDDQCL
LASETVWGALRGLETFSQLVWKSAEGTFFINKTKIKDFPRFPHRGVLLDTSRHYLPLSSILDTLDVMA
YNKFNVFHWHLVDDSSFPYESFTFPELTRKGSFNPVTHIYTAQDVKEVIEYARLRGIRVLAEFDTPGH
TLSWGPGAPGLLTPCYSGSHLSGTFGPVNPSLNSTYDFMSTLFLEISSVFPDFYLHLGGDEVDFTCWK
SNPNIQAFMKKKGFTDFKQLESFYIQTLLDIVSDYDKGYVVWQEVFDNKVKVRPDTIIQVWREEMPVE
YMLEMQDITRAGFRALLSAPWYLNRVKYGPDWKDMYKVEPLAFHGTPEQKALVIGGEACMWGEYVDST
NLVPRLWPRAGAVAERLWSSNLTTNIDFAFKRLSHFRCELVRRGIQAQPISVGCCEQEFEQT*ATSAE
HPGGCCPLSQLR*APRRVLALREQVPGQG*SFTASRPGESTPCPCAPVTTEKEAGAGTGVQ*RSMWHF

Remember that the human HEXA gene was on the third reading frame, so you need to tell the function which frame to use.

humanProtein = nt2aa(humanHEXA.Sequence, 'Frame' ,3);

One of the easiest ways to look for similarity between sequences is with a dot plot. A dot plot is made by creating a sparse matrix of size m by n where m is the length of one sequence and n the length of the other. A dot, or 1 is placed in the (i,j) element of the matrix if the i th amino acid of the first sequence is the same as the j th amino acid of the second sequence. A spy plot of the matrix is generated and any areas of strong similarity show up as diagonal lines.

figure
seqdotplot(humanProtein,mouseProtein)
ylabel( 'Human hexosaminidase A' );xlabel( 'Mouse hexosaminidase A' );


Figure 4: Dot plot comparison of the human gene and the mouse gene.

This plot shows a fairly obvious diagonal line; however there are also large areas of gray. These may be regions of interest or, given that there are only 20 amino acids, some of which occur very frequently, they might simply be noise. The noise can be filtered out by using a more stringent test, say three out of four amino acids around (i,j) must be identical for the (i,j) element to be set to 1.

figure
seqdotplot(humanProtein,mouseProtein,4,3)
ylabel( 'Human hexosaminidase A' );xlabel( 'Mouse hexosaminidase A' );


Figure 5: Refined dot plot of the human gene and the mouse gene.

This produces a much cleaner plot with a clear diagonal line and only a few other points. The diagonal line indicates that there is probably a good alignment between the two sequences. The next thing to do is to try to quantify this. You can do this global alignment using the function nwalign, which uses the Needleman-Wunsch algorithm for sequence alignment.

[score, globalAlignment] = nwalign(humanProtein,mouseProtein);

The first output, score, gives a numerical value of the information content of the alignment and the second output is a character array showing the optimal way that the two sequences align. In addition to looking for identical matches between two amino acids, the algorithms will look for matches between chemically similar amino acids.

The function showalignment displays the alignment in the Help Browser with matching and similar residues highlighted in different colors.

showalignment(globalAlignment);

Figure 6: Global alignment of the two genes. Click on image to see enlarged view.

Refining the alignment

The alignment is very good for the first 540 nucleotides after which the two sequences appear to be unrelated. Notice that there is a * in the sequence at this point. This corresponds to a "Stop" codon, the codon that tells the transcription mechanism in the cell to stop reading a gene. If the sequence is shortened so that only the regions up to the Stops are considered, then it seems likely that you will get a higher scoring alignment. Use the find command to look for the indices of the Stops in the sequence.

humanStops = find(humanProtein == '*');
mouseStops = find(mouseProtein == '*');

Use these indices to truncate the sequences at the STOPs.

humanSeq = humanProtein(1:humanStops(1));
mouseSeq = mouseProtein(1:mouseStops(1));

If you align these two sequences and then view them you will see a very good global alignment.

[score, alignment] = nwalign(humanSeq,mouseSeq);
showalignment(alignment);


Figure 7: Refined global alignment of the two genes. Click on image to see enlarged view.

Another approach to truncating the sequences is to use a local alignment. The function swalign performs local alignment using the Smith-Waterman algorithm. Instead of using the whole of both sequences and looking for the best possible way of aligning them, local alignment looks for the most closely related region in the two sequences and ignores all other parts of the sequences. This is very useful for looking for biologically important regions, as the most important regions in a gene are usually very stable even if evolution causes two organisms to diverge.

[score, localAlignment] = swalign(humanProtein,mouseProtein);
showalignment(localAlignment);



Figure 8: Local alignment of the two genes. Click on image to see enlarged view.

This alignment shows that there are very long sections of the two sequences that are very similar, with several regions of twenty or more amino acids in the sequences that are identical. This suggests that the mouse version of the gene might be a good model to use for investigating treatments or therapies for Tay-Sachs disease.

This example used two very closely related sequences that align very easily. This is not always the case, and as the similarity of two sequences decreases, the complexity of finding alignments between them increases. The algorithms used in the example rely on probabilistic scoring methods and will always give an answer but do not guarantee that this answer is statistically significant.

With the Bioinformatics Toolbox you have a powerful set of tools for studying the problems of sequence analysis and there are many more functions for working with sequence information as well as for analyzing microarray data and protein features.

For a full function list, go to
www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/

For more examples from the toolbox, go to
www.mathworks.com/products/bioinfo/demos.html

View Related Webinar MATLAB for Bioinformatics. Topics in this 58 minute recorded Webinar include:
  • Sequence analysis
  • Microarray data analysis
  • Bioinformatics algorithm deployment
Contact sales
Subscribe to newsletters