Bioinformatics Toolbox 3.4
Aligning Pairs of Sequences
Extracts some sequences from GenBank®, shows how to find open reading frames (ORFs), and then aligns the sequences using global and local alignment algorithms.
Contents
Using the Help Browser as a Web Browser
The Help Browser can be utilized to explore the web; for example, you can view the NCBI web site (http://www.ncbi.nlm.nih.gov/) from within MATLAB®.
web('http://www.ncbi.nlm.nih.gov/')
One of the many fascinating parts of the NCBI web site is the "Genes and diseases" section. This section provides a comprehensive introduction to medical genetics.
web('http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&
;rid=gnd')
In this demonstration you will be looking at genes associated with Tay-Sachs Disease.
web('http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&
;rid=gnd.section.220')
Tay-Sachs is an autosomal recessive disease caused by mutations in both alleles of a gene (HEXA, which codes for the alpha subunit of hexosaminidase A) on chromosome 15. The NCBI reference sequence for HEXA has accession number NM_000520.
web('http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=18
9181665')
Accessing NCBI Data from the MATLAB® Workspace
You can use the getgenbank function to read the sequence information into MATLAB.
humanHEXA = getgenbank('NM_000520')
humanHEXA =
LocusName: 'NM_000520'
LocusSequenceLength: '2437'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'PRI'
LocusModificationDate: '29-MAR-2009'
Definition: [1x63 char]
Accession: 'NM_000520'
Version: 'NM_000520.4'
GI: '189181665'
Project: []
DBLink: []
Keywords: []
Segment: []
Source: 'Homo sapiens (human)'
SourceOrganism: [4x65 char]
Reference: {1x10 cell}
Comment: [32x67 char]
Features: [150x74 char]
CDS: [1x1 struct]
Sequence: [1x2437 char]
SearchURL: [1x70 char]
RetrieveURL: [1x104 char]
By doing a BLAST search or by searching in the mouse genome you will find that AK080777 is a mouse hexosaminidase A gene.
mouseHEXA = getgenbank('AK080777')
mouseHEXA =
LocusName: 'AK080777'
LocusSequenceLength: '1839'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'HTC'
LocusModificationDate: '19-SEP-2008'
Definition: [1x150 char]
Accession: 'AK080777'
Version: 'AK080777.1'
GI: '26348756'
Project: []
DBLink: []
Keywords: 'HTC; HTC_FLI; CAP trapper.'
Segment: []
Source: 'Mus musculus (house mouse)'
SourceOrganism: [4x65 char]
Reference: {1x8 cell}
Comment: [8x66 char]
Features: [33x74 char]
CDS: [1x1 struct]
Sequence: [1x1839 char]
SearchURL: [1x69 char]
RetrieveURL: [1x103 char]
If you don't have a live web connection, you can load the data from a MAT-file using the command
%load hexosaminidase % <== Uncomment this if no live connection
Exploring the Open Reading Frames (ORFs)
You can use the function seqshoworfs to look for ORFs in the sequence for the human HEXA gene. Notice that the longest ORF is on the first reading frame. The output value in the variable 'humanORFs' is a structure giving the position of the start and stop codons for all the ORFs on each reading frame.
humanORFs = seqshoworfs(humanHEXA.Sequence)
humanORFs =
1x3 struct array with fields:
Start
Stop
Now look at the ORFs in the mouse HEXA gene. In this case the ORF is on frame 1.
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
mouseORFs =
1x3 struct array with fields:
Start
Stop
Aligning the Sequences
The first step is to use global sequence alignment to look for similarities between these sequences. You could look at the alignment between the nucleotide sequences but it is generally more instructive to look at the alignment between the protein sequences. So the first thing that you need to do is to convert the nucleotide sequences into the corresponding amino acid sequences. Use the nt2aa function to do this.
humanProtein = nt2aa(humanHEXA.Sequence); mouseProtein = nt2aa(mouseHEXA.Sequence);
One of the easiest ways to look for similarity between sequences is with a dot plot.
seqdotplot(humanProtein,mouseProtein) ylabel('Human hexosaminidase A');xlabel('Mouse hexosaminidase A');
Warning: Match matrix has more points than available screen pixels.
Scaling image by factors of 1 in X and 2 in Y
With the default settings, the dot plot is a little difficult to interpret, so you can try a slightly more stringent dot plot.
seqdotplot(humanProtein,mouseProtein,4,3) ylabel('Human hexosaminidase A');xlabel('Mouse hexosaminidase A');
Warning: Match matrix has more points than available screen pixels.
Scaling image by factors of 1 in X and 2 in Y
The diagonal line indicates that there is probably a good alignment so you can now take a look at the global alignment using the function nwalign which uses the Needleman-Wunsch algorithm.
[score, globalAlignment] = nwalign(humanProtein,mouseProtein);
The function showalignment displays the alignment in the Help Browser with matching and similar residues highlighted in different colors.
showalignment(globalAlignment);
Refining the Alignment
The alignment is very good for the first 560 nucleotides after which the two sequences appear to be unrelated. Notice that there is a STOP ( * ) in the sequence at this point. If the sequence is shortened so that only the regions up until the STOPs are considered, then it seems likely that you will get a better alignment. Use the find command to look for the indices of the STOPs in the sequence.
humanStops = find(humanProtein == '*') mouseStops = find(mouseProtein == '*')
humanStops =
41 599 611 713 722 730
mouseStops =
539 557 574 606
Use these indices to truncate the sequences at the STOPs. Be careful as there is also a STOP at the very beginning of the humanProtein sequence.
humanSeq = humanProtein(1:humanStops(2)); humanSeqFormatted = seqdisp(humanSeq) mouseSeq = mouseProtein(1:mouseStops(1)); mouseSeqFormatted = seqdisp(mouseSeq)
humanSeqFormatted = 1 SCRRPAQSAA RSRSLRSRPE VKGQGVGPPG VAGAEPPLVT *FADKSRGRR SPDQGLTWPA 61 PSERGDQRAM TSSRLWFSLL LAAAFAGRAT ALWPWPQNFQ TSDQRYVLYP NNFQFQYDVS 121 SAAQPGCSVL DEAFQRYRDL LFGSGSWPRP YLTGKRHTLE KNVLVVSVVT PGCNQLPTLE 181 SVENYTLTIN DDQCLLLSET VWGALRGLET FSQLVWKSAE GTFFINKTEI EDFPRFPHRG 241 LLLDTSRHYL PLSSILDTLD VMAYNKLNVF HWHLVDDPSF PYESFTFPEL MRKGSYNPVT 301 HIYTAQDVKE VIEYARLRGI RVLAEFDTPG HTLSWGPGIP GLLTPCYSGS EPSGTFGPVN 361 PSLNNTYEFM STFFLEVSSV FPDFYLHLGG DEVDFTCWKS NPEIQDFMRK KGFGEDFKQL 421 ESFYIQTLLD IVSSYGKGYV VWQEVFDNKV KIQPDTIIQV WREDIPVNYM KELELVTKAG 481 FRALLSAPWY LNRISYGPDW KDFYIVEPLA FEGTPEQKAL VIGGEACMWG EYVDNTNLVP 541 RLWPRAGAVA ERLWSNKLTS DLTFAYERLS HFRCELLRRG VQAQPLNVGF CEQEFEQT* mouseSeqFormatted = 1 AAGRGAGRWA MAGCRLWVSL LLAAALACLA TALWPWPQYI QTYHRRYTLY PNNFQFRYHV 61 SSAAQAGCVV LDEAFRRYRN LLFGSGSWPR PSFSNKQQTL GKNILVVSVV TAECNEFPNL 121 ESVENYTLTI NDDQCLLASE TVWGALRGLE TFSQLVWKSA EGTFFINKTK IKDFPRFPHR 181 GVLLDTSRHY LPLSSILDTL DVMAYNKFNV FHWHLVDDSS FPYESFTFPE LTRKGSFNPV 241 THIYTAQDVK EVIEYARLRG IRVLAEFDTP GHTLSWGPGA PGLLTPCYSG SHLSGTFGPV 301 NPSLNSTYDF MSTLFLEISS VFPDFYLHLG GDEVDFTCWK SNPNIQAFMK KKGFTDFKQL 361 ESFYIQTLLD IVSDYDKGYV VWQEVFDNKV KVRPDTIIQV WREEMPVEYM LEMQDITRAG 421 FRALLSAPWY LNRVKYGPDW KDMYKVEPLA FHGTPEQKAL VIGGEACMWG EYVDSTNLVP 481 RLWPRAGAVA ERLWSSNLTT NIDFAFKRLS HFRCELVRRG IQAQPISVGC CEQEFEQT*
If you align these two sequences and then view them you will see a very good global alignment.
[score, alignment] = nwalign(humanSeq,mouseSeq); showalignment(alignment);
This is still not a perfect alignment at the beginning of the sequence. In order to align the sequences starting with the Met1 we can go back to the information about the ORFs in the nucleotide sequences.
humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):humanORFs(1).Stop (1))); mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):mouseORFs(1).Stop (1))); [score, ORFAlignment] = nwalign(humanPORF,mousePORF); showalignment(ORFAlignment);
Alternatively, you can use the coding region information (CDS) from the GenBank data structure to find the coding region of the genes. You can also get the translation of the coding regions from this structure.
humanCodingRegion = humanHEXA.Sequence(... humanHEXA.CDS.indices(1):humanHEXA.CDS.indices(2)); mouseCodingRegion = mouseHEXA.Sequence(... mouseHEXA.CDS.indices(1):mouseHEXA.CDS.indices(2)); humanTranslatedRegion = humanHEXA.CDS.translation; mouseTranslatedRegion = mouseHEXA.CDS.translation;
Local Alignment
Instead of truncating the sequences to look for better alignment, an alternative approach is to use a local alignment. The function swalign performs local alignment using the Smith-Waterman algorithm. This shows a very good alignment for the whole coding region and reasonable similarity for a few residues beyond at both the ends of the gene.
[score, localAlignment] = swalign(humanProtein,mouseProtein); showalignment(localAlignment);
Alignment of Complementary DNA Sequences
Aligning sequences by complement and not by identity is done with a user defined scoring matrix. In this case a score of +1 is given for complements and a -1 otherwise. The scoring matrix that is defined below has rows and columns with nucleotide order: A C G T. The first 30 nucleotides from the mouse HEXA gene will be aligned to its complement.
scoring_matrix = [-1,-1,-1,1;-1,-1,1,-1;-1,1,-1,-1;1,-1,-1,-1] [score, compAlignment] = nwalign(mouseHEXA.Sequence(1:30), ... seqcomplement(mouseHEXA.Sequence(1:30)), 'SCORINGMATRIX', ... scoring_matrix, 'ALPHABET', 'NT')
scoring_matrix =
-1 -1 -1 1
-1 -1 1 -1
-1 1 -1 -1
1 -1 -1 -1
score =
30
compAlignment =
GCTGCTGGAAGGGGAGCTGGCCGGTGGGCC
::::::::::::::::::::::::::::::
CGACGACCTTCCCCTCGACCGGCCACCCGG
Store