Skip to Main Content Skip to Search
Home |   Select Country  Choose Country  |  Contact Us  |  Cart Store 
Create Account | Log In
Products & Services Solutions Academia Support User Community Company
spacer spacer spacer spacer spacer spacer

 

Bioinformatics Toolbox 3.4

Aligning Pairs of Sequences

Extracts some sequences from GenBank®, shows how to find open reading frames (ORFs), and then aligns the sequences using global and local alignment algorithms.

Contents

Using the Help Browser as a Web Browser

The Help Browser can be utilized to explore the web; for example, you can view the NCBI web site (http://www.ncbi.nlm.nih.gov/) from within MATLAB®.

web('http://www.ncbi.nlm.nih.gov/')

One of the many fascinating parts of the NCBI web site is the "Genes and diseases" section. This section provides a comprehensive introduction to medical genetics.

web('http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&
;rid=gnd')

In this demonstration you will be looking at genes associated with Tay-Sachs Disease.

web('http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&
;rid=gnd.section.220')

Tay-Sachs is an autosomal recessive disease caused by mutations in both alleles of a gene (HEXA, which codes for the alpha subunit of hexosaminidase A) on chromosome 15. The NCBI reference sequence for HEXA has accession number NM_000520.

web('http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=18
9181665')

Accessing NCBI Data from the MATLAB® Workspace

You can use the getgenbank function to read the sequence information into MATLAB.

humanHEXA = getgenbank('NM_000520')
humanHEXA =

                LocusName: 'NM_000520'
      LocusSequenceLength: '2437'
     LocusNumberofStrands: ''
            LocusTopology: 'linear'
        LocusMoleculeType: 'mRNA'
     LocusGenBankDivision: 'PRI'
    LocusModificationDate: '29-MAR-2009'
               Definition: [1x63 char]
                Accession: 'NM_000520'
                  Version: 'NM_000520.4'
                       GI: '189181665'
                  Project: []
                   DBLink: []
                 Keywords: []
                  Segment: []
                   Source: 'Homo sapiens (human)'
           SourceOrganism: [4x65 char]
                Reference: {1x10 cell}
                  Comment: [32x67 char]
                 Features: [150x74 char]
                      CDS: [1x1 struct]
                 Sequence: [1x2437 char]
                SearchURL: [1x70 char]
              RetrieveURL: [1x104 char]

By doing a BLAST search or by searching in the mouse genome you will find that AK080777 is a mouse hexosaminidase A gene.

mouseHEXA = getgenbank('AK080777')
mouseHEXA =

                LocusName: 'AK080777'
      LocusSequenceLength: '1839'
     LocusNumberofStrands: ''
            LocusTopology: 'linear'
        LocusMoleculeType: 'mRNA'
     LocusGenBankDivision: 'HTC'
    LocusModificationDate: '19-SEP-2008'
               Definition: [1x150 char]
                Accession: 'AK080777'
                  Version: 'AK080777.1'
                       GI: '26348756'
                  Project: []
                   DBLink: []
                 Keywords: 'HTC; HTC_FLI; CAP trapper.'
                  Segment: []
                   Source: 'Mus musculus (house mouse)'
           SourceOrganism: [4x65 char]
                Reference: {1x8 cell}
                  Comment: [8x66 char]
                 Features: [33x74 char]
                      CDS: [1x1 struct]
                 Sequence: [1x1839 char]
                SearchURL: [1x69 char]
              RetrieveURL: [1x103 char]

If you don't have a live web connection, you can load the data from a MAT-file using the command

%load hexosaminidase        % <== Uncomment this if no live connection

Exploring the Open Reading Frames (ORFs)

You can use the function seqshoworfs to look for ORFs in the sequence for the human HEXA gene. Notice that the longest ORF is on the first reading frame. The output value in the variable 'humanORFs' is a structure giving the position of the start and stop codons for all the ORFs on each reading frame.

humanORFs = seqshoworfs(humanHEXA.Sequence)
humanORFs =

1x3 struct array with fields:
    Start
    Stop

Now look at the ORFs in the mouse HEXA gene. In this case the ORF is on frame 1.

mouseORFs = seqshoworfs(mouseHEXA.Sequence)
mouseORFs =

1x3 struct array with fields:
    Start
    Stop

Aligning the Sequences

The first step is to use global sequence alignment to look for similarities between these sequences. You could look at the alignment between the nucleotide sequences but it is generally more instructive to look at the alignment between the protein sequences. So the first thing that you need to do is to convert the nucleotide sequences into the corresponding amino acid sequences. Use the nt2aa function to do this.

humanProtein = nt2aa(humanHEXA.Sequence);
mouseProtein = nt2aa(mouseHEXA.Sequence);

One of the easiest ways to look for similarity between sequences is with a dot plot.

seqdotplot(humanProtein,mouseProtein)
ylabel('Human hexosaminidase A');xlabel('Mouse hexosaminidase A');
Warning: Match matrix has more points than available screen pixels.
         Scaling image by factors of 1 in X and 2 in Y

With the default settings, the dot plot is a little difficult to interpret, so you can try a slightly more stringent dot plot.

seqdotplot(humanProtein,mouseProtein,4,3)
ylabel('Human hexosaminidase A');xlabel('Mouse hexosaminidase A');
Warning: Match matrix has more points than available screen pixels.
         Scaling image by factors of 1 in X and 2 in Y

The diagonal line indicates that there is probably a good alignment so you can now take a look at the global alignment using the function nwalign which uses the Needleman-Wunsch algorithm.

[score, globalAlignment] = nwalign(humanProtein,mouseProtein);

The function showalignment displays the alignment in the Help Browser with matching and similar residues highlighted in different colors.

showalignment(globalAlignment);

Refining the Alignment

The alignment is very good for the first 560 nucleotides after which the two sequences appear to be unrelated. Notice that there is a STOP ( * ) in the sequence at this point. If the sequence is shortened so that only the regions up until the STOPs are considered, then it seems likely that you will get a better alignment. Use the find command to look for the indices of the STOPs in the sequence.

humanStops = find(humanProtein == '*')
mouseStops = find(mouseProtein == '*')
humanStops =

    41   599   611   713   722   730


mouseStops =

   539   557   574   606

Use these indices to truncate the sequences at the STOPs. Be careful as there is also a STOP at the very beginning of the humanProtein sequence.

humanSeq = humanProtein(1:humanStops(2));
humanSeqFormatted = seqdisp(humanSeq)
mouseSeq = mouseProtein(1:mouseStops(1));
mouseSeqFormatted = seqdisp(mouseSeq)
humanSeqFormatted =

  1  SCRRPAQSAA RSRSLRSRPE VKGQGVGPPG VAGAEPPLVT *FADKSRGRR SPDQGLTWPA
 61  PSERGDQRAM TSSRLWFSLL LAAAFAGRAT ALWPWPQNFQ TSDQRYVLYP NNFQFQYDVS
121  SAAQPGCSVL DEAFQRYRDL LFGSGSWPRP YLTGKRHTLE KNVLVVSVVT PGCNQLPTLE
181  SVENYTLTIN DDQCLLLSET VWGALRGLET FSQLVWKSAE GTFFINKTEI EDFPRFPHRG
241  LLLDTSRHYL PLSSILDTLD VMAYNKLNVF HWHLVDDPSF PYESFTFPEL MRKGSYNPVT
301  HIYTAQDVKE VIEYARLRGI RVLAEFDTPG HTLSWGPGIP GLLTPCYSGS EPSGTFGPVN
361  PSLNNTYEFM STFFLEVSSV FPDFYLHLGG DEVDFTCWKS NPEIQDFMRK KGFGEDFKQL
421  ESFYIQTLLD IVSSYGKGYV VWQEVFDNKV KIQPDTIIQV WREDIPVNYM KELELVTKAG
481  FRALLSAPWY LNRISYGPDW KDFYIVEPLA FEGTPEQKAL VIGGEACMWG EYVDNTNLVP
541  RLWPRAGAVA ERLWSNKLTS DLTFAYERLS HFRCELLRRG VQAQPLNVGF CEQEFEQT*


mouseSeqFormatted =

  1  AAGRGAGRWA MAGCRLWVSL LLAAALACLA TALWPWPQYI QTYHRRYTLY PNNFQFRYHV
 61  SSAAQAGCVV LDEAFRRYRN LLFGSGSWPR PSFSNKQQTL GKNILVVSVV TAECNEFPNL
121  ESVENYTLTI NDDQCLLASE TVWGALRGLE TFSQLVWKSA EGTFFINKTK IKDFPRFPHR
181  GVLLDTSRHY LPLSSILDTL DVMAYNKFNV FHWHLVDDSS FPYESFTFPE LTRKGSFNPV
241  THIYTAQDVK EVIEYARLRG IRVLAEFDTP GHTLSWGPGA PGLLTPCYSG SHLSGTFGPV
301  NPSLNSTYDF MSTLFLEISS VFPDFYLHLG GDEVDFTCWK SNPNIQAFMK KKGFTDFKQL
361  ESFYIQTLLD IVSDYDKGYV VWQEVFDNKV KVRPDTIIQV WREEMPVEYM LEMQDITRAG
421  FRALLSAPWY LNRVKYGPDW KDMYKVEPLA FHGTPEQKAL VIGGEACMWG EYVDSTNLVP
481  RLWPRAGAVA ERLWSSNLTT NIDFAFKRLS HFRCELVRRG IQAQPISVGC CEQEFEQT*

If you align these two sequences and then view them you will see a very good global alignment.

[score, alignment] = nwalign(humanSeq,mouseSeq);
showalignment(alignment);

This is still not a perfect alignment at the beginning of the sequence. In order to align the sequences starting with the Met1 we can go back to the information about the ORFs in the nucleotide sequences.

humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):humanORFs(1).Stop
(1)));
mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):mouseORFs(1).Stop
(1)));
[score, ORFAlignment] = nwalign(humanPORF,mousePORF);
showalignment(ORFAlignment);

Alternatively, you can use the coding region information (CDS) from the GenBank data structure to find the coding region of the genes. You can also get the translation of the coding regions from this structure.

humanCodingRegion = humanHEXA.Sequence(...
            humanHEXA.CDS.indices(1):humanHEXA.CDS.indices(2));
mouseCodingRegion = mouseHEXA.Sequence(...
    mouseHEXA.CDS.indices(1):mouseHEXA.CDS.indices(2));

humanTranslatedRegion = humanHEXA.CDS.translation;
mouseTranslatedRegion = mouseHEXA.CDS.translation;

Local Alignment

Instead of truncating the sequences to look for better alignment, an alternative approach is to use a local alignment. The function swalign performs local alignment using the Smith-Waterman algorithm. This shows a very good alignment for the whole coding region and reasonable similarity for a few residues beyond at both the ends of the gene.

[score, localAlignment] = swalign(humanProtein,mouseProtein);
showalignment(localAlignment);

Alignment of Complementary DNA Sequences

Aligning sequences by complement and not by identity is done with a user defined scoring matrix. In this case a score of +1 is given for complements and a -1 otherwise. The scoring matrix that is defined below has rows and columns with nucleotide order: A C G T. The first 30 nucleotides from the mouse HEXA gene will be aligned to its complement.

scoring_matrix = [-1,-1,-1,1;-1,-1,1,-1;-1,1,-1,-1;1,-1,-1,-1]
[score, compAlignment] = nwalign(mouseHEXA.Sequence(1:30), ...
    seqcomplement(mouseHEXA.Sequence(1:30)), 'SCORINGMATRIX', ...
        scoring_matrix, 'ALPHABET', 'NT')
scoring_matrix =

    -1    -1    -1     1
    -1    -1     1    -1
    -1     1    -1    -1
     1    -1    -1    -1


score =

    30


compAlignment =

GCTGCTGGAAGGGGAGCTGGCCGGTGGGCC
::::::::::::::::::::::::::::::
CGACGACCTTCCCCTCGACCGGCCACCCGG

Suggest an enhancement for this demo.

Contact sales
Free technical kit
Trial software
E-mail this page

Get Pricing and
Licensing Options