| Products & Services | Solutions | Academia | Support | User Community | Company |
| Download Product Updates | | | Get Pricing | | | Trial Software |
| Documentation → Bioinformatics Toolbox |
| Contents | Index |
| Learn more about Bioinformatics Toolbox |
| On this page… |
|---|
Finding a Model Organism to Study Retrieving Sequence Information from a Public Database Searching a Public Database for Related Genes |
Determining the similarity between two sequences is a common task in computational biology. Starting with a nucleotide sequence for a human gene, this example uses alignment algorithms to locate and verify a corresponding gene in a model organism.
The following procedure illustrates how to use the MATLAB Help browser to search the Web for information. In this example, you are interested in studying Tay-Sachs disease. Tay-Sachs is an autosomal recessive disease caused by the absence of the enzyme beta-hexosaminidase A (Hex A). This enzyme is responsible for the breakdown of gangliosides (GM2) in brain and nerve cells.
First, research information about Tay-Sachs and the enzyme that is associated with this disease, then find the nucleotide sequence for the human gene that codes for the enzyme, and finally find a corresponding gene in another organism to use as a model for study.
Use the MATLAB Help browser to explore the Web. In the MATLAB Command window, type
web('http://www.ncbi.nlm.nih.gov/')
The MATLAB Help browser opens with the home page for the NCBI Web site.
Search the NCBI Web site for information. For example, to search for Tay-Sachs, from the Search list, select NCBI Web Site, and in the for box, enter Tay-Sachs.

The NCBI Web search returns a list of links to relevant pages.

Select a result page. For example, click the link labeled Tay-Sachs Disease
A page in the genes and diseases section of the NCBI Web site opens. This section provides a comprehensive introduction to medical genetics. In particular, this page contains an introduction and pictorial representation of the enzyme Hex A and its role in the metabolism of the lipid GM2 ganglioside.

After completing your research, you have concluded the following:
The gene HEXA codes for the alpha subunit of the dimer enzyme hexosaminidase A (Hex A), while the gene HEXB codes for the beta subunit of the enzyme. A third gene, GM2A, codes for the activator protein GM2. However, it is a mutation in the gene HEXA that causes Tay-Sachs.
The following procedure illustrates how to find the nucleotide sequence for a human gene in a public database and read the sequence information into the MATLAB environment. Many public databases for nucleotide sequences (for example, GenBank, EMBL-EBI) are accessible from the Web. The MATLAB Command Window with the MATLAB Help browser provide an integrated environment for searching the Web and bringing sequence information into the MATLAB environment.
After you locate a sequence, you need to move the sequence data into the MATLAB Workspace.
Open the MATLAB Help browser to the NCBI Web site. In the MATLAB Command Widow, type
web('http://www.ncbi.nlm.nih.gov/')
The MATLAB Help browser window opens with the NCBI home page.
Search for the gene you are interested in studying. For example, from the Search list, select Nucleotide, and in the for box enter Tay-Sachs.

The search returns entries for the genes that code the alpha and beta subunits of the enzyme hexosaminidase A (Hex A), and the gene that codes the activator enzyme. The NCBI reference for the human gene HEXA has accession number NM_000520.

Get sequence data into the MATLAB environment. For example, to get sequence information for the human gene HEXA, type
humanHEXA = getgenbank('NM_000520')
Note Blank spaces in GenBank accession numbers use the underline character. Entering 'NM 00520' returns the wrong entry. |
The human gene is loaded into the MATLAB Workspace as a structure.
humanHEXA =
LocusName: 'NM_000520'
LocusSequenceLength: '2255'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'PRI'
LocusModificationDate: '13-AUG-2006'
Definition: 'Homo sapiens hexosaminidase A (alpha polypeptide) (HEXA), mRNA.'
Accession: 'NM_000520'
Version: 'NM_000520.2'
GI: '13128865'
Project: []
Keywords: []
Segment: []
Source: 'Homo sapiens (human)'
SourceOrganism: [4x65 char]
Reference: {1x58 cell}
Comment: [15x67 char]
Features: [74x74 char]
CDS: [1x1 struct]
Sequence: [1x2255 char]
SearchURL: [1x108 char]
RetrieveURL: [1x97 char]The following procedure illustrates how to find the nucleotide sequence for a mouse gene related to a human gene, and read the sequence information into the MATLAB environment. The sequence and function of many genes is conserved during the evolution of species through homologous genes. Homologous genes are genes that have a common ancestor and similar sequences. One goal of searching a public database is to find similar genes. If you are able to locate a sequence in a database that is similar to your unknown gene or protein, it is likely that the function and characteristics of the known and unknown genes are the same.
After finding the nucleotide sequence for a human gene, you can do a BLAST search or search in the genome of another organism for the corresponding gene. This procedure uses the mouse genome as an example.
Open the MATLAB Help browser to the NCBI Web site. In the MATLAB Command window, type
web('http://www.ncbi.nlm.nih.gov')
Search the nucleotide database for the gene or protein you are interested in studying. For example, from the Search list, select Nucleotide, and in the for box enter hexosaminidase A.

The search returns entries for the mouse and human genomes. The NCBI reference for the mouse gene HEXA has accession number AK080777.

Get sequence information for the mouse gene into the MATLAB environment. Type
mouseHEXA = getgenbank('AK080777')
The mouse gene sequence is loaded into the MATLAB Workspace as a structure.
mouseHEXA =
LocusName: 'AK080777'
LocusSequenceLength: '1839'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'HTC'
LocusModificationDate: '02-SEP-2005'
Definition: [1x150 char]
Accession: 'AK080777'
Version: 'AK080777.1'
GI: '26348756'
Project: []
Keywords: 'HTC; CAP trapper.'
Segment: []
Source: 'Mus musculus (house mouse)'
SourceOrganism: [4x65 char]
Reference: {1x8 cell}
Comment: [8x66 char]
Features: [33x74 char]
CDS: [1x1 struct]
Sequence: [1x1839 char]
SearchURL: [1x107 char]
RetrieveURL: [1x97 char]The following procedure illustrates how to convert a sequence from nucleotides to amino acids and identify the open reading frames. A nucleotide sequence includes regulatory sequences before and after the protein coding section. By analyzing this sequence, you can determine the nucleotides that code for the amino acids in the final protein.
After you have a list of genes you are interested in studying, you can determine the protein coding sequences. This procedure uses the human gene HEXA and mouse gene HEXA as an example.
If you did not retrieve gene data from the Web, you can load example data from a MAT-file included with the Bioinformatics Toolbox software. In the MATLAB Command window, type
load hexosaminidase
The structures humanHEXA and mouseHEXA load into the MATLAB Workspace.
Look for open reading frames in the human gene. For example, for the human gene HEXA, type
humanORFs=seqshoworfs(humanHEXA.Sequence)
seqshoworfs creates the output structure humanORFs. This structure gives the position of the start and stop codons for all open reading frames (ORFs) on each reading frame.
humanORFs =
1x3 struct array with fields:
Start
Stop
The Help browser opens with a listing for the three reading frames with the ORFs colored blue, red, and green. Notice that the longest ORF is on the third reading frame.

Locate open reading frames (ORFs) on the mouse gene. Type:
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
seqshoworfs creates the structure mouseORFS.
mouseORFs =
1x3 struct array with fields:
Start
Stop
The mouse gene shows the longest ORF on the first reading frame.

The following procedure illustrates how to use global and local alignment functions to compare two amino acid sequences. You could use alignment functions to look for similarities between two nucleotide sequences, but alignment functions return more biologically meaningful results when you are using amino acid sequences.
After you have located the open reading frames on your nucleotide sequences, you can convert the protein coding sections of the nucleotide sequences to their corresponding amino acid sequences, and then you can compare them for similarities.
Using the identified open reading frames, convert the DNA sequence to the amino acid sequences. Type
mouseProtein = nt2aa(mouseHEXA.Sequence)
Remember that the human HEXA gene was on the third reading frame, so you need to indicate which frame to use.
humanProtein = nt2aa(humanHEXA.Sequence,'frame',3)
Draw a dot plot comparing the human and mouse amino acid sequences. Type
seqdotplot(mouseProtein,humanProtein,4,3)
ylabel('Mouse hexosaminidase A (alpha subunit)')
xlabel('Human hexosaminidase A (alpha subunit)')
Dot plots are one of the easiest ways to look for similarity between sequences. The diagonal line shown below indicates that there may be a good alignment between the two sequences.

Globally align the two amino acid sequences, using the Needleman-Wunsch algorithm. Type
[GlobalScore, GlobalAlignment] = nwalign(humanProtein,...
mouseProtein)
showalignment(GlobalAlignment)
showalignment displays the global alignment of the two sequences in the Help browser. Notice that the calculated identity between the two sequences is 64.5 %.

The alignment is very good for the first 550 nucleotides, after which the two sequences appear to be unrelated. Notice that there is a stop (*) in the sequence at this point. If you shorten the sequence to include only the amino acids that are in the protein (after the first methionine and before the first stop) you might get a better alignment.
Trim the sequence from the first start amino acid (usually M) to the first stop (first *) and then try alignment again. Find the indices for the stops in the sequences.
humanStops = find(humanProtein == '*') humanStops = 538 550 652 661 669 mouseStops = find(mouseProtein =='*') mouseStops = 539 557 574 606
Looking at the amino acid sequence for humanProtein, the first M is at position 9, while the first M for the mouse protein is at 11.
Truncate the sequence to include only amino acids in the protein and the stop.
humanProteinORF = humanProtein(9:humanStops(1)); humanProteinORF = MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDV SSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVV TPGCNQLPTLESVENYTLTINDDQCLLLSETVWGALRGLETFSQLVWKSA EGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLSSILDTLDVMAYNKLNV FHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEYARLRG IRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEF MSTFFLEVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQ LESFYIQTLLDIVSSYGKGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNY MKELELVTKAGFRALLSAPWYLNRISYGPDWKDFYVVEPLAFEGTPEQKA LVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKLTSDLTFAYERL SHFRCELLRRGVQAQPLNVGFCEQEFEQT* mouseProteinORF = mouseProtein(11:mouseStops(1)) mouseProteinORF = MAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYTLYPNNFQFRYHV SSAAQAGCVVLDEAFRRYRNLLFGSGSWPRPSFSNKQQTLGKNILVVSVV TAECNEFPNLESVENYTLTINDDQCLLASETVWGALRGLETFSQLVWKSA EGTFFINKTKIKDFPRFPHRGVLLDTSRHYLPLSSILDTLDVMAYNKFNV FHWHLVDDSSFPYESFTFPELTRKGSFNPVTHIYTAQDVKEVIEYARLRG IRVLAEFDTPGHTLSWGPGAPGLLTPCYSGSHLSGTFGPVNPSLNSTYDF MSTLFLEISSVFPDFYLHLGGDEVDFTCWKSNPNIQAFMKKKGFTDFKQL ESFYIQTLLDIVSDYDKGYVVWQEVFDNKVKVRPDTIIQVWREEMPVEYM LEMQDITRAGFRALLSAPWYLNRVKYGPDWKDMYKVEPLAFHGTPEQKAL VIGGEACMWGEYVDSTNLVPRLWPRAGAVAERLWSSNLTTNIDFAFKRLS HFRCELVRRGIQAQPISVGCCEQEFEQT*
Globally align the trimmed amino acid sequences. Type
[Score, Alignment] = nwalign(humanProteinORF,...
mouseProteinORF);
showalignment(Alignment)
showalignment displays the results for the second global alignment. Notice that the percent identity for the untrimmed sequences is 54% and with trimmed sequences 83.3 percent.

Another way to truncate an amino acid sequence to only those amino acids in the protein is to first truncate the nucleotide sequence with indices from the function seqshoworfs. Remember that the ORF for the human HEXA gene was on the third reading frame, and the ORF for the mouse HEXA was on the first reading frame.
humanORFs = seqshoworfs(humanHEXA.Sequence);
mouseORFs = seqshoworfs(humanHEXA.Sequence);
humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(3).Start(1):...
humanORFs(3).Stop(1)))
mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):...
mouseORFs(1).Stop(1)))
[Scale, Alignment] = nwalign(humanPORF, mousePORF)
Show the alignment in the Help browser.
showalignment(Alignment)
The result from first truncating a nucleotide sequence before converting to an amino acid sequence is the same as the result from truncating the amino acid sequence after conversion. See the result in step 6.
An alternative method to working with subsequences is to use a local alignment function with the nontruncated sequences.
Locally align the two amino acid sequences using a Smith-Waterman algorithm. Type
[LocalScore, LocalAlignment] = swalign(humanProtein,...
mouseProtein)
LocalScore =
1057
LocalAlignment
RGDQR-AMTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYV . . .
|| | ||:: ||| |||||||:| ||||||||| :|| :||: . . .
RGAGRWAMAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYT . . .
swalign displays the local alignment of two sequences in the Help browser.
Show the alignment in color.
showalignment(LocalAlignment)

![]() | Example: Sequence Statistics | Sequence Tool | ![]() |

Includes the most popular MATLAB recorded presentations with Q&A sessions led by MATLAB experts.
| © 1984-2009- The MathWorks, Inc. - Site Help - Patents - Trademarks - Privacy Policy - Preventing Piracy - RSS |