Products & Services Solutions Academia Support User Community Company

Learn more about Bioinformatics Toolbox   

Features and Functions

Data Formats and Databases

The toolbox accesses many of the databases on the Web and other online data sources. It allows you to copy data into the MATLAB Workspace, and read and write to files with standard bioinformatic formats. It also reads many common genome file formats, so that you do not have to write and maintain your own file readers.

Web-based databases — You can directly access public databases on the Web and copy sequence and gene expression information into the MATLAB environment.

The sequence databases currently supported are GenBank® (getgenbank), GenPept (getgenpept), European Molecular Biology Laboratory (EMBL) (getembl), and Protein Data Bank (PDB) (getpdb). You can also access data from the NCBI Gene Expression Omnibus (GEO) Web site by using a single function (getgeodata).

Get multiply aligned sequences (gethmmalignment), hidden Markov model profiles (gethmmprof), and phylogenetic tree data (gethmmtree) from the PFAM database.

Gene Ontology database — Load the database from the Web into a gene ontology object (geneont). Select sections of the ontology with methods for the geneont object (geneont.getancestors, geneont.getdescendants, geneont.getmatrix, geneont.getrelatives), and manipulate data with utility functions (goannotread, num2goid).

Read data from instruments — Read data generated from gene sequencing instruments (scfread, joinseq, traceplot), mass spectrometers (jcampread), and Agilent® microarray scanners (agferead).

Reading data formats — The toolbox provides a number of functions for reading data from common bioinformatic file formats.

Writing data formats — The functions for getting data from the Web include the option to save the data to a file. However, there is a function to write data to a file using the FASTA format (fastawrite).

BLAST searches — Request Web-based BLAST searches (blastncbi), get the results from a search (getblast) and read results from a previously saved BLAST formatted report file (blastread).

The MATLAB environment has built-in support for other industry-standard file formats including Microsoft Excel and comma-separated-value (CSV) files. Additional functions perform ASCII and low-level binary I/O, allowing you to develop custom functions for working with any data format.

Sequence Alignments

You can select from a list of analysis methods to compare nucleotide or amino acid sequences using pairwise or multiple sequence alignment functions.

Pairwise sequence alignment — Efficient implementations of standard algorithms such as the Needleman-Wunsch (nwalign) and Smith-Waterman (swalign) algorithms for pairwise sequence alignment. The toolbox also includes standard scoring matrices such as the PAM and BLOSUM families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.

Multiple sequence alignment — Functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (multialignviewer) for viewing the results of a multiple sequence alignment and manually making adjustment.

Multiple sequence profiles — Implementations for multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).

Biological codes — Look up the letters or numeric equivalents for commonly used biological codes (aminolookup, baselookup, geneticcode, revgeneticcode).

Sequence Utilities and Statistics

You can manipulate and analyze your sequences to gain a deeper understanding of the physical, chemical, and biological characteristics of your data. Use a graphical user interface (GUI) with many of the sequence functions in the toolbox (seqtool).

Sequence conversion and manipulation — The toolbox provides routines for common operations, such as converting DNA or RNA sequences to amino acid sequences, that are basic to working with nucleic acid and protein sequences (aa2int, aa2nt, dna2rna, rna2dna, int2aa, int2nt, nt2aa, nt2int, seqcomplement, seqrcomplement, seqreverse).

You can manipulate your sequence by performing an in silico digestion with restriction endonucleases (restrict) and proteases (cleave).

Sequence statistics — Determine various statistics about a sequence (aacount, basecount, codoncount, dimercount, nmercount, ntdensity, codonbias, cpgisland, oligoprop), search for specific patterns within a sequence (seqshowwords, seqwordcount), or search for open reading frames (seqshoworfs). In addition, you can create random sequences for test cases (randseq).

Sequence utilities — Determine a consensus sequence from a set of multiply aligned amino acid, nucleotide sequences (seqconsensus, or a sequence profile (seqprofile). Format a sequence for display (seqdisp) or graphically show a sequence alignment with frequency data (seqlogo).

Additional MATLAB functions efficiently handle string operations with regular expressions (regexp, seq2regexp) to look for specific patterns in a sequence and search through a library for string matches (seqmatch).

Look for possible cleavage sites in a DNA/RNA sequence by searching for palindromes (palindromes).

Protein Property Analysis

You can use a collection of protein analysis methods to extract information from your data. You can determine protein characteristics and simulate enzyme cleavage reactions. The toolbox provides functions to calculate various properties of a protein sequence, such as the atomic composition (atomiccomp), molecular weight (molweight), and isoelectric point (isoelectric). You can cleave a protein with an enzyme (cleave, rebasecuts) and create distance and Ramachandran plots for PDB data (pdbdistplot, ramachandran). The toolbox contains a graphical user interface for protein analysis (proteinplot) and plotting 3-D protein and other molecular structures with information from molecule model files, such as PDB files (molviewer).

Amino acid sequence utilities — Calculate amino acid statistics for a sequence (aacount) and get information about character codes (aminolookup).

Phylogenetic Analysis

You can use functions for phylogenetic tree building and analysis. There is also a GUI to draw phylograms (trees).

Phylogenetic tree data — Read and write Newick-formatted tree files (phytreeread, phytreewrite) into the MATLAB Workspace as phylogenetic tree objects (phytree).

Create a phylogenetic tree — Calculate the pairwise distance between biological sequences (seqpdist), estimate the substitution rates (dnds, dndsml), build a phylogenetic tree from pairwise distances (seqlinkage, seqneighjoin, reroot), and view the tree in an interactive GUI that allows you to view, edit, and explore the data (phytreetool or view). This GUI also allows you to prune branches, reorder, rename, and explore distances.

Phylogenetic tree object methods — You can access the functionality of the phytreetool GUI using methods for a phylogenetic tree object (phytree). Get property values (get) and node names (getbyname). Calculate the patristic distances between pairs of leaf nodes (pdist, weights) and draw a phylogenetic tree object in a MATLAB Figure window as a phylogram, cladogram, or radial treeplot (plot). Manipulate tree data by selecting branches and leaves using a specified criterion (select, subtree) and removing nodes (prune). Compare trees (getcanonical) and use Newick-formatted strings (getnewickstr).

Microarray Data Analysis

The MATLAB environment is widely used for microarray data analysis, including reading, filtering, normalizing, and visualizing microarray data. However, the standard normalization and visualization tools that scientists use can be difficult to implement. The toolbox includes these standard functions:

Microarray data — Read Affymetrix GeneChip files (affyread) and plot data (probesetplot), ImaGene results files (imageneread), SPOT files (sptread) and Agilent microarray scanner files (agferead). Read GenePix GPR files (gprread) and GAL files (galread). Get Gene Expression Omnibus (GEO) data from the Web (getgeodata) and read GEO data from files (geosoftread).

A utility function (magetfield) extracts data from one of the microarray reader functions (gprread, agferead, sptread, imageneread).

Microarray normalization and filtering — The toolbox provides a number of methods for normalizing microarray data, such as lowess normalization (malowess) and mean normalization (manorm), or across multiple arrays (quantilenorm). You can use filtering functions to clean raw data before analysis (geneentropyfilter, genelowvalfilter, generangefilter, genevarfilter), and calculate the range and variance of values (exprprofrange, exprprofvar).

Microarray visualization — The toolbox contains routines for visualizing microarray data. These routines include spatial plots of microarray data (maimage, redgreencmap), box plots (maboxplot), loglog plots (maloglog), and intensity-ratio plots (mairplot). You can also view clustered expression profiles (clustergram, redgreencmap). You can create 2-D scatter plots of principal components from the microarray data (mapcaplot).

Microarray utility functions — Use the following functions to work with Affymetrix GeneChip data sets. Get library information for a probe (probelibraryinfo), gene information from a probe set (probesetlookup), and probe set values from CEL and CDF information (probesetvalues). Show probe set information from NetAffx™ Analysis Center (probesetlink) and plot probe set values (probesetplot).

The toolbox accesses statistical routines to perform cluster analysis and to visualize the results, and you can view your data through statistical visualizations such as dendrograms, classification, and regression trees.

Microarray Data Storage

The toolbox includes functions, objects, and methods for creating, storing, and accessing microarray data.

The object constructor function, DataMatrix, lets you create a DataMatrix object to encapsulate data and metadata from a microarray experiment. A DataMatrix object stores experimental data in a matrix, with rows typically corresponding to gene names or probe identifiers, and columns typically corresponding to sample identifiers. A DataMatrix object also stores metadata, including the gene names or probe identifiers (as the row names) and sample identifiers (as the column names).

You can reference microarray expression values in a DataMatrix object the same way you reference data in a MATLAB array, that is, by using linear or logical indexing. Alternately, you can reference this experimental data by gene (probe) identifiers and sample identifiers. Indexing by these identifiers lets you quickly and conveniently access subsets of the data without having to maintain additional index arrays.

Many MATLAB operators and arithmetic functions are available to DataMatrix objects by means of methods. These methods let you modify, combine, compare, analyze, plot, and access information from DataMatrix objects. Additionally, you can easily extend the functionality by using general element-wise functions, dmarrayfun and dmbsxfun, and by manually accessing the properties of a DataMatrix object.

Mass Spectrometry Data Analysis

The mass spectrometry functions preprocess and classify raw data from SELDI-TOF and MALDI-TOF spectrometers and use statistical learning functions to identify patterns.

Reading raw data — Load raw mass/charge and ion intensity data from comma-separated-value (CSV) files, or read a JCAMP-DX-formatted file with mass spectrometry data (jcampread) into the MATLAB environment.

You can also have data in TXT files and use the importdata function.

Preprocessing raw data — Resample high-resolution data to a lower resolution (msresample) where the extra data points are not needed. Correct the baseline (msbackadj). Align a spectrum to a set of reference masses (msalign) and visually verify the alignment (msheatmap). Normalize the area between spectra for comparing (msnorm), and filter out noise (mslowess and mssgolay).

Spectrum analysis — Load spectra into a GUI (msviewer) for selecting mass peaks and further analysis.

The following graphic illustrates the roles of the various mass spectrometry functions in the toolbox.

Graph Theory Functions

Graph theory functions in the toolbox apply basic graph theory algorithms to sparse matrices. A sparse matrix represents a graph, any nonzero entries in the matrix represent the edges of the graph, and the values of these entries represent the associated weight (cost, distance, length, or capacity) of the edge. Graph algorithms that use the weight information will cancel the edge if a NaN or an Inf is found. Graph algorithms that do not use the weight information will consider the edge if a NaN or an Inf is found, because these algorithms look only at the connectivity described by the sparse matrix and not at the values stored in the sparse matrix.

Sparse matrices can represent four types of graphs:

There are no attributes attached to the graphs; sparse matrices representing all four types of graphs can be passed to any graph algorithm. All functions will return an error on nonsquare sparse matrices.

Graph algorithms do not pretest for graph properties because such tests can introduce a time penalty. For example, there is an efficient shortest path algorithm for DAG, however testing if a graph is acyclic is expensive compared to the algorithm. Therefore, it is important to select a graph theory function and properties appropriate for the type of the graph represented by your input matrix. If the algorithm receives a graph type that differs from what it expects, it will either:

The graph theory functions include graphallshortestpaths, graphconncomp, graphisdag, graphisomorphism, graphisspantree, graphmaxflow, graphminspantree, graphpred2path, graphshortestpath, graphtopoorder, and graphtraverse.

Graph Visualization

The toolbox includes functions, objects, and methods for creating, viewing, and manipulating graphs such as interactive maps, hierarchy plots, and pathways. This allows you to view relationships between data.

The object constructor function (biograph) lets you create a biograph object to hold graph data. Methods of the biograph object let you calculate the position of nodes (dolayout), draw the graph (view), get handles to the nodes and edges (getnodesbyid and getedgesbynodeid) to further query information, and find relations between the nodes (getancestors, getdescendants, and getrelatives). There are also methods that apply basic graph theory algorithms to the biograph object.

Various properties of a biograph object let you programmatically change the properties of the rendered graph. You can customize the node representation, for example, drawing pie charts inside every node (CustomNodeDrawFcn). Or you can associate your own callback functions to nodes and edges of the graph, for example, opening a Web page with more information about the nodes (NodeCallback and EdgeCallback).

Statistical Learning and Visualization

You can classify and identify features in data sets, set up cross-validation experiments, and compare different classification methods.

The toolbox provides functions that build on the classification and statistical learning tools in the Statistics Toolbox software (classify, kmeans, and treefit).

These functions include imputation tools (knnimpute), support vector machine classifiers (svmclassify, svmtrain) and K-nearest neighbor classifiers (knnclassify).

Other functions include set up of cross-validation experiments (crossvalind) and comparison of the performance of different classification methods (classperf). In addition, there are tools for selecting diversity and discriminating features (rankfeatures, randfeatures).

Prototyping and Development Environment

The MATLAB environment lets you prototype and develop algorithms and easily compare alternatives.

Data Visualization

You can visually compare pairwise sequence alignments, multiply aligned sequences, gene expression data from microarrays, and plot nucleic acid and protein characteristics. The 2-D and volume visualization features let you create custom graphical representations of multidimensional data sets. You can also create montages and overlays, and export finished graphics to an Adobe® PostScript® image file or copy directly into Microsoft® PowerPoint®.

Algorithm Sharing and Application Deployment

The open MATLAB environment lets you share your analysis solutions with other users, and it includes tools to create custom software applications. With the addition of MATLAB Compiler software, you can create standalone applications independent of the MATLAB environment, and, with the addition of MATLAB Builder NE software, you can create GUIs and standalone applications within other programming environments.

  


Recommended Products

Includes the most popular MATLAB recorded presentations with Q&A sessions led by MATLAB experts.

 © 1984-2009- The MathWorks, Inc.    -   Site Help   -   Patents   -   Trademarks   -   Privacy Policy   -   Preventing Piracy   -   RSS