Features and Functions

Data Formats and Databases

The toolbox accesses many of the databases on the Web and other online data sources. It allows you to copy data into the MATLAB® Workspace, and read and write to files with standard bioinformatic formats. It also reads many common genome file formats, so that you do not have to write and maintain your own file readers.

Web-based databases — You can directly access public databases on the Web and copy sequence and gene expression information into the MATLAB environment.

The sequence databases currently supported are GenBank® (getgenbank), GenPept (getgenpept), European Molecular Biology Laboratory (EMBL) (getembl), and Protein Data Bank (PDB) (getpdb). You can also access data from the NCBI Gene Expression Omnibus (GEO) Web site by using a single function (getgeodata).

Get multiply aligned sequences (gethmmalignment), hidden Markov model profiles (gethmmprof), and phylogenetic tree data (gethmmtree) from the PFAM database.

Gene Ontology database — Load the database from the Web into a gene ontology object (geneont). Select sections of the ontology with methods for the geneont object (geneont.getancestors, geneont.getdescendants, geneont.getmatrix, geneont.getrelatives), and manipulate data with utility functions (goannotread, num2goid).

Read data from instruments — Read data generated from gene sequencing instruments (scfread, joinseq, traceplot), mass spectrometers (jcampread), and Agilent® microarray scanners (agferead).

Reading data formats — The toolbox provides a number of functions for reading data from common bioinformatic file formats.

Writing data formats — The functions for getting data from the Web include the option to save the data to a file. In addition, a dedicated function writes sequence data to a file in the FASTA format (fastawrite).
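As a minimal sketch of a write-then-read round trip through a FASTA file, the following uses fastawrite and fastaread. The header text, the sequence, and the temporary file name are made up for illustration.

```matlab
% Write one sequence to a temporary FASTA file, then read it back.
fastaFile = [tempname '.fa'];            % hypothetical scratch file
fastawrite(fastaFile, 'demo_seq', 'ATGGCCAAGTAA');

record = fastaread(fastaFile);           % structure with Header and Sequence fields
disp(record.Header)
disp(record.Sequence)

delete(fastaFile);                       % clean up the scratch file
```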

BLAST searches — Request Web-based BLAST searches (blastncbi), get the results from a search (getblast) and read results from a previously saved BLAST formatted report file (blastread).

The MATLAB environment has built-in support for other industry-standard file formats including Microsoft® Excel® and comma-separated-value (CSV) files. Additional functions perform ASCII and low-level binary I/O, allowing you to develop custom functions for working with any data format.

Sequence Alignments

You can select from a list of analysis methods to compare nucleotide or amino acid sequences using pairwise or multiple sequence alignment functions.

Pairwise sequence alignment — Efficient implementations of standard algorithms such as the Needleman-Wunsch (nwalign) and Smith-Waterman (swalign) algorithms for pairwise sequence alignment. The toolbox also includes standard scoring matrices such as the PAM and BLOSUM families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.
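The sketch below shows how the global and local alignment functions might be called on two short amino acid sequences. The sequences are arbitrary examples, and the scores depend on the scoring matrix and gap penalties in effect, so no particular values are assumed here.

```matlab
% Two short amino acid sequences chosen only for illustration.
seq1 = 'VSPAGMASGYD';
seq2 = 'IPGKASYD';

% Global alignment (Needleman-Wunsch) with the default scoring settings.
[globalScore, globalAlignment] = nwalign(seq1, seq2);

% Local alignment (Smith-Waterman), here with an explicitly named matrix.
[localScore, localAlignment] = swalign(seq1, seq2, 'ScoringMatrix', 'BLOSUM62');

% Each alignment is a 3-row character array:
% sequence 1, match symbols, sequence 2.
disp(globalAlignment)
disp(localAlignment)
```

Calling showalignment(globalAlignment) opens the same result in a window with matches highlighted.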

Multiple sequence alignment — Functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (seqalignviewer) for viewing the results of a multiple sequence alignment and manually adjusting the alignment.

Multiple sequence profiles — Implementations for multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).

Biological codes — Look up the letters or numeric equivalents for commonly used biological codes (aminolookup, baselookup, geneticcode, revgeneticcode).

Sequence Utilities and Statistics

You can manipulate and analyze your sequences to gain a deeper understanding of the physical, chemical, and biological characteristics of your data. Use a graphical user interface (GUI) with many of the sequence functions in the toolbox (seqviewer).

Sequence conversion and manipulation — The toolbox provides routines for common operations, such as converting DNA or RNA sequences to amino acid sequences, that are basic to working with nucleic acid and protein sequences (aa2int, aa2nt, dna2rna, rna2dna, int2aa, int2nt, nt2aa, nt2int, seqcomplement, seqrcomplement, seqreverse).

You can manipulate your sequence by performing an in silico digestion with restriction endonucleases (restrict) and proteases (cleave).
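To make the conversion routines concrete, here is a small sketch that applies a few of them to one short DNA sequence. The sequence is invented for the example.

```matlab
% A short coding sequence: ATG GCC AAG TAA
dna = 'ATGGCCAAGTAA';

rna      = dna2rna(dna);          % replace T with U
codes    = nt2int(dna);           % nucleotide letters to integer codes
comp     = seqcomplement(dna);    % complement strand
revcomp  = seqrcomplement(dna);   % reverse complement strand
protein  = nt2aa(dna);            % translate using the standard genetic code

disp(rna)
disp(revcomp)
disp(protein)                     % the stop codon is shown as '*'
```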

Sequence statistics — Determine various statistics about a sequence (aacount, basecount, codoncount, dimercount, nmercount, ntdensity, codonbias, cpgisland, oligoprop), search for specific patterns within a sequence (seqshowwords, seqwordcount), or search for open reading frames (seqshoworfs). In addition, you can create random sequences for test cases (randseq).
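A brief sketch of the counting functions follows, using the same kind of made-up sequence. The GC fraction is computed by hand from the base counts rather than by a toolbox function.

```matlab
seq = 'ATGGCCAAGTAA';

bases  = basecount(seq);     % structure with fields A, C, G, T
codons = codoncount(seq);    % structure with one field per codon

% GC content derived from the base counts.
gcFraction = (bases.G + bases.C) / length(seq);

% A random DNA sequence of 30 bases, useful as test input.
testSeq = randseq(30);

fprintf('A=%d C=%d G=%d T=%d, GC fraction=%.3f\n', ...
    bases.A, bases.C, bases.G, bases.T, gcFraction);
```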

Sequence utilities — Determine a consensus sequence from a set of multiply aligned amino acid or nucleotide sequences (seqconsensus), or a sequence profile (seqprofile). Format a sequence for display (seqdisp) or graphically show a sequence alignment with frequency data (seqlogo).

Additional MATLAB functions efficiently handle string operations with regular expressions (regexp, seq2regexp) to look for specific patterns in a sequence and search through a library for string matches (seqmatch).

Look for possible cleavage sites in a DNA/RNA sequence by searching for palindromes (palindromes).

Protein Property Analysis

You can use a collection of protein analysis methods to extract information from your data. You can determine protein characteristics and simulate enzyme cleavage reactions. The toolbox provides functions to calculate various properties of a protein sequence, such as the atomic composition (atomiccomp), molecular weight (molweight), and isoelectric point (isoelectric). You can cleave a protein with an enzyme (cleave, rebasecuts) and create distance and Ramachandran plots for PDB data (pdbdistplot, ramachandran). The toolbox contains a graphical user interface for protein analysis (proteinplot) and plotting 3-D protein and other molecular structures with information from molecule model files, such as PDB files (molviewer).

Amino acid sequence utilities — Calculate amino acid statistics for a sequence (aacount) and get information about character codes (aminolookup).
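The property functions named above can be sketched on a short peptide as follows. The peptide is arbitrary, and the numeric results are not quoted here because they depend on the function defaults.

```matlab
% An arbitrary short peptide used only for illustration.
peptide = 'MAKVSPAGMASGYD';

counts = aacount(peptide);       % structure with one field per amino acid
atoms  = atomiccomp(peptide);    % structure with atom counts (C, H, N, O, S)
mw     = molweight(peptide);     % molecular weight
pI     = isoelectric(peptide);   % estimated isoelectric point

fprintf('Alanine count: %d, molecular weight: %.1f, pI: %.2f\n', ...
    counts.A, mw, pI);
```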

Phylogenetic Analysis

You can use functions for phylogenetic tree building and analysis. There is also a GUI to draw phylograms (trees).

Phylogenetic tree data — Read Newick-formatted tree files into the MATLAB Workspace as phylogenetic tree objects (phytreeread, phytree), and write tree objects back to Newick-formatted files (phytreewrite).

Create a phylogenetic tree — Calculate the pairwise distance between biological sequences (seqpdist), estimate the substitution rates (dnds, dndsml), build a phylogenetic tree from pairwise distances (seqlinkage, seqneighjoin, reroot), and view the tree in an interactive GUI that allows you to view, edit, and explore the data (phytreeviewer or view). This GUI also allows you to prune branches, reorder, rename, and explore distances.
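The distance-then-linkage workflow described above might look like the following sketch. The four sequences and their names are invented, and the p-distance method and average linkage are choices made for the example, not recommendations.

```matlab
% Four equal-length nucleotide sequences made up for this example.
seqs  = {'ACGTACGTAC', 'ACGTACGTTC', 'ACGAACGTTC', 'TCGAACGATC'};
names = {'s1', 's2', 's3', 's4'};

% Pairwise distances as a row vector with one entry per pair of sequences.
D = seqpdist(seqs, 'Method', 'p-distance', 'Alphabet', 'NT');

% Build a tree from the distances with average linkage (UPGMA).
tree = seqlinkage(D, 'average', names);

numLeaves = get(tree, 'NumLeaves');
newickStr = getnewickstr(tree);
disp(newickStr)
```

Calling plot(tree) or view(tree) draws the result; those calls are left out here because they open a figure or a GUI.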

Phylogenetic tree object methods — You can access the functionality of the phytreeviewer GUI using methods for a phylogenetic tree object (phytree). Get property values (get) and node names (getbyname). Calculate the patristic distances between pairs of leaf nodes (pdist, weights) and draw a phylogenetic tree object in a MATLAB Figure window as a phylogram, cladogram, or radial treeplot (plot). Manipulate tree data by selecting branches and leaves using a specified criterion (select, subtree) and removing nodes (prune). Compare trees (getcanonical) and use Newick-formatted strings (getnewickstr).

Microarray Data Analysis

The MATLAB environment is widely used for microarray data analysis, including reading, filtering, normalizing, and visualizing microarray data. However, the standard normalization and visualization tools that scientists use can be difficult to implement. The toolbox includes these standard functions:

Microarray data — Read Affymetrix GeneChip files (affyread) and plot data (probesetplot), ImaGene results files (imageneread), SPOT files (sptread) and Agilent microarray scanner files (agferead). Read GenePix GPR files (gprread) and GAL files (galread). Get Gene Expression Omnibus (GEO) data from the Web (getgeodata) and read GEO data from files (geosoftread).

A utility function (magetfield) extracts data from one of the microarray reader functions (gprread, agferead, sptread, imageneread).

Microarray normalization and filtering — The toolbox provides a number of methods for normalizing microarray data, such as lowess normalization (malowess), mean normalization (manorm), and quantile normalization across multiple arrays (quantilenorm). You can use filtering functions to clean raw data before analysis (geneentropyfilter, genelowvalfilter, generangefilter, genevarfilter), and calculate the range and variance of values (exprprofrange, exprprofvar).

Microarray visualization — The toolbox contains routines for visualizing microarray data. These routines include spatial plots of microarray data (maimage, redgreencmap), box plots (maboxplot), loglog plots (maloglog), and intensity-ratio plots (mairplot). You can also view clustered expression profiles (clustergram, redgreencmap). You can create 2-D scatter plots of principal components from the microarray data (mapcaplot).

Microarray utility functions — Use the following functions to work with Affymetrix GeneChip data sets. Get library information for a probe (probelibraryinfo), gene information from a probe set (probesetlookup), and probe set values from CEL and CDF information (probesetvalues). Show probe set information from NetAffx™ Analysis Center (probesetlink) and plot probe set values (probesetplot).

The toolbox accesses statistical routines to perform cluster analysis and to visualize the results, and you can view your data through statistical visualizations such as dendrograms and classification and regression trees.

Microarray Data Storage

The toolbox includes functions, objects, and methods for creating, storing, and accessing microarray data.

The object constructor function, DataMatrix, lets you create a DataMatrix object to encapsulate data and metadata from a microarray experiment. A DataMatrix object stores experimental data in a matrix, with rows typically corresponding to gene names or probe identifiers, and columns typically corresponding to sample identifiers. A DataMatrix object also stores metadata, including the gene names or probe identifiers (as the row names) and sample identifiers (as the column names).

You can reference microarray expression values in a DataMatrix object the same way you reference data in a MATLAB array, that is, by using linear or logical indexing. Alternatively, you can reference this experimental data by gene (probe) identifiers and sample identifiers. Indexing by these identifiers lets you quickly and conveniently access subsets of the data without having to maintain additional index arrays.
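As a rough sketch of both indexing styles, the example below builds a small DataMatrix object from made-up values. The gene and sample names are hypothetical, and the import line assumes the object lives in the bioma.data package.

```matlab
import bioma.data.*

% Made-up expression values: 3 genes (rows) by 2 samples (columns).
values = [1 2; 3 4; 5 6];
dm = DataMatrix(values, {'geneA', 'geneB', 'geneC'}, {'s1', 's2'});

% Positional indexing, as with an ordinary MATLAB array.
byPosition = double(dm(2, 2));

% Indexing by row (gene) and column (sample) names.
byName = double(dm('geneB', 's2'));

% A subset selected by several gene names at once.
subset = double(dm({'geneA', 'geneC'}, 's2'));

disp(dm.RowNames)
```

The calls to double convert the selected part of the object back into a plain numeric array.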

Many MATLAB operators and arithmetic functions are available to DataMatrix objects by means of methods. These methods let you modify, combine, compare, analyze, plot, and access information from DataMatrix objects. Additionally, you can easily extend the functionality by using general element-wise functions, dmarrayfun and dmbsxfun, and by manually accessing the properties of a DataMatrix object.

Mass Spectrometry Data Analysis

The mass spectrometry functions preprocess and classify raw data from SELDI-TOF and MALDI-TOF spectrometers and use statistical learning functions to identify patterns.

Reading raw data — Load raw mass/charge and ion intensity data from comma-separated-value (CSV) files, or read a JCAMP-DX-formatted file with mass spectrometry data (jcampread) into the MATLAB environment.

If your data is stored in TXT files, you can load it with the importdata function.

Preprocessing raw data — Resample high-resolution data to a lower resolution (msresample) where the extra data points are not needed. Correct the baseline (msbackadj). Align a spectrum to a set of reference masses (msalign) and visually verify the alignment (msheatmap). Normalize the area under the spectra so that they can be compared (msnorm), and filter out noise (mslowess and mssgolay).

Spectrum analysis — Load spectra into a GUI (msviewer) for selecting mass peaks and further analysis.

The following graphic illustrates the roles of the various mass spectrometry functions in the toolbox.

Graph Theory Functions

Graph theory functions in the toolbox apply basic graph theory algorithms to sparse matrices. A sparse matrix represents a graph: the nonzero entries in the matrix represent the edges of the graph, and the values of these entries represent the associated weight (cost, distance, length, or capacity) of each edge. Graph algorithms that use the weight information ignore an edge whose value is NaN or Inf. Graph algorithms that do not use the weight information still consider such an edge, because these algorithms look only at the connectivity described by the sparse matrix and not at the values stored in it.

Sparse matrices can represent four types of graphs:

  • Directed Graph — Sparse matrix, either double real or logical. Row (column) index indicates the source (target) of the edge. Self-loops (values in the diagonal) are allowed, although most of the algorithms ignore these values.

  • Undirected Graph — Lower triangle of a sparse matrix, either double real or logical. An algorithm expecting an undirected graph ignores values stored in the upper triangle of the sparse matrix and values in the diagonal.

  • Directed Acyclic Graph (DAG) — Sparse matrix, double real or logical, with zero values in the diagonal. While a zero-valued diagonal is a requirement of a DAG, it does not guarantee a DAG. An algorithm expecting a DAG will not test for cycles because this will add unwanted complexity.

  • Spanning Tree — Undirected graph with no cycles and with one connected component.

There are no attributes attached to the graphs; sparse matrices representing all four types of graphs can be passed to any graph algorithm. All functions will return an error on nonsquare sparse matrices.

Graph algorithms do not pretest for graph properties because such tests can introduce a time penalty. For example, there is an efficient shortest path algorithm for DAGs, but testing whether a graph is acyclic is expensive compared to the algorithm itself. Therefore, it is important to select a graph theory function and properties appropriate for the type of graph represented by your input matrix. If the algorithm receives a graph type that differs from what it expects, it will either:

  • Return an error when it reaches an inconsistency. For example, this happens if you pass a cyclic graph to the graphshortestpath function and specify Acyclic as the method property.

  • Produce an invalid result. For example, if you pass a directed graph to a function with an algorithm that expects an undirected graph, it will ignore values in the upper triangle of the sparse matrix.

The graph theory functions include graphallshortestpaths, graphconncomp, graphisdag, graphisomorphism, graphisspantree, graphmaxflow, graphminspantree, graphpred2path, graphshortestpath, graphtopoorder, and graphtraverse.
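The sparse-matrix conventions above can be sketched on a four-node graph. The edges and weights are invented, and the example assumes a release in which these graph functions are available.

```matlab
% Directed graph: row index = source node, column index = target node.
% Edges: 1->2 (weight 1), 1->3 (weight 4), 2->4 (weight 1), 3->4 (weight 1).
G = sparse([1 1 2 3], [2 3 4 4], [1 4 1 1], 4, 4);

% Check explicitly that the graph is acyclic; the algorithms do not pretest.
isDag = graphisdag(G);

% Shortest path from node 1 to node 4 with the default method.
[dist, path] = graphshortestpath(G, 1, 4);

% The same query using the method intended for DAGs.
[distDag, pathDag] = graphshortestpath(G, 1, 4, 'Method', 'Acyclic');

% Undirected version: keep only the lower triangle, as the
% undirected-graph algorithms expect.
UG = tril(G + G');
[spanTree, pred] = graphminspantree(UG);

fprintf('Shortest distance from 1 to 4: %g\n', dist);
```

Because no graph type is attached to the matrix, the Acyclic method is safe here only because graphisdag confirmed the property first.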

Graph Visualization

The toolbox includes functions, objects, and methods for creating, viewing, and manipulating graphs such as interactive maps, hierarchy plots, and pathways. This allows you to view relationships between data.

The object constructor function (biograph) lets you create a biograph object to hold graph data. Methods of the biograph object let you calculate the position of nodes (dolayout), draw the graph (view), get handles to the nodes and edges (getnodesbyid and getedgesbynodeid) to further query information, and find relations between the nodes (getancestors, getdescendants, and getrelatives). There are also methods that apply basic graph theory algorithms to the biograph object.

Various properties of a biograph object let you programmatically change the properties of the rendered graph. You can customize the node representation, for example, drawing pie charts inside every node (CustomNodeDrawFcn). Or you can associate your own callback functions to nodes and edges of the graph, for example, opening a Web page with more information about the nodes (NodeCallback and EdgeCallback).

Statistical Learning and Visualization

You can classify and identify features in data sets, set up cross-validation experiments, and compare different classification methods.

The toolbox provides functions that build on the classification and statistical learning tools in the Statistics and Machine Learning Toolbox™ software (classify, kmeans, and treefit).

These functions include imputation tools (knnimpute), and K-nearest neighbor classifiers (knnclassify).

Other functions include set up of cross-validation experiments (crossvalind) and comparison of the performance of different classification methods (classperf). In addition, there are tools for selecting diversity and discriminating features (rankfeatures, randfeatures).
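A cross-validation experiment along these lines could be sketched as follows. The one-dimensional data and the nearest-class-mean rule are stand-ins invented for the example, not toolbox classifiers. Only crossvalind and classperf come from the toolbox.

```matlab
% Made-up, well-separated 1-D data with two classes of 10 samples each.
x      = [1:10, 21:30]';
labels = [repmat({'a'}, 10, 1); repmat({'b'}, 10, 1)];

% Assign each observation to one of 5 folds, stratified by class label.
folds = crossvalind('Kfold', labels, 5);

% Performance tracker initialized with the true labels.
cp = classperf(labels);

for k = 1:5
    test  = (folds == k);
    train = ~test;

    % Hypothetical classifier: assign each test point to the nearer class mean.
    muA = mean(x(train & strcmp(labels, 'a')));
    muB = mean(x(train & strcmp(labels, 'b')));
    isA = abs(x(test) - muA) < abs(x(test) - muB);

    predicted = repmat({'b'}, nnz(test), 1);
    predicted(isA) = {'a'};

    % Accumulate the results of this fold.
    classperf(cp, predicted, test);
end

correctRate = cp.CorrectRate;
```

To compare classification methods, the same fold assignment can be reused with a second classperf object for the other classifier.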

Prototyping and Development Environment

The MATLAB environment lets you prototype and develop algorithms and easily compare alternatives.

  • Integrated environment — Explore biological data in an environment that integrates programming and visualization. Create reports and plots with the built-in functions for mathematics, graphics, and statistics.

  • Open environment — Access the source code for the toolbox functions. The toolbox includes many of the basic bioinformatics functions you will need to use, and it includes prototypes for some of the more advanced functions. Modify these functions to create your own custom solutions.

  • Interactive programming language — Test your ideas by typing functions that are interpreted interactively with a language whose basic data element is an array. The arrays do not require dimensioning and allow you to solve many technical computing problems.

    Using matrices for sequences or groups of sequences allows you to work efficiently and not worry about writing loops or other programming controls.

  • Programming tools — Use a visual debugger for algorithm development and refinement and an algorithm performance profiler to accelerate development.

Data Visualization

You can visually compare pairwise sequence alignments, multiply aligned sequences, gene expression data from microarrays, and plot nucleic acid and protein characteristics. The 2-D and volume visualization features let you create custom graphical representations of multidimensional data sets. You can also create montages and overlays, and export finished graphics to an Adobe® PostScript® image file or copy directly into Microsoft PowerPoint®.

Algorithm Sharing and Application Deployment

The open MATLAB environment lets you share your analysis solutions with other users, and it includes tools to create custom software applications. With the addition of MATLAB Compiler™ and MATLAB Compiler SDK™, you can create standalone applications independent of the MATLAB environment.

  • Share algorithms with other users — You can share data analysis algorithms created in the MATLAB language across all supported platforms by giving files to other users. You can also create GUIs within the MATLAB environment using the Graphical User Interface Development Environment (GUIDE).

  • Deploy MATLAB GUIs — Create a GUI within the MATLAB environment using GUIDE, and then use MATLAB Compiler software to create a standalone GUI application that runs separately from the MATLAB environment.

  • Create dynamic link libraries (DLLs) — Use MATLAB Compiler software to create DLLs for your functions, and then link these libraries to other programming environments such as C and C++.

  • Create COM objects — Use MATLAB Compiler SDK to create COM objects, and then use a COM-compatible programming environment, such as Visual Basic®, to create a standalone application.

  • Create Excel add-ins — Use MATLAB Compiler to create Excel add-in functions, and then use these functions with Excel spreadsheets.

  • Create Java® classes — Use MATLAB Compiler SDK to automatically generate Java classes from algorithms written in the MATLAB programming language. You can run these classes outside the MATLAB environment.
