Data Formats and Databases
The Bioinformatics Toolbox™ lets you access many of the databases on the web and other online data repositories. It lets you copy data into the MATLAB® workspace, and read and write to files with standard bioinformatic formats. It also reads many common genome file formats so that you do not have to write and maintain your own file readers.
Web-based databases — You can directly access public databases on the Web and copy sequence and gene expression information into the MATLAB environment.
The sequence databases currently supported are GenBank® (getgenbank), GenPept (getgenpept), European Molecular Biology Laboratory (EMBL) (getembl), and Protein Data Bank (PDB) (getpdb). You can also access data from the NCBI Gene Expression Omnibus (GEO) Web site by using a single function.
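As a minimal sketch of this kind of retrieval, the following pulls one GenBank record into the workspace and saves a copy to a file in the same call. The accession number M10051 and the output file name are illustrative choices, not values taken from this section, and the call assumes a working internet connection.

```matlab
% Retrieve a GenBank record into the MATLAB workspace and keep a local copy.
% 'M10051' is an illustrative accession number (an assumption of this sketch).
accession = 'M10051';

% Hypothetical output location for the saved record.
outFile = fullfile(tempdir, 'M10051_genbank.txt');

% 'ToFile' asks getgenbank to save the retrieved data as well as return it.
gbRecord = getgenbank(accession, 'ToFile', outFile);

% The result is a structure; Sequence holds the nucleotide sequence.
fprintf('Locus: %s, sequence length: %d\n', ...
    gbRecord.LocusName, length(gbRecord.Sequence));
```

The other retrieval functions named above follow the same pattern of passing an identifier and getting a structure back.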
Gene Ontology database — Load the database
from the Web into a gene ontology object (geneont). Select sections of the ontology with methods for the geneont object, such as getrelatives (geneont), and manipulate data with utility functions.
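A short sketch of that workflow might look like the following. The GO identifier 46680 is an arbitrary term picked for illustration, the variable names are made up for this example, and loading the live ontology assumes an internet connection.

```matlab
% Load the current Gene Ontology from the Web into a geneont object.
GO = geneont('LIVE', true);

% 46680 is an illustrative GO identifier chosen for this sketch.
termId = 46680;

% getrelatives returns the numeric IDs of terms related to termId.
relativeIds = getrelatives(GO, termId);

% Indexing the ontology with those IDs yields a smaller geneont object.
subOntology = GO(relativeIds);

% Print the ID and name of each term in the selected section.
terms = subOntology.terms;
for k = 1:numel(terms)
    fprintf('GO:%07d  %s\n', terms(k).id, terms(k).name);
end
```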
Reading data formats — The toolbox provides a number of functions for reading data from common bioinformatic file formats.
Multiply aligned sequences: ClustalW and GCG formats
Gene expression data from microarrays: Gene Expression Omnibus (GEO) data (geosoftread), GenePix® data in GPR files and GAL files (galread), SPOT data (sptread), Affymetrix® GeneChip® data (affyread), and ImaGene® results files
Hidden Markov model profiles: PFAM-HMM file
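To illustrate the reading functions, here is a hedged sketch that loads a GEO SOFT-format sample file with geosoftread. The file name GSM3258.txt is hypothetical and stands for a file you have already saved locally; the field names shown are the ones geosoftread returns for sample records.

```matlab
% Path to a GEO SOFT-format sample file saved earlier (hypothetical name).
softFile = 'GSM3258.txt';

if exist(softFile, 'file') == 2
    % Parse the file into a structure of metadata and measurements.
    geoData = geosoftread(softFile);

    fprintf('Accession: %s (%s)\n', geoData.Accession, geoData.Scope);
    disp(geoData.ColumnNames);   % names of the data columns
    disp(size(geoData.Data));    % rows and columns of the data table
else
    % Nothing to read; report it rather than raising an error.
    fprintf('File %s not found; nothing read.\n', softFile);
end
```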
Writing data formats — The functions for
getting data from the Web include the option to save the data to a file. In addition, there is a separate function for writing data to a file in the FASTA format.
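The FASTA round trip can be sketched as follows, assuming fastawrite is the writing function this passage refers to and using fastaread to check the result. The header and sequence are made-up values, and the file goes to a temporary location.

```matlab
% Made-up header and sequence for illustration only.
header   = 'example_sequence';
sequence = 'ATGGCGTACGTTAGC';

% Write to a fresh temporary file; fastawrite appends if the file exists.
outFile = [tempname, '.fasta'];
fastawrite(outFile, header, sequence);

% Read the file back to confirm what was written.
record = fastaread(outFile);
fprintf('Header: %s\nSequence: %s\n', record.Header, record.Sequence);

% Remove the temporary file.
delete(outFile);
```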
The MATLAB environment has built-in support for other industry-standard file formats including Microsoft® Excel® and comma-separated-value (CSV) files. Additional functions perform ASCII and low-level binary I/O, allowing you to develop custom functions for working with any data format.
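As a rough illustration of both routes, the sketch below saves a small matrix as CSV and as raw binary, then reads each file back. The matrix values and file names are invented for the example, and writematrix and readmatrix assume a reasonably recent MATLAB release (R2019a or later).

```matlab
% Small made-up matrix to store.
data = [1.5 2.5 3.5; 4.5 5.5 6.5];

% CSV route: high-level text I/O.
csvFile = [tempname, '.csv'];
writematrix(data, csvFile);
fromCsv = readmatrix(csvFile);

% Binary route: low-level I/O, values stored as doubles in column order.
binFile = [tempname, '.bin'];
fid = fopen(binFile, 'w');
fwrite(fid, data, 'double');
fclose(fid);

% Read the doubles back into a matrix of the original shape.
fid = fopen(binFile, 'r');
fromBin = fread(fid, size(data), 'double');
fclose(fid);

% Clean up the temporary files.
delete(csvFile);
delete(binFile);
```

The binary file carries no shape information, so a custom reader has to know or store the matrix dimensions separately.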