Bioinformatics Toolbox

Gene Ontology Enrichment in Microarray Data

This example shows how to enrich microarray gene expression data using the Gene Ontology relationships.


Gene Ontology is a controlled method for describing terms related to genes in any organism. As more gene data is obtained from organisms, it is annotated using Gene Ontology. Gene Ontology is made of three smaller ontologies or aspects: Molecular Function, Biological Process, and Cellular Component. Each of these ontologies contains terms that are organized in a directed acyclic graph with these three terms as the roots. The roots are the broadest terms relating to genes. Terms further away from the roots get more specific. For this example you will use microarray data from the Gene Expression Profile Analysis example to look at the significance of interesting genes and Gene Ontology terms that are associated with the microarray probes. More specifically, you will further investigate to determine if a set of genes that cluster together are also involved in a common molecular function.

Examples Using Gene Ontology Functions

The Gene Ontology database is loaded into a MATLAB® object using the Bioinformatics Toolbox geneont function.

GO = geneont('live',true); % this step takes a while
                 default_namespace: 'gene_ontology'
                    format_version: '1.0'
                      data_version: ''
                           version: ''
                              date: '19:01:2013 00:20'
                          saved_by: 'jenkins'
                 auto_generated_by: ''
                         subsetdef: {13x1 cell}
                            import: ''
                    synonymtypedef: ''
                           idspace: ''
    default_relationship_id_prefix: ''
                        id_mapping: ''
                            remark: 'cvs version: use data-version'
                           typeref: ''
                  unrecognized_tag: {2x2 cell}
                             Terms: [38883x1 geneont.term]

Every Gene Ontology term has an accession number which is a seven digit number preceded by the prefix 'GO:'. To look at a specific term you first create a sub-ontology by sub-scripting the object and then inspect the 'terms' property. A hash-table is implemented in the GO object for efficiently looking up term IDs.

GO(5840).terms % the ribosome Gene Ontology term
            id: 5840
          name: 'ribosome'
      ontology: 'cellular component'
    definition: '"An intracellular organelle, about 200 A in diameter, con...'
       comment: ''
       synonym: {4x2 cell}
          is_a: [3x1 double]
       part_of: [0x1 double]
      obsolete: 0

This term represents the 'ribosome', and it has fields for is_a and part_of. These fields represent the relationships between Gene Ontology terms. Gene Ontology terms can be seen as nodes in an acyclic graph. You can traverse such relationships with the methods getancestors, getdescendants, getrelatives, and getmatrix. For example, the getancestors method returns any less specific term than 'ribosome' (i.e., its parents in the graph).

ancestors = getancestors(GO,5840)
riboanc = GO(ancestors)
ancestors =


Gene Ontology object with 14 Terms.

To visualize these relationships, use the biograph function and the getmatrix method from the Bioinformatics Toolbox. The getmatrix method returns a square matrix of relationships of the given Gene Ontology object. This graph is sometimes called an 'induced' graph.

cm = getmatrix(riboanc);
BG = biograph(cm,get(riboanc.Terms,'name'))
Biograph object with 14 nodes and 21 edges.

Using Clustering to Select an Interesting Subset of Genes

To show how Gene Ontology information is useful, you will look at microarray data from the Gene Expression Profile Analysis example. This data has 6400 genes on the microarray that are involved with many different aspects of yeast gene expression. A small portion of these might show interesting behavior for this microarray experiment. You will use the Gene Ontology to better understand if and how these genes are related in the cell. The full yeast data can be found at the NCBI Website.

load yeastdata
whos yeastvalues genes
  Name                Size             Bytes  Class     Attributes

  genes            6400x1             806522  cell                
  yeastvalues      6400x7             358400  double              

The example "Gene Expression Profile Analysis" shows several ways to cluster the data from the experiment. In this example, K-means clustering is used to select a group of about 240 genes for study.

First, the data needs cleaning up. There are some empty spots on the chip, and some genes have missing values. In this example the empty spots are removed from the data set and the knnimpute function imputes the missing values (marked with NaNs).

% Remove data for empty spots
emptySpots = strcmp('EMPTY',genes);
yeastvalues = yeastvalues(~emptySpots,:);
genes = genes(~emptySpots);
fprintf('Number of genes after removing empty spots is %d.\n',numel(genes))

% Impute missing values
yeastvalues = knnimpute(yeastvalues);
Number of genes after removing empty spots is 6314.

Next, the function genelowvalfilter removes genes with low overall expression. In this example a fairly large value is used to filter the genes. The Gene Expression Profile Analysis example shows alternative filtering techniques.

mask = genelowvalfilter(yeastvalues,genes,'absval',log2(3.5));
highexpIdx = find(mask);
yeastvalueshighexp = yeastvalues(highexpIdx,:);
fprintf('Number of genes with high expression is %d.\n',numel(highexpIdx))
Number of genes with high expression is 613.

The kmeans function from the Statistics and Machine Learning Toolbox™ groups the data into four clusters using correlation as the distance metric.

rng default

[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist','corr', 'rep',20);
for c = 1:4
    plot(times,yeastvalueshighexp((cidx == c),:)');
    axis tight
suptitle('K-Means Clustering of Profiles');

The plots show four fairly different clusters. By looking at the centroids of the clusters you can see clearly how they differ.

for c = 1:4
    axis tight
    axis off    % turn off the axis
suptitle('Centroids of K-Means Clustering of Profiles');

The first cluster in the top left corner represents the genes that are up-regulated with their expression levels falling off a little in the final chip. The genes in this cluster will be the subset used for the remainder of this experiment.

clusterIdx = highexpIdx(cidx==1);
fprintf('Number of genes in the first cluster is %d.\n',numel(clusterIdx))
Number of genes in the first cluster is 140.

Getting Annotated Genes from the Saccharomyces Genome Database

Many Genome Projects interact with the Gene Ontology Consortium when annotating genes. Gene annotations for several organisms can be found at the Gene Ontology Website. In addition, annotations for individual organisms can be found at their respective websites (such as the Yeast Genome database). These annotations are updated frequently and are usually curated by members of the genome group for each organism. NCBI also has a collective list of gene annotations that relate to their Entrez Gene database. These annotation files consist of large lists of genes and their associated Gene Ontology terms. These files follow the structure defined by the Gene Ontology Consortium. The function goannotread will parse these uncompressed files and put the information into a MATLAB structure. The file yeastgenes.sgd was obtained from the Gene Ontology Annotation site.

For this analysis you will look at genes that are annotated as molecular function (i.e., the 'Aspect' field is set to 'F'). However, you could extend this analysis to see if the clustered genes are involved in common processes ('P') or are co-located in the same cell compartment ('C'). The fields that are of interest are the gene symbol and associated ID. In GO Annotation files, these have field names DB_Object_Symbol and GOid, respectively. Create a map for efficient search of the Gene Symbol. Observe that not every gene from the 6314 genes on the microarray is annotated. Note that data in public repositories is frequently curated and updated; therefore the results of this example might be slightly different when you use up-to-date datasets. It is also possible that you get warnings about invalid or obsolete IDs due to an outdated yeastgenes.sgd gene annotation file.

SGDann = goannotread('yeastgenes.sgd','Aspect','F',...

SGDmap = containers.Map();
for i = 1:numel(SGDann)
    key = SGDann(i).DB_Object_Symbol;
    if isKey(SGDmap,key)
        SGDmap(key) = [SGDmap(key) SGDann(i).GOid];
        SGDmap(key) = SGDann(i).GOid;

fprintf('Number of annotated genes related to molecular function is %d.\n',SGDmap.Count)
fprintf('Number of unique GO terms associated to annotated genes is %d.\n',numel(unique([SGDann.GOid])))
fprintf('Number of gene-GO term associations is %d.\n',numel(SGDann))
Number of annotated genes related to molecular function is 6381.
Number of unique GO terms associated to annotated genes is 1773.
Number of gene-GO term associations is 25924.

Counting Annotated Genes From the Microarray

For every term in the Gene Ontology, you count the following two items:

  1. The number of genes that are annotated to the term.

  2. The number of under- or overexpressed genes that are annotated to the term.

Based on this information, you can statistically determine how often Gene Ontology terms are over- or underrepresentated in the microarray experiment.

There are some cases where the gene sets are not accurately annotated. Therefore, you create these tallies by propagating to neighboring terms. By definition, more specific Gene Ontology terms are included in ancestor terms. Thus if a gene is annotated to a particular term, you may also want to increase the count for some or all of its ancester terms. Use getrelatives, getdescendants, or getancestors to test different propagation schemes. In this example, you use getrelatives to get ancestors and descendents of each term up to 1 level, which is the default, mostly to overcome an imprecisely annotated gene set.

m = GO.Terms(end).id;           % gets the last term id
geneschipcount = zeros(m,1);    % a vector of GO term counts for the entire chip.
genesclustercount = zeros(m,1); % a vector of GO term counts for interesting genes.
for i = 1:numel(genes)
    if isKey(SGDmap,genes{i})
        goid = getrelatives(GO,SGDmap(genes{i}));
        % update vector counts
        geneschipcount(goid) = geneschipcount(goid) + 1;
        if (any(i == clusterIdx))
           genesclustercount(goid) = genesclustercount(goid) + 1;

Looking at Probability of Gene Ontology Annotation

You can find the most significant annotated terms by looking at the probabilities that the terms are counted by chance. Use the hypergeometric probability distribution function (hygepdf), which calculates the statistical significance of having drawn a specific number of successes out of a total number of draws from a population. The calculated p-value is the probability of obtaining such test statistics, which is the tally counts of the interesting genes in this example. This function returns the p-value associated to each term, and you can create a list of the most significant GO terms by ordering the p-values.

pvalues = hygepdf(genesclustercount,max(geneschipcount),...
[dummy,idx] = sort(pvalues);

% create a report
report = sprintf('GO Term      p-val  counts  definition\n');
for i = 1:10
    term = idx(i);
    report = sprintf('%s%s\t%-1.4f\t%-d / %-d\t%s...\n', report, ...
                    char(num2goid(term)), pvalues(term),...
GO Term      p-val  counts  definition
GO:0000104	0.0196	1 / 1	Catalysis of the reaction: succinate + acceptor = fumarate ...
GO:0000156	0.0196	1 / 1	Responds to a phosphorelay sensor to initiate a change in c...
GO:0003995	0.0196	1 / 1	Catalysis of the reaction: acyl-CoA + acceptor = 2,3-dehydr...
GO:0004871	0.0196	1 / 1	Conveys a signal across a cell to trigger a change in cell ...
GO:0005057	0.0196	1 / 1	Conveys a signal from an upstream receptor or intracellular...
GO:0008177	0.0196	1 / 1	Catalysis of the reaction: succinate + ubiquinone = fumarat...
GO:0009046	0.0196	1 / 1	Catalysis of the cleavage of the D-alanyl-D-alanine bond in...
GO:0009917	0.0196	1 / 1	Catalysis of the removal of a C-5 double bond in the B ring...
GO:0009918	0.0196	1 / 1	Catalysis of the reaction: 5-dehydroepisterol = 24-methylen...
GO:0016166	0.0196	1 / 1	Catalysis of the dehydrogenation of phytoene to produce a c...

Further Analysis of the Most Significant Terms

You can use the methods described earlier in this example to find out more about the terms that appear high on this list.

First look at the ancestors of the top item on the list.

topItem = idx(1);
GO(topItem).terms % the most significant gene
topItemAncestors = getancestors(GO,topItem)
            id: 104
          name: 'succinate dehydrogenase activity'
      ontology: 'molecular function'
    definition: '"Catalysis of the reaction: succinate + acceptor = fumara...'
       comment: ''
       synonym: {11x2 cell}
          is_a: 16627
       part_of: [0x1 double]
      obsolete: 0

topItemAncestors =


Look at the list for the second item to see some of the same ancestors.

secondItem = idx(2);
GO(secondItem).terms % the second most significant gene
secondItemAncestors = getancestors(GO,secondItem)
            id: 156
          name: 'phosphorelay response regulator activity'
      ontology: 'molecular function'
    definition: '"Responds to a phosphorelay sensor to initiate a change i...'
       comment: ''
       synonym: {'narrow_synonym'  '"two-component response regulator activ...'}
          is_a: 4871
       part_of: [0x1 double]
      obsolete: 0

secondItemAncestors =


You can now build a sub-ontology that includes the ancestors of the ten (for example) most significant terms and visualize this using the biograph function.

subGO = GO(getancestors(GO,idx(1:10)))
[cm,acc,rels] = getmatrix(subGO);
BG = biograph(cm,get(subGO.Terms,'name'))
Gene Ontology object with 24 Terms.
Biograph object with 24 nodes and 26 edges.

Use the p-values, calculated before, to assign a color to the graph nodes. In this example an arbitrary color map is used, where bright red is the most significant and bright green is the least significant.

for i = 1:numel(acc)
    pval = pvalues(acc(i));
    color = [(1-pval).^(10),pval.^(1/10),0.3];
    BG.Nodes(i).Color = color;
    BG.Nodes(i).Label = num2str(acc(i)); % add info to datatips





[3] Gentleman, R. 'Basic GO Usage'. Bioconductor vignette May 16, 2005

[4] Gentleman, R. 'Using GO for Statistical Analyses'. Bioconductor vignette May 16, 2005