Documentation Center

  • Trial Software
  • Product Updates

Gene Ontology Enrichment in Microarray Data

This example shows how to enrich microarray gene expression data using the Gene Ontology relationships.

Introduction

Gene Ontology is a controlled method for describing terms related to genes in any organism. As more gene data is obtained from organisms, it is annotated using Gene Ontology. Gene Ontology is made of three smaller ontologies or aspects: Molecular Function, Biological Process, and Cellular Component. Each of these ontologies contains terms that are organized in a directed acyclic graph with these three terms as the roots. The roots are the broadest terms relating to genes. Terms further away from the roots get more specific. For this example you will use microarray data from the Gene Expression Profile Analysis example to look at the significance of interesting genes and Gene Ontology terms that are associated with the microarray probes. More specifically, you will further investigate to determine if a set of genes that cluster together are also involved in a common molecular function.

Examples Using Gene Ontology Functions

The Gene Ontology database is loaded into a MATLAB® object using the Bioinformatics Toolbox geneont function.

GO = geneont('live',true); % this step takes a while
get(GO)
                 default_namespace: 'gene_ontology'
                    format_version: '1.0'
                      data_version: ''
                           version: ''
                              date: '19:01:2013 00:20'
                          saved_by: 'jenkins'
                 auto_generated_by: ''
                         subsetdef: {13x1 cell}
                            import: ''
                    synonymtypedef: ''
                           idspace: ''
    default_relationship_id_prefix: ''
                        id_mapping: ''
                            remark: 'cvs version: use data-version'
                           typeref: ''
                  unrecognized_tag: {2x2 cell}
                             Terms: [38883x1 geneont.term]

Every Gene Ontology term has an accession number which is a seven digit number preceded by the prefix 'GO:'. To look at a specific term you first create a sub-ontology by sub-scripting the object and then inspect the 'terms' property. A hash-table is implemented in the GO object for efficiently looking up term IDs.

GO(5840).terms % the ribosome Gene Ontology term
            id: 5840
          name: 'ribosome'
      ontology: 'cellular component'
    definition: '"An intracellular organelle, about 200 A in diameter, con...'
       comment: ''
       synonym: {4x2 cell}
          is_a: [3x1 double]
       part_of: [0x1 double]
      obsolete: 0

This term represents the 'ribosome', and it has fields for is_a and part_of. These fields represent the relationships between Gene Ontology terms. Gene Ontology terms can be seen as nodes in an acyclic graph. You can traverse such relationships with the methods getancestors, getdescendants, getrelatives, and getmatrix. For example, the getancestors method returns any less specific term than 'ribosome' (i.e., its parents in the graph).

ancestors = getancestors(GO,5840)
riboanc = GO(ancestors)
ancestors =

        5575
        5622
        5623
        5737
        5840
       30529
       32991
       43226
       43228
       43229
       43232
       44424
       44444
       44464

Gene Ontology object with 14 Terms.

To visualize these relationships we use the biograph function and the getmatrix method from the Bioinformatics Toolbox. The getmatrix method returns a square matrix of relationships of the given Gene Ontology object. This graph is sometimes called an 'induced' graph.

cm = getmatrix(riboanc);
BG = biograph(cm,get(riboanc.Terms,'name'))
view(BG)
Biograph object with 14 nodes and 21 edges.

Using Clustering to Select an Interesting Subset of Genes

To show how Gene Ontology information is useful, we will look at microarray data from the Gene Expression Profile Analysis example. This data has 6400 genes on the microarray that are involved with many different aspects of yeast gene expression. A small portion of these might show interesting behavior for this microarray experiment. We will use the Gene Ontology to better understand if and how these genes are related in the cell. The full yeast data can be found at the NCBI Website.

load yeastdata
whos yeastvalues genes
  Name                Size             Bytes  Class     Attributes

  genes            6400x1             806522  cell                
  yeastvalues      6400x7             358400  double              

The example "Gene Expression Profile Analysis" shows several ways to cluster the data from the experiment. In this example, K-means clustering is used to select a group of about 240 genes for study.

First, the data needs cleaning up. There are some empty spots on the chip, and some genes have missing values. In this example the empty spots are removed from the data set and the knnimpute function imputes the missing values (marked with NaNs).

% Remove data for empty spots
emptySpots = strcmp('EMPTY',genes);
yeastvalues = yeastvalues(~emptySpots,:);
genes = genes(~emptySpots);
fprintf('Number of genes after removing empty spots is %d.\n',numel(genes))

% Impute missing values
yeastvalues = knnimpute(yeastvalues);
Number of genes after removing empty spots is 6314.

Next, the function genelowvalfilter removes genes with low overall expression. In this example a fairly large value is used to filter the genes. The Gene Expression Profile Analysis example shows alternative filtering techniques.

mask = genelowvalfilter(yeastvalues,genes,'absval',log2(3.5));
highexpIdx = find(mask);
yeastvalueshighexp = yeastvalues(highexpIdx,:);
fprintf('Number of genes with high expression is %d.\n',numel(highexpIdx))
Number of genes with high expression is 613.

The kmeans function from the Statistics Toolbox™ groups the data into four clusters using correlation as the distance metric.

rng('default');

[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist','corr', 'rep',20);
for c = 1:4
    subplot(2,2,c);
    plot(times,yeastvalueshighexp((cidx == c),:)');
    axis('tight');
end
suptitle('K-Means Clustering of Profiles');

The plots show four fairly different clusters. By looking at the centroids of the clusters you can see clearly how they differ.

figure
for c = 1:4
    subplot(2,2,c);
    plot(times,ctrs(c,:)','linewidth',4,'color','k');
    axis tight
    axis off    % turn off the axis
end
suptitle('Centroids of K-Means Clustering of Profiles');

The first cluster in the top left corner represents the genes that are up-regulated with their expression levels falling off a little in the final chip. The genes in this cluster will be the subset used for the remainder of this experiment.

clusterIdx = highexpIdx(cidx==1);
fprintf('Number of genes in the first cluster is %d.\n',numel(clusterIdx))
Number of genes in the first cluster is 140.

Getting Annotated Genes from the Saccharomyces Genome Database

Many Genome Projects interact with the Gene Ontology Consortium when annotating genes. Gene annotations for several organisms can be found at the Gene Ontology Website. In addition, annotations for individual organisms can be found at their respective websites (such as the Yeast Genome database). These annotations are updated frequently and are usually curated by members of the genome group for each organism. NCBI also has a collective list of gene annotations that relate to their Entrez Gene database. These annotation files consist of large lists of genes and their associated Gene Ontology terms. These files follow the structure defined by the Gene Ontology Consortium. The function goannotread will parse these uncompressed files and put the information into a MATLAB structure. The file yeastgenes.sgd was obtained from the Gene Ontology Annotation site.

For this analysis you will look at genes that are annotated as molecular function (i.e., the 'Aspect' field is set to 'F'). However, you could extend this analysis to see if the clustered genes are involved in common processes ('P') or are co-located in the same cell compartment ('C'). The fields that are of interest are the gene symbol and associated ID. In GO Annotation files, these have field names DB_Object_Symbol and GOid, respectively. Create a map for efficient search of the Gene Symbol. Observe that not every gene from the 6314 genes on the microarray is annotated. Note that data in public repositories is frequently curated and updated; therefore the results of this example might be slightly different when you use up-to-date datasets. It is also possible that you get warnings about invalid or obsolete IDs due to an updated yeastgenes.sgd gene annotation file.

SGDann = goannotread('yeastgenes.sgd','Aspect','F',...
                     'Fields',{'DB_Object_Symbol','GOid'});

SGDmap = containers.Map();
for i=1:numel(SGDann)
    key = SGDann(i).DB_Object_Symbol;
    if isKey(SGDmap,key)
        SGDmap(key) = [SGDmap(key) SGDann(i).GOid];
    else
        SGDmap(key) = SGDann(i).GOid;
    end
end

fprintf('Number of annotated genes related to molecular function is %d.\n',SGDmap.Count)
fprintf('Number of unique GO terms associated to annotated genes is %d.\n',numel(unique([SGDann.GOid])))
fprintf('Number of gene-GO term associations is %d.\n',numel(SGDann))
Number of annotated genes related to molecular function is 6381.
Number of unique GO terms associated to annotated genes is 1773.
Number of gene-GO term associations is 25924.

Counting Annotated Genes From the Microarray

We will keep a tally of the number of genes that are annotated for every Gene Ontology term. At the same time, we will keep track of how many Gene Ontology terms have interesting genes associated with them from above. You can use this information to determine how often Gene Ontology terms are represented in the microarray experiment.

m = GO.Terms(end).id;           % gets the last term id
geneschipcount = zeros(m,1);    % a vector of GO term counts for the entire chip.
genesclustercount = zeros(m,1); % a vector of GO term counts for interesting genes.
for i = 1:numel(genes)
    if isKey(SGDmap,genes{i})
        goid = getrelatives(GO,SGDmap(genes{i}));
        % update vector counts
        geneschipcount(goid) = geneschipcount(goid) + 1;
        if (any(i == clusterIdx))
           genesclustercount(goid) = genesclustercount(goid) + 1;
        end
    end
end

Looking at Probability of Gene Ontology Annotation

You can find the most significant annotated terms by looking at the probabilities that the terms are counted by chance. For this you can use the hypergeometric probability distribution function (hygepdf). This function returns the p-value associated to each term, you can create a list of the most significant GO terms by ordering the p-values.

pvalues = hygepdf(genesclustercount,max(geneschipcount),...
                  max(genesclustercount),geneschipcount);
[dummy idx] = sort(pvalues);

% create a report
report = sprintf('GO Term      p-val  counts  definition\n');
for i = 1:10
    term = idx(i);
    report = sprintf('%s%s\t%-1.4f\t%-d / %-d\t%s...\n', report, ...
                    char(num2goid(term)), pvalues(term),...
                    genesclustercount(term),geneschipcount(term),...
                    GO(term).Term.definition(2:min(end,60)));
end
disp(report);
GO Term      p-val  counts  definition
GO:0000104	0.0196	1 / 1	Catalysis of the reaction: succinate + acceptor = fumarate ...
GO:0000156	0.0196	1 / 1	Responds to a phosphorelay sensor to initiate a change in c...
GO:0003995	0.0196	1 / 1	Catalysis of the reaction: acyl-CoA + acceptor = 2,3-dehydr...
GO:0004871	0.0196	1 / 1	Conveys a signal across a cell to trigger a change in cell ...
GO:0005057	0.0196	1 / 1	Conveys a signal from an upstream receptor or intracellular...
GO:0008177	0.0196	1 / 1	Catalysis of the reaction: succinate + ubiquinone = fumarat...
GO:0009046	0.0196	1 / 1	Catalysis of the cleavage of the D-alanyl-D-alanine bond in...
GO:0009917	0.0196	1 / 1	Catalysis of the removal of a C-5 double bond in the B ring...
GO:0009918	0.0196	1 / 1	Catalysis of the reaction: 5-dehydroepisterol = 24-methylen...
GO:0016166	0.0196	1 / 1	Catalysis of the dehydrogenation of phytoene to produce a c...

Further Analysis of the Most Significant Terms

You can use the methods described earlier in this example to find out more about the terms that appear high on this list.

First look at the ancestors of the top item on the list.

topItem = idx(1);
GO(topItem).terms % the most significant gene
topItemAncestors = getancestors(GO,topItem)
            id: 104
          name: 'succinate dehydrogenase activity'
      ontology: 'molecular function'
    definition: '"Catalysis of the reaction: succinate + acceptor = fumara...'
       comment: ''
       synonym: {11x2 cell}
          is_a: 16627
       part_of: [0x1 double]
      obsolete: 0


topItemAncestors =

         104
        3674
        3824
       16491
       16627

Note that the term 16616 appears as one of the ancestors. If you look at the list for the second item, you will see many of the same ancestors.

secondItem = idx(2);
GO(secondItem).terms % the second most significant gene
secondItemAncestors = getancestors(GO,secondItem)
            id: 156
          name: 'phosphorelay response regulator activity'
      ontology: 'molecular function'
    definition: '"Responds to a phosphorelay sensor to initiate a change i...'
       comment: ''
       synonym: {1x2 cell}
          is_a: 4871
       part_of: [0x1 double]
      obsolete: 0


secondItemAncestors =

         156
        3674
        4871
       60089

You can now build a sub-ontology that includes the ancestors of the ten (for example) most significant terms and visualize this using the biograph function.

subGO = GO(getancestors(GO,idx(1:10)))
[cm acc rels] = getmatrix(subGO);
BG = biograph(cm,get(subGO.Terms,'name'))
Gene Ontology object with 24 Terms.
Biograph object with 24 nodes and 26 edges.

Use the p-values, calculated before, to assign a color to the graph nodes. In this example an arbitrary color map is used, where bright red is the most significant and bright green is the least significant.

for i=1:numel(acc)
    pval = pvalues(acc(i));
    color = [(1-pval).^(10),pval.^(1/10),0.3];
    set(BG.Nodes(i),'Color',color);
    set(BG.Nodes(i),'Label',num2str(acc(i))) % add info to datatips
end

view(BG);

References

[1] http://www.geneontology.org/

[2] http://www.yeastgenome.org/

[3] Gentleman, R. 'Basic GO Usage'. Bioconductor vignette May 16, 2005 http://bioconductor.org/docs/vignettes.html

[4] Gentleman, R. 'Using GO for Statistical Analyses'. Bioconductor vignette May 16, 2005 http://bioconductor.org/docs/vignettes.html

Provide feedback for this example.

Was this topic helpful?