Skip to Main Content Skip to Search
Accelerating the pace of engineering and science

 

Exploring Gene Expression Data

This demonstration uses MATLAB® and Bioinformatics Toolbox™ functions to identify differentially expressed genes, and then uses Gene Ontology to determine the biological functions of the differentially expressed genes.

Contents

Introduction

Microarrays contain oligonucleotide or cDNA probes for comparing the expression profile of genes on a genomic scale. Determining if changes in gene expression are statistically significant between different conditions, e.g. two different tumor types, and determining the biological function of the differentially expressed genes, are important aims in a microarray experiment.

A publicly available dataset containing gene expression data of 42 tumor tissues of the embryonal central nervous system (CNS, Pomeroy et al. 2002) is used for this demonstration. The samples were hybridized on Affymetrix® HuGeneFL GeneChip® arrays.

The CNS dataset (CEL files) is available at the CNS experiment web site. The 42 tumor tissue samples include 10 medulloblastomas, 10 rhabdoid, 10 malignant glioma, 8 supratentorial PNETS, and 4 normal human cerebella. The CNS raw dataset was preprocessed with the Robust Multi-array Average (RMA) and GC Robust Multi-array Average (GCRMA) procedures. For further information on Affymetrix oligonucleotide microarray preprocessing, see Preprocessing Affymetrix Microarray Data at the Probe Level.

You will use the t-test and false discovery rate to detect differentially expressed genes between two of the tumor types. Additionally, you will look at Gene Ontology terms related to the significantly up-regulated genes.

Loading the Expression Data

Load the MAT file cnsexpressiondata containing three DataMatrix objects. Gene expression values preprocessed by RMA and GCRMA (MLE and EB) procedures are stored in the DataMatrix objects expr_cns_rma, expr_cns_gcrma_mle, and expr_cns_gcrma_eb respectively.

load cnsexpressiondata

In each DataMatrix object, each row corresponds to a probe set on the HuGeneFl array, and each column corresponds to a sample. The row names are the probe set IDs and column names are the sample names. The DataMatrix object expr_cns_gcrma_eb will be used in this demonstration. You can use either of the other two expression variables as well.

You can get the properties of the DataMatrix object expr_cns_gcrma_eb using the get command.

get(expr_cns_gcrma_eb)
            Name: ''
        RowNames: {7129x1 cell}
        ColNames: {1x42 cell}
           NRows: 7129
           NCols: 42
           NDims: 2
    ElementClass: 'single'

Determine the number of genes and number of samples in the DataMatrix object expr_cns_gcrma_eb.

[nGenes, nSamples] = size(expr_cns_gcrma_eb)
nGenes =

        7129


nSamples =

    42

You can use gene symbols instead of the probe set IDs to label the expression values. The gene symbols for the HuGeneFl array are provided in a MAT file containing a Map object.

load HuGeneFL_GeneSymbol_Map;

Create a cell array of gene symbols for the expression values from the hu6800GeneSymbolMap object.

huGenes = values(hu6800GeneSymbolMap, expr_cns_gcrma_eb.RowNames);

Set the row names of the exprs_cns_gcrma_eb to gene symbols using the rownames method of the DataMatrix object.

expr_cns_gcrma_eb = rownames(expr_cns_gcrma_eb, ':', huGenes);

Filtering the Expression Data

Remove gene expression data with empty gene symbols. In the example, the empty symbols are labeled as '---'.

expr_cns_gcrma_eb('---', :) = [];

Many of the genes in this study are not expressed, or have only small variability across the samples. Remove these genes using non-specific filtering.

Use genelowvalfilter to filter out genes with very low absolute expression values.

[mask, expr_cns_gcrma_eb] = genelowvalfilter(expr_cns_gcrma_eb);

Use genevarfilter to filter out genes with a small variance across samples.

[mask, expr_cns_gcrma_eb] = genevarfilter(expr_cns_gcrma_eb);

Determine the number of genes after filtering.

nGenes = expr_cns_gcrma_eb.NRows
nGenes =

        5669

Identifying Differential Gene Expression

You can now compare the gene expression values between two groups of data: CNS medulloblastomas (MD) and non-neuronal origin malignant gliomas (Mglio) tumor.

From the expression data of all 42 samples, extract the data of the 10 MD samples and the 10 Mglio samples.

MDs = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MD', 8);
Mglios = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MGlio', 11);
MDData = expr_cns_gcrma_eb(:, MDs);
get(MDData)
            Name: ''
        RowNames: {5669x1 cell}
        ColNames: {1x10 cell}
           NRows: 5669
           NCols: 10
           NDims: 2
    ElementClass: 'single'

MglioData = expr_cns_gcrma_eb(:, Mglios);
get(MglioData)
            Name: ''
        RowNames: {5669x1 cell}
        ColNames: {1x10 cell}
           NRows: 5669
           NCols: 10
           NDims: 2
    ElementClass: 'single'

A standard statistical test for detecting significant changes between the measurement of a variable in two groups is the t-test. Conduct a t-test for each gene to identify significant changes in expression values between the MD samples and Mglio samples. You can inspect the test results from the normal quantile plot of t-scores and the histograms of t-scores and p-values of the t-tests.

[pvalues, tscores] = mattest(MDData, MglioData,...
                                'Showhist', true', 'Showplot', true);
A normal quantile plot of results from a t-test in gene expression analysis. Histograms of t-scores and p-values of the t-tests performed as part of gene expression analysis.

In any test situation, two types of errors can occur, a false positive by declaring that a gene is differentially expressed when it is not, and a false negative when the test fails to identify a truly differentially expressed gene. In multiple hypothesis testing, which simultaneously tests the null hypothesis of thousands of genes using microarray expression data, each test has a specific false positive rate, or a false discovery rate (FDR). False discovery rate is defined as the expected ratio of the number of false positives to the total number of positive calls in a differential expression analysis between two groups of samples (Storey et al., 2003).

In this demonstration, you will compute the FDR using the Storey-Tibshirani procedure (Storey et al., 2003). The procedure also computes the q-value of a test, which measures the minimum FDR that occurs when calling the test significant. The estimation of FDR depends on the truly null distribution of the multiple tests, which is unknown. Permutation methods can be used to estimate the truly null distribution of the test statistics by permuting the columns of the gene expression data matrix (Storey et al., 2003, Dudoit et al., 2003). Depending on the sample size, it may not be feasible to consider all possible permutations. Usually a random subset of permutations are considered in the case of large sample size. Use the nchoosek function in Statistics Toolbox™ to find out the number of all possible permutations of the samples in this demonstration.

all_possible_perms = nchoosek(1:MDData.NCols+MglioData.NCols, MDData.NCols);
size(all_possible_perms, 1)
ans =

      184756

Perform a permutation t-test using mattest and the PERMUTE option to compute the p-values of 10,000 permutations by permuting the columns of the gene expression data matrix of MDData and MglioData (Dudoit et al., 2003).

pvaluesCorr = mattest(MDData, MglioData, 'Permute', 10000);

Determine the number of genes considered to have statistical significance at the p-value cutoff of 0.05. Note: You may get a different number of genes due to the permutation test outcome.

cutoff = 0.05;
sum(pvaluesCorr < cutoff)
ans =

        2125

Estimate the FDR and q-values for each test using mafdr. The quantity pi0 is the overall proportion of true null hypotheses in the study. It is estimated from the simulated null distribution via bootstrap or the cubic polynomial fit. Note: You can also manually set the value of lambda for estimating pi0.

figure;
[pFDR, qvalues] = mafdr(pvaluesCorr, 'showplot', true);
Estimated q-values and overall proportion of true null hypotheses in the gene expression analysis study.

Determine the number of genes that have q-values less than the cutoff value. Note: You may get a different number of genes due to the permutation test and the bootstrap outcomes.

sum(qvalues < cutoff)
ans =

        2134

Many genes with low FDR implies that the two groups, MD and Mglio, are biologically distinct.

You can also empirically estimate the FDR adjusted p-values using the Benjamini-Hochberg (BH) procedure (Benjamini et al, 1995) by setting the mafdr input parameter BHFDR to true.

pvaluesBH = mafdr(pvaluesCorr, 'BHFDR', true);
sum(pvaluesBH < cutoff)
ans =

        1375

You can store the t-scores, p-values, pFDRs, q-values and BH FDR corrected p-values together as a DataMatrix object.

testResults = [tscores pvaluesCorr pFDR qvalues pvaluesBH];

Update the column name for BH FDR corrected p-values using the colnames method of DataMatrix object.

testResults = colnames(testResults, 5, {'FDR_BH'});

You can sort by p-values pvaluesCorr using the sortrows mathod.

testResults = sortrows(testResults, 2);

Display the first 23 genes in testResults. Note: Your results may be different from those shown below due to the permutation test and the bootstrap outcomes.

testResults(1:23, :)
ans =

                t-scores    p-values       FDR            q-values
    RAB31       -13.664     1.2256e-008    2.5806e-005     2.572e-005
    PLEC1       -9.6223     5.6687e-008    5.9681e-005     2.572e-005
    HNRPA1        9.359     7.5912e-008    5.3281e-005     2.572e-005
    FCGR2A      -9.3548     7.6023e-008    4.0019e-005     2.572e-005
    PLEC1       -9.3495     7.6167e-008    3.2076e-005     2.572e-005
    FBL          9.1518     1.1161e-007    3.9166e-005     2.572e-005
    KIAA0367     -8.996     1.2581e-007    3.7845e-005     2.572e-005
    ID2B        -8.9285      1.307e-007    3.4402e-005     2.572e-005
    RBMX         8.8905     1.3305e-007    3.1128e-005     2.572e-005
    PAFAH1B3     8.7561     1.4024e-007    2.9529e-005     2.572e-005
    H3F3A        8.6512     1.4627e-007    2.7999e-005     2.572e-005
    LRP1        -8.6465     1.4658e-007     2.572e-005     2.572e-005
    PEA15       -8.3256     1.8957e-007    3.0704e-005    3.0704e-005
    ID2B        -8.1183      2.281e-007    3.4307e-005    3.2051e-005
    SFRS3        8.1166     2.2833e-007    3.2051e-005    3.2051e-005
    HLA-DPA1    -7.8546     3.6101e-007    4.7509e-005    4.7509e-005
    C5orf13      7.7195     4.8936e-007    6.0612e-005     5.827e-005
    PTMA         7.7013     4.9812e-007     5.827e-005     5.827e-005
    NAP1L1        7.674     5.3726e-007    5.9541e-005    5.8442e-005
    HMGB2        7.6532      5.551e-007    5.8442e-005    5.8442e-005
    ARAF        -7.5549     6.1454e-007    6.1618e-005     5.929e-005
    PTPRZ1      -7.5352     6.1948e-007     5.929e-005     5.929e-005
    SPARCL1     -7.3639     8.1963e-007    7.5036e-005    7.3354e-005


                FDR_BH
    RAB31       6.9245e-005
    PLEC1       6.9245e-005
    HNRPA1      6.9245e-005
    FCGR2A      6.9245e-005
    PLEC1       6.9245e-005
    FBL         6.9245e-005
    KIAA0367    6.9245e-005
    ID2B        6.9245e-005
    RBMX        6.9245e-005
    PAFAH1B3    6.9245e-005
    H3F3A       6.9245e-005
    LRP1        6.9245e-005
    PEA15       8.2665e-005
    ID2B        8.6292e-005
    SFRS3       8.6292e-005
    HLA-DPA1     0.00012791
    C5orf13      0.00015688
    PTMA         0.00015688
    NAP1L1       0.00015734
    HMGB2        0.00015734
    ARAF         0.00015963
    PTPRZ1       0.00015963
    SPARCL1      0.00019749

A gene is considered to be differentially expressed between the two groups of samples if it shows both statistical and biological significance. In this demonstration, you will compare the gene expression ratio of MD over Mglio tumor samples. Therefore an up-regulated gene in this demo has higher expression in MD and down-regulate gene has higher expression in Mglio.

Plot the –log10 of p-values against the biological effect in a volcano plot. Note: From the volcano plot UI, you can interactively change the p-value cutoff and fold change limit, and export differentially expressed genes.

diffStruct = mavolcanoplot(MDData, MglioData, pvaluesCorr)
diffStruct =

           Name: 'Differentially Expressed'
       PVCutoff: 0.0500
    FCThreshold: 2
     GeneLabels: {327x1 cell}
        PValues: [327x1 bioma.data.DataMatrix]
    FoldChanges: [327x1 bioma.data.DataMatrix]

A volcano plot plotting -log10 of the p-values aainst the biological effect is a useful visualization tool in gene expression analysis.

Ctrl-click genes in the gene lists to label the genes in the plot. As seen in the volcano plot, genes specific for neuronal based cerebella granule cells, such as ZIC and NEUROD, are found in the up-regulated gene list, while genes typical of the astrocytic and oligodendrocytic lineage and cell differentiation, such as SOX2, PEA15, and ID2B, are found in the down-regulated list.

Determine the number of differentially expressed genes.

nDiffGenes = diffStruct.PValues.NRows
nDiffGenes =

   327

Get the list of up-regulated genes for MD compared to Mglio.

up_geneidx = find(diffStruct.FoldChanges > 0);
up_genes = rownames(diffStruct.FoldChanges, up_geneidx);
nUpGenes = length(up_geneidx)
nUpGenes =

   225

Determine the number of down-regulated genes for MD compared to Mglio.

nDownGenes = sum(diffStruct.FoldChanges < 0)
nDownGenes =

   102

Annotating Up-Regulated Genes Using Gene Ontology

Use Gene Ontology (GO) to annotate the differentially expressed genes. You can look at the up-regulated genes from the analysis above. Download the Homo sapiens annotations (gene_association.goa_human.gz file) from Gene Ontology Current Annotations, unzip, and store it in your the current directory.

Find the indices of the up-regulated genes for Gene Ontology analysis.

huGenes = rownames(expr_cns_gcrma_eb);
for i = 1:nUpGenes
    up_geneidx(i) = find(strncmpi(huGenes, up_genes{i}, length(up_genes
{i})), 1);
end

Load the Gene Ontology database into a MATLAB object using the geneont function.

GO = geneont('live',true);

Read the Homo sapiens gene annotation file. For this demonstration, you will look only at genes that are related to molecular function, so you only need to read the information where the Aspect field is set to 'F'. The fields that are of interest are the gene symbol and associated ID. In GO Annotation files these have field names DB_Object_Symbol and GOid respectively.

HGann = goannotread('gene_association.goa_human',...
    'Aspect','F','Fields',{'DB_Object_Symbol','GOid'});

Create a map between annotated genes and GO terms.

HGmap = containers.Map();
for i=1:numel(HGann)
    key = HGann(i).DB_Object_Symbol;
    if isKey(HGmap,key)
        HGmap(key) = [HGmap(key) HGann(i).GOid];
    else
        HGmap(key) = HGann(i).GOid;
    end
end

Determine the number of Homo sapiens annotated genes related to molecular function.

HGmap.Count
ans =

                15561

Not all of the 5758 genes on the HuGeneFL chip are annotated. For every gene on the chip, see if it is annotated by comparing its gene symbol to the list of gene symbols from GO. Track the number of annotated genes and the number of up-regulated genes associated with each GO term. Note: You might get warnings about invalid or obsolete IDs due to the frequent update to the Homo sapiens gene annotation file.

m = GO.Terms(end).id;        % gets the last term id
chipgenesCount = zeros(m,1); % a vector of GO term counts for the entire chi
p.
upgenesCount  = zeros(m,1);  % a vector of GO term counts for up-regulated g
enes.
for i = 1:length(huGenes)
    if isKey(HGmap,huGenes{i})
        goid = getrelatives(GO,HGmap(huGenes{i}));
        % Update the tally
        chipgenesCount(goid) = chipgenesCount(goid) + 1;
        if (any(i == up_geneidx))
            upgenesCount(goid) = upgenesCount(goid) +1;
        end
    end
end

Determine the statistically significant GO terms using the hypergeometric probability distribution. For each GO term, a p-value is calculated representing the probability that the number of annotated genes associated with it could have been found by chance.

gopvalues = hygepdf(upgenesCount,max(chipgenesCount),...
                        max(upgenesCount),chipgenesCount);
[dummy, idx] = sort(gopvalues);

report = sprintf('GO Term     p-value     counts      definition\n');
for i = 1:10
    term = idx(i);
    report = sprintf('%s%s\t%-1.5f\t%3d / %3d\t%s...\n',...
             report, char(num2goid(term)), gopvalues(term),...
             upgenesCount(term), chipgenesCount(term),...
            GO(term).Term.definition(2:min(50,end)));
end
disp(report);
GO Term     p-value     counts      definition
GO:0005515    0.00000    119 / 2984    Interacting selectively with any prot
ein or prote...
GO:0003735    0.00000     56 / 151    The action of a molecule that contribu
tes to the ...
GO:0019843    0.00000     52 / 214    Interacting selectively with ribosomal
 RNA." [GOC...
GO:0003723    0.00000     59 / 299    Interacting selectively with an RNA mo
lecule or a...
GO:0003729    0.00000     50 / 225    Interacting selectively with pre-messe
nger RNA (p...
GO:0008312    0.00000     48 / 208    Interacting selectively with 7S RNA, t
he RNA comp...
GO:0000049    0.00000     47 / 211    Interacting selectively with transfer 
RNA." [GOC:...
GO:0000498    0.00000     46 / 204    Interacting selectively with ribonucle
ic acid (RN...
GO:0030515    0.00000     46 / 204    Interacting selectively with small nuc
leolar RNA....
GO:0033204    0.00000     46 / 204    Interacting selectively with the RNA s
ubunit of r...

Inspect the significant GO terms and select the terms related to specific molecule functions to build a sub-ontology that includes the ancestors of the terms. Visualize this ontology using the biograph function. You can also color the graphs nodes. In this example, the red nodes are the most significant, while the blue nodes are the least significant gene ontology terms. Note: The GO terms returned may differ from those shown due to the frequent update to the Homo sapiens gene annotation file.

fcnAncestors = GO(getancestors(GO,idx(1:5)))
[cm acc rels] = getmatrix(fcnAncestors);
BG = biograph(cm,get(fcnAncestors.Terms,'name'))

for i=1:numel(acc)
    pval = gopvalues(acc(i));
    color = [(1-pval).^(1) pval.^(1/8) pval.^(1/8)];
    set(BG.Nodes(i),'Color',color);
end
view(BG)
Gene Ontology object with 9 Terms.
Biograph object with 9 nodes and 8 edges.
Visualization of the  ontology using the biograph function.

Finding the Differentially Expressed Genes in Pathways

You can query the pathway information of the differentially expressed genes from the KEGG pathway database through KEGG's SOAP Web Service (For more information, see Connecting to the KEGG API Web Service), or by simply passing the list of gene symbols to KEGG's Web query tool.

Following are a few pathway maps with the genes in the up-regulated gene list highlighted:

Cell Cycle

Cell Communication

Hedgehog Signaling pathway

mTor Signaling pathway

References

[1] Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., Black, P.M., Lau, C., Allen, J.C., Zagzag, D., Olson, J.M., Curran, T., Wetmore, C., Biegel, J.A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, DN, Mesirov, J.P., Lander, E.S., and Golub, T.R. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436-442.

[2] Storey, J.D., and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc.Nat.Acad.Sci., 100(16), 9440-9445.

[3] Dudoit, S., Shaffer, J.P., and Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiment. Statistical Science, 18, 71-103.

[4] Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc., B 57, 289-300.

Suggest an enhancement for this demo.

Free Computational Biology Interactive Kit

See how to analyze, visualize, and model biological data and systems using MathWorks products.

Get free kit

Trials Available

Try the latest computational biology products.

Get trial software
Contact sales
Free technical kit
Trial software

Get Pricing and
Licensing Options