Exploring Gene Expression Data
This demonstration uses MATLAB® and Bioinformatics Toolbox™ functions to identify differentially expressed genes, and then uses Gene Ontology to determine the biological functions of the differentially expressed genes.
Contents
Introduction
Microarrays contain oligonucleotide or cDNA probes for comparing the expression profile of genes on a genomic scale. Determining if changes in gene expression are statistically significant between different conditions, e.g. two different tumor types, and determining the biological function of the differentially expressed genes, are important aims in a microarray experiment.
A publicly available dataset containing gene expression data of 42 tumor tissues of the embryonal central nervous system (CNS, Pomeroy et al. 2002) is used for this demonstration. The samples were hybridized on Affymetrix® HuGeneFL GeneChip® arrays.
The CNS dataset (CEL files) is available at the CNS experiment web site. The 42 tumor tissue samples include 10 medulloblastomas, 10 rhabdoid, 10 malignant glioma, 8 supratentorial PNETS, and 4 normal human cerebella. The CNS raw dataset was preprocessed with the Robust Multi-array Average (RMA) and GC Robust Multi-array Average (GCRMA) procedures. For further information on Affymetrix oligonucleotide microarray preprocessing, see Preprocessing Affymetrix Microarray Data at the Probe Level.
You will use the t-test and false discovery rate to detect differentially expressed genes between two of the tumor types. Additionally, you will look at Gene Ontology terms related to the significantly up-regulated genes.
Loading the Expression Data
Load the MAT file cnsexpressiondata containing three DataMatrix objects. Gene expression values preprocessed by RMA and GCRMA (MLE and EB) procedures are stored in the DataMatrix objects expr_cns_rma, expr_cns_gcrma_mle, and expr_cns_gcrma_eb respectively.
load cnsexpressiondata
In each DataMatrix object, each row corresponds to a probe set on the HuGeneFl array, and each column corresponds to a sample. The row names are the probe set IDs and column names are the sample names. The DataMatrix object expr_cns_gcrma_eb will be used in this demonstration. You can use either of the other two expression variables as well.
You can get the properties of the DataMatrix object expr_cns_gcrma_eb using the get command.
get(expr_cns_gcrma_eb)
Name: ''
RowNames: {7129x1 cell}
ColNames: {1x42 cell}
NRows: 7129
NCols: 42
NDims: 2
ElementClass: 'single'
Determine the number of genes and number of samples in the DataMatrix object expr_cns_gcrma_eb.
[nGenes, nSamples] = size(expr_cns_gcrma_eb)
nGenes =
7129
nSamples =
42
You can use gene symbols instead of the probe set IDs to label the expression values. The gene symbols for the HuGeneFl array are provided in a MAT file containing a Map object.
load HuGeneFL_GeneSymbol_Map;
Create a cell array of gene symbols for the expression values from the hu6800GeneSymbolMap object.
huGenes = values(hu6800GeneSymbolMap, expr_cns_gcrma_eb.RowNames);
Set the row names of the exprs_cns_gcrma_eb to gene symbols using the rownames method of the DataMatrix object.
expr_cns_gcrma_eb = rownames(expr_cns_gcrma_eb, ':', huGenes);
Filtering the Expression Data
Remove gene expression data with empty gene symbols. In the example, the empty symbols are labeled as '---'.
expr_cns_gcrma_eb('---', :) = [];
Many of the genes in this study are not expressed, or have only small variability across the samples. Remove these genes using non-specific filtering.
Use genelowvalfilter to filter out genes with very low absolute expression values.
[mask, expr_cns_gcrma_eb] = genelowvalfilter(expr_cns_gcrma_eb);
Use genevarfilter to filter out genes with a small variance across samples.
[mask, expr_cns_gcrma_eb] = genevarfilter(expr_cns_gcrma_eb);
Determine the number of genes after filtering.
nGenes = expr_cns_gcrma_eb.NRows
nGenes =
5669
Identifying Differential Gene Expression
You can now compare the gene expression values between two groups of data: CNS medulloblastomas (MD) and non-neuronal origin malignant gliomas (Mglio) tumor.
From the expression data of all 42 samples, extract the data of the 10 MD samples and the 10 Mglio samples.
MDs = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MD', 8); Mglios = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MGlio', 11);
MDData = expr_cns_gcrma_eb(:, MDs); get(MDData)
Name: ''
RowNames: {5669x1 cell}
ColNames: {1x10 cell}
NRows: 5669
NCols: 10
NDims: 2
ElementClass: 'single'
MglioData = expr_cns_gcrma_eb(:, Mglios); get(MglioData)
Name: ''
RowNames: {5669x1 cell}
ColNames: {1x10 cell}
NRows: 5669
NCols: 10
NDims: 2
ElementClass: 'single'
A standard statistical test for detecting significant changes between the measurement of a variable in two groups is the t-test. Conduct a t-test for each gene to identify significant changes in expression values between the MD samples and Mglio samples. You can inspect the test results from the normal quantile plot of t-scores and the histograms of t-scores and p-values of the t-tests.
[pvalues, tscores] = mattest(MDData, MglioData,... 'Showhist', true', 'Showplot', true);
In any test situation, two types of errors can occur, a false positive by declaring that a gene is differentially expressed when it is not, and a false negative when the test fails to identify a truly differentially expressed gene. In multiple hypothesis testing, which simultaneously tests the null hypothesis of thousands of genes using microarray expression data, each test has a specific false positive rate, or a false discovery rate (FDR). False discovery rate is defined as the expected ratio of the number of false positives to the total number of positive calls in a differential expression analysis between two groups of samples (Storey et al., 2003).
In this demonstration, you will compute the FDR using the Storey-Tibshirani procedure (Storey et al., 2003). The procedure also computes the q-value of a test, which measures the minimum FDR that occurs when calling the test significant. The estimation of FDR depends on the truly null distribution of the multiple tests, which is unknown. Permutation methods can be used to estimate the truly null distribution of the test statistics by permuting the columns of the gene expression data matrix (Storey et al., 2003, Dudoit et al., 2003). Depending on the sample size, it may not be feasible to consider all possible permutations. Usually a random subset of permutations are considered in the case of large sample size. Use the nchoosek function in Statistics Toolbox™ to find out the number of all possible permutations of the samples in this demonstration.
all_possible_perms = nchoosek(1:MDData.NCols+MglioData.NCols, MDData.NCols); size(all_possible_perms, 1)
ans =
184756
Perform a permutation t-test using mattest and the PERMUTE option to compute the p-values of 10,000 permutations by permuting the columns of the gene expression data matrix of MDData and MglioData (Dudoit et al., 2003).
pvaluesCorr = mattest(MDData, MglioData, 'Permute', 10000);
Determine the number of genes considered to have statistical significance at the p-value cutoff of 0.05. Note: You may get a different number of genes due to the permutation test outcome.
cutoff = 0.05; sum(pvaluesCorr < cutoff)
ans =
2125
Estimate the FDR and q-values for each test using mafdr. The quantity pi0 is the overall proportion of true null hypotheses in the study. It is estimated from the simulated null distribution via bootstrap or the cubic polynomial fit. Note: You can also manually set the value of lambda for estimating pi0.
figure;
[pFDR, qvalues] = mafdr(pvaluesCorr, 'showplot', true);
Determine the number of genes that have q-values less than the cutoff value. Note: You may get a different number of genes due to the permutation test and the bootstrap outcomes.
sum(qvalues < cutoff)
ans =
2134
Many genes with low FDR implies that the two groups, MD and Mglio, are biologically distinct.
You can also empirically estimate the FDR adjusted p-values using the Benjamini-Hochberg (BH) procedure (Benjamini et al, 1995) by setting the mafdr input parameter BHFDR to true.
pvaluesBH = mafdr(pvaluesCorr, 'BHFDR', true);
sum(pvaluesBH < cutoff)
ans =
1375
You can store the t-scores, p-values, pFDRs, q-values and BH FDR corrected p-values together as a DataMatrix object.
testResults = [tscores pvaluesCorr pFDR qvalues pvaluesBH];
Update the column name for BH FDR corrected p-values using the colnames method of DataMatrix object.
testResults = colnames(testResults, 5, {'FDR_BH'});
You can sort by p-values pvaluesCorr using the sortrows mathod.
testResults = sortrows(testResults, 2);
Display the first 23 genes in testResults. Note: Your results may be different from those shown below due to the permutation test and the bootstrap outcomes.
testResults(1:23, :)
ans =
t-scores p-values FDR q-values
RAB31 -13.664 1.2256e-008 2.5806e-005 2.572e-005
PLEC1 -9.6223 5.6687e-008 5.9681e-005 2.572e-005
HNRPA1 9.359 7.5912e-008 5.3281e-005 2.572e-005
FCGR2A -9.3548 7.6023e-008 4.0019e-005 2.572e-005
PLEC1 -9.3495 7.6167e-008 3.2076e-005 2.572e-005
FBL 9.1518 1.1161e-007 3.9166e-005 2.572e-005
KIAA0367 -8.996 1.2581e-007 3.7845e-005 2.572e-005
ID2B -8.9285 1.307e-007 3.4402e-005 2.572e-005
RBMX 8.8905 1.3305e-007 3.1128e-005 2.572e-005
PAFAH1B3 8.7561 1.4024e-007 2.9529e-005 2.572e-005
H3F3A 8.6512 1.4627e-007 2.7999e-005 2.572e-005
LRP1 -8.6465 1.4658e-007 2.572e-005 2.572e-005
PEA15 -8.3256 1.8957e-007 3.0704e-005 3.0704e-005
ID2B -8.1183 2.281e-007 3.4307e-005 3.2051e-005
SFRS3 8.1166 2.2833e-007 3.2051e-005 3.2051e-005
HLA-DPA1 -7.8546 3.6101e-007 4.7509e-005 4.7509e-005
C5orf13 7.7195 4.8936e-007 6.0612e-005 5.827e-005
PTMA 7.7013 4.9812e-007 5.827e-005 5.827e-005
NAP1L1 7.674 5.3726e-007 5.9541e-005 5.8442e-005
HMGB2 7.6532 5.551e-007 5.8442e-005 5.8442e-005
ARAF -7.5549 6.1454e-007 6.1618e-005 5.929e-005
PTPRZ1 -7.5352 6.1948e-007 5.929e-005 5.929e-005
SPARCL1 -7.3639 8.1963e-007 7.5036e-005 7.3354e-005
FDR_BH
RAB31 6.9245e-005
PLEC1 6.9245e-005
HNRPA1 6.9245e-005
FCGR2A 6.9245e-005
PLEC1 6.9245e-005
FBL 6.9245e-005
KIAA0367 6.9245e-005
ID2B 6.9245e-005
RBMX 6.9245e-005
PAFAH1B3 6.9245e-005
H3F3A 6.9245e-005
LRP1 6.9245e-005
PEA15 8.2665e-005
ID2B 8.6292e-005
SFRS3 8.6292e-005
HLA-DPA1 0.00012791
C5orf13 0.00015688
PTMA 0.00015688
NAP1L1 0.00015734
HMGB2 0.00015734
ARAF 0.00015963
PTPRZ1 0.00015963
SPARCL1 0.00019749
A gene is considered to be differentially expressed between the two groups of samples if it shows both statistical and biological significance. In this demonstration, you will compare the gene expression ratio of MD over Mglio tumor samples. Therefore an up-regulated gene in this demo has higher expression in MD and down-regulate gene has higher expression in Mglio.
Plot the –log10 of p-values against the biological effect in a volcano plot. Note: From the volcano plot UI, you can interactively change the p-value cutoff and fold change limit, and export differentially expressed genes.
diffStruct = mavolcanoplot(MDData, MglioData, pvaluesCorr)
diffStruct =
Name: 'Differentially Expressed'
PVCutoff: 0.0500
FCThreshold: 2
GeneLabels: {327x1 cell}
PValues: [327x1 bioma.data.DataMatrix]
FoldChanges: [327x1 bioma.data.DataMatrix]
Ctrl-click genes in the gene lists to label the genes in the plot. As seen in the volcano plot, genes specific for neuronal based cerebella granule cells, such as ZIC and NEUROD, are found in the up-regulated gene list, while genes typical of the astrocytic and oligodendrocytic lineage and cell differentiation, such as SOX2, PEA15, and ID2B, are found in the down-regulated list.
Determine the number of differentially expressed genes.
nDiffGenes = diffStruct.PValues.NRows
nDiffGenes = 327
Get the list of up-regulated genes for MD compared to Mglio.
up_geneidx = find(diffStruct.FoldChanges > 0); up_genes = rownames(diffStruct.FoldChanges, up_geneidx); nUpGenes = length(up_geneidx)
nUpGenes = 225
Determine the number of down-regulated genes for MD compared to Mglio.
nDownGenes = sum(diffStruct.FoldChanges < 0)
nDownGenes = 102
Annotating Up-Regulated Genes Using Gene Ontology
Use Gene Ontology (GO) to annotate the differentially expressed genes. You can look at the up-regulated genes from the analysis above. Download the Homo sapiens annotations (gene_association.goa_human.gz file) from Gene Ontology Current Annotations, unzip, and store it in your the current directory.
Find the indices of the up-regulated genes for Gene Ontology analysis.
huGenes = rownames(expr_cns_gcrma_eb); for i = 1:nUpGenes up_geneidx(i) = find(strncmpi(huGenes, up_genes{i}, length(up_genes {i})), 1); end
Load the Gene Ontology database into a MATLAB object using the geneont function.
GO = geneont('live',true);
Read the Homo sapiens gene annotation file. For this demonstration, you will look only at genes that are related to molecular function, so you only need to read the information where the Aspect field is set to 'F'. The fields that are of interest are the gene symbol and associated ID. In GO Annotation files these have field names DB_Object_Symbol and GOid respectively.
HGann = goannotread('gene_association.goa_human',... 'Aspect','F','Fields',{'DB_Object_Symbol','GOid'});
Create a map between annotated genes and GO terms.
HGmap = containers.Map(); for i=1:numel(HGann) key = HGann(i).DB_Object_Symbol; if isKey(HGmap,key) HGmap(key) = [HGmap(key) HGann(i).GOid]; else HGmap(key) = HGann(i).GOid; end end
Determine the number of Homo sapiens annotated genes related to molecular function.
HGmap.Count
ans =
15561
Not all of the 5758 genes on the HuGeneFL chip are annotated. For every gene on the chip, see if it is annotated by comparing its gene symbol to the list of gene symbols from GO. Track the number of annotated genes and the number of up-regulated genes associated with each GO term. Note: You might get warnings about invalid or obsolete IDs due to the frequent update to the Homo sapiens gene annotation file.
m = GO.Terms(end).id; % gets the last term id chipgenesCount = zeros(m,1); % a vector of GO term counts for the entire chi p. upgenesCount = zeros(m,1); % a vector of GO term counts for up-regulated g enes. for i = 1:length(huGenes) if isKey(HGmap,huGenes{i}) goid = getrelatives(GO,HGmap(huGenes{i})); % Update the tally chipgenesCount(goid) = chipgenesCount(goid) + 1; if (any(i == up_geneidx)) upgenesCount(goid) = upgenesCount(goid) +1; end end end
Determine the statistically significant GO terms using the hypergeometric probability distribution. For each GO term, a p-value is calculated representing the probability that the number of annotated genes associated with it could have been found by chance.
gopvalues = hygepdf(upgenesCount,max(chipgenesCount),... max(upgenesCount),chipgenesCount); [dummy, idx] = sort(gopvalues); report = sprintf('GO Term p-value counts definition\n'); for i = 1:10 term = idx(i); report = sprintf('%s%s\t%-1.5f\t%3d / %3d\t%s...\n',... report, char(num2goid(term)), gopvalues(term),... upgenesCount(term), chipgenesCount(term),... GO(term).Term.definition(2:min(50,end))); end disp(report);
GO Term p-value counts definition GO:0005515 0.00000 119 / 2984 Interacting selectively with any prot ein or prote... GO:0003735 0.00000 56 / 151 The action of a molecule that contribu tes to the ... GO:0019843 0.00000 52 / 214 Interacting selectively with ribosomal RNA." [GOC... GO:0003723 0.00000 59 / 299 Interacting selectively with an RNA mo lecule or a... GO:0003729 0.00000 50 / 225 Interacting selectively with pre-messe nger RNA (p... GO:0008312 0.00000 48 / 208 Interacting selectively with 7S RNA, t he RNA comp... GO:0000049 0.00000 47 / 211 Interacting selectively with transfer RNA." [GOC:... GO:0000498 0.00000 46 / 204 Interacting selectively with ribonucle ic acid (RN... GO:0030515 0.00000 46 / 204 Interacting selectively with small nuc leolar RNA.... GO:0033204 0.00000 46 / 204 Interacting selectively with the RNA s ubunit of r...
Inspect the significant GO terms and select the terms related to specific molecule functions to build a sub-ontology that includes the ancestors of the terms. Visualize this ontology using the biograph function. You can also color the graphs nodes. In this example, the red nodes are the most significant, while the blue nodes are the least significant gene ontology terms. Note: The GO terms returned may differ from those shown due to the frequent update to the Homo sapiens gene annotation file.
fcnAncestors = GO(getancestors(GO,idx(1:5))) [cm acc rels] = getmatrix(fcnAncestors); BG = biograph(cm,get(fcnAncestors.Terms,'name')) for i=1:numel(acc) pval = gopvalues(acc(i)); color = [(1-pval).^(1) pval.^(1/8) pval.^(1/8)]; set(BG.Nodes(i),'Color',color); end view(BG)
Gene Ontology object with 9 Terms. Biograph object with 9 nodes and 8 edges.
Finding the Differentially Expressed Genes in Pathways
You can query the pathway information of the differentially expressed genes from the KEGG pathway database through KEGG's SOAP Web Service (For more information, see Connecting to the KEGG API Web Service), or by simply passing the list of gene symbols to KEGG's Web query tool.
Following are a few pathway maps with the genes in the up-regulated gene list highlighted:
References
[1] Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., Black, P.M., Lau, C., Allen, J.C., Zagzag, D., Olson, J.M., Curran, T., Wetmore, C., Biegel, J.A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, DN, Mesirov, J.P., Lander, E.S., and Golub, T.R. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436-442.
[2] Storey, J.D., and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc.Nat.Acad.Sci., 100(16), 9440-9445.
[3] Dudoit, S., Shaffer, J.P., and Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiment. Statistical Science, 18, 71-103.
[4] Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc., B 57, 289-300.
Suggest an enhancement for this demo.

Free Computational Biology Interactive Kit
See how to analyze, visualize, and model biological data and systems using MathWorks products.
Get free kit