This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English version of the page.

Note: This page has been translated by MathWorks. Click here to see
To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

Identifying Differentially Expressed Genes from RNA-Seq Data

This example shows how to test RNA-Seq data for differentially expressed genes using a negative binomial model.


A typical differential expression analysis of RNA-Seq data consists of normalizing the raw counts and performing statistical tests to reject or accept the null hypothesis that two groups of samples show no significant difference in gene expression. This example shows how to inspect the basic statistics of raw count data, how to determine size factors for count normalization and how to infer the most differentially expressed genes using a negative binomial model.

The dataset for this example comprises of RNA-Seq data obtained in the experiment described by Brooks et al. [1]. The authors investigated the effect of siRNA knock-down of pasilla, a gene known to play an important role in the regulation of splicing in Drosophila melanogaster. The dataset consists of 2 biological replicates of the control (untreated) samples and 2 biological replicates of the knock-down (treated) samples.

Inspecting Read Count Tables for Genomic Features

The starting point for this analysis of RNA-Seq data is a count matrix, where the rows correspond to genomic features of interest, the columns correspond to the given samples and the values represent the number of reads mapped to each feature in a given sample.

The included file pasilla_count_noMM.mat contains two tables with the count matrices at the gene level and at the exon level for each of the considered samples. You can obtain similar matrices using the function featurecount.

load pasilla_count_noMM.mat
% preview the table of read counts for genes
ans =

  10x6 table

         ID          Reference    untreated3    untreated4    treated2    treated3
    _____________    _________    __________    __________    ________    ________

    'FBgn0000003'      '3R'             0             1            1           2  
    'FBgn0000008'      '2R'           142           117          138         132  
    'FBgn0000014'      '3R'            20            12           10          19  
    'FBgn0000015'      '3R'             2             4            0           1  
    'FBgn0000017'      '3L'          6591          5127         4809        6027  
    'FBgn0000018'      '2L'           469           530          492         574  
    'FBgn0000024'      '3R'             5             6           10           8  
    'FBgn0000028'      'X'              0             0            2           1  
    'FBgn0000032'      '3R'          1160          1143         1138        1415  
    'FBgn0000036'      '3R'             0             0            0           1  

Note that when counting is performed without summarization, the individual features (exons in this case) are reported with their metafeature assignment (genes in this case) followed by the start and stop positions.

% preview the table of read counts for exons
ans =

  10x6 table

                  ID                   Reference    untreated3    untreated4    treated2    treated3
    _______________________________    _________    __________    __________    ________    ________

    'FBgn0000003_2648220_2648518'        '3R'             0             0            0           1  
    'FBgn0000008_18024938_18025756'      '2R'             0             1            0           2  
    'FBgn0000008_18050410_18051199'      '2R'            13             9           14          12  
    'FBgn0000008_18052282_18052494'      '2R'             4             2            5           0  
    'FBgn0000008_18056749_18058222'      '2R'            32            27           26          23  
    'FBgn0000008_18058283_18059490'      '2R'            14            18           29          22  
    'FBgn0000008_18059587_18059757'      '2R'             1             4            3           0  
    'FBgn0000008_18059821_18059938'      '2R'             0             0            2           0  
    'FBgn0000015_12758093_12760298'      '3R'             1             2            0           0  
    'FBgn0000017_16615461_16618374'      '3L'          1807          1572         1557        1702  

You can annotate and group the samples by creating a logical vector as follows:

samples = geneCountTable(:,3:end).Properties.VariableNames;
untreated = strncmp(samples,'untreated',length('untreated'))
treated = strncmp(samples,'treated',length('treated'))
untreated =

  1x4 logical array

   1   1   0   0

treated =

  1x4 logical array

   0   0   1   1

Plotting the Feature Assignments

The included file also contains a table geneSummaryTable with the summary of assigned and unassigned SAM entries. You can plot the basic distribution of the counting results by considering the number of reads that are assigned to the given genomic features (exons or genes for this example), as well as the number of reads that are unassigned (i.e. not overlapping any feature) or ambiguous (i.e. overlapping multiple features).

st = geneSummaryTable({'Assigned','Unassigned_ambiguous','Unassigned_noFeature'},:)
ylabel('Number of reads')
st =

  3x4 table

                            untreated3    untreated4     treated2      treated3 
                            __________    __________    __________    __________

    Assigned                1.5457e+07    1.6302e+07    1.5146e+07    1.8856e+07
    Unassigned_ambiguous    1.5708e+05    1.6882e+05    1.6194e+05    1.9977e+05
    Unassigned_noFeature    7.5455e+05    5.8309e+05    5.8756e+05    6.8356e+05

Note that a small fraction of the alignment records in the SAM files is not reported in the summary table. You can notice this in the difference between the total number of records in a SAM file and the total number of records processed during the counting procedure for that same SAM file. These unreported records correspond to the records mapped to reference sequences that are not annotated in the GTF file and therefore are not processed in the counting procedure. If the gene models account for all the reference sequences used during the read mapping step, then all records are reported in one of the categories of the summary table.

geneSummaryTable{'TotalEntries',:} - sum(geneSummaryTable{2:end,:})
ans =

       89516       95885       98207      104629

Plotting Read Coverage Across a Given Chromosome

When read counting is performed without summarization using the function featurecount, the default IDs are composed by the attribute or metafeature (by default, gene_id) followed by the start and the stop positions of the feature (by default, exon). You can use the exon start positions to plot the read coverage across any chromosome in consideration, for example chromosome arm 2L.

% consider chromosome arm 2L
chr2L = strcmp(exonCountTable.Reference,'2L');
exonCount = exonCountTable{:,3:end};

% retrieve exon start positions
exonStart = regexp(exonCountTable{chr2L,1},'_(\d+)_','tokens');
exonStart = [exonStart{:}];
exonStart = cellfun(@str2num, [exonStart{:}]');

% sort exon by start positions
[~,idx] = sort(exonStart);

% plot read coverage along the genomic coordinates
xlabel('Genomic position');
ylabel('Read count (exon level)');
title('Read count on Chromosome arm 2L');

% plot read coverage for each group separately
ylabel('Read count (exon level)');
title('Chromosome arm 2L (treated samples)');
ylabel('Read count (exon level)');
xlabel('Genomic position');
title('Chromosome arm 2L (untreated samples)');

Alternatively, you can plot the read coverage considering the starting position of each gene in a given chromosome. The file pasilla_geneLength.mat contains a table with the start and stop position of each gene in the corresponding gene annotation file.

% load gene start and stop position information
load pasilla_geneLength
ans =

  10x5 table

         ID            Name       Reference    Start    Stop 
    _____________    _________    _________    _____    _____

    'FBgn0037213'    'CG12581'       3R          380    10200
    'FBgn0000500'    'Dsk'           3R        15388    16170
    'FBgn0053294'    'CR33294'       3R        17136    21871
    'FBgn0037215'    'CG12582'       3R        23029    30295
    'FBgn0037217'    'CG14636'       3R        30207    41033
    'FBgn0037218'    'aux'           3R        37505    53244
    'FBgn0051516'    'CG31516'       3R        44179    45852
    'FBgn0261436'    'DhpD'          3R        53106    54971
    'FBgn0037220'    'CG14641'       3R        56475    58077
    'FBgn0015331'    'abs'           3R        58765    60763

% consider chromosome 3 ('Reference' is a categorical variable)
chr3 = (geneLength.Reference == '3L') | (geneLength.Reference == '3R');

% consider the counts for genes in chromosome 3
counts = geneCountTable{:,3:end};
[~,j,k] = intersect(geneCountTable{:,'ID'},geneLength{chr3,'ID'});
gstart = geneLength{k,'Start'};
gcounts = counts(j,:);

% sort according to ascending start position
[~,idx] = sort(gstart);

% plot read coverage by genomic position
plot(gstart(idx), gcounts(idx,treated),'.-r',...
	gstart(idx), gcounts(idx,untreated),'.-b');
xlabel('Genomic position')
ylabel('Read count');
title('Read count on Chromosome 3');
ans =


Normalizing Read Counts

The read count in RNA-Seq data has been found to be linearly related to the abundance of transcripts [2]. However, the read count for a given gene depends not only on the expression level of the gene, but also on the total number of reads sequenced and the length of the gene transcript. Therefore, in order to infer the expression level of a gene from the read count, we need to account for the sequencing depth and the gene transcript length. One common technique to normalize the read count is to use the RPKM (Read Per Kilobase Mapped) values, where the read count is normalized by the total number of reads yielded (in millions) and the length of each transcript (in kilobases). This normalization technique, however, is not always effective since few, very highly expressed genes can dominate the total lane count and skew the expression analysis.

A better normalization technique consists of computing the effective library size by considering a size factor for each sample. By dividing each sample's counts by the corresponding size factors, we bring all the count values to a common scale, making them comparable. Intuitively, if sample A is sequenced N times deeper than sample B, the read counts of non-differentially expressed genes are expected to be on average N times higher in sample A than in sample B, even if there is no difference in expression.

To estimate the size factors, take the median of the ratios of observed counts to those of a pseudo-reference sample, whose counts can be obtained by considering the geometric mean of each gene across all samples [3]. Then, to transform the observed counts to a common scale, divide the observed counts in each sample by the corresponding size factor.

% estimate pseudo-reference with geometric mean row by row
pseudoRefSample = geomean(counts,2);
nz = pseudoRefSample > 0;
ratios = bsxfun(@rdivide,counts(nz,:),pseudoRefSample(nz));
sizeFactors = median(ratios,1)
sizeFactors =

    0.9374    0.9725    0.9388    1.1789

% transform to common scale
normCounts = bsxfun(@rdivide,counts,sizeFactors);
ans =

   1.0e+03 *

         0    0.0010    0.0011    0.0017
    0.1515    0.1203    0.1470    0.1120
    0.0213    0.0123    0.0107    0.0161
    0.0021    0.0041         0    0.0008
    7.0315    5.2721    5.1225    5.1124
    0.5003    0.5450    0.5241    0.4869
    0.0053    0.0062    0.0107    0.0068
         0         0    0.0021    0.0008
    1.2375    1.1753    1.2122    1.2003
         0         0         0    0.0008

You can appreciate the effect of this normalization by using the function boxplot to represent statistical measures such as median, quartiles, minimum and maximum.


maboxplot(log2(counts),'title','Raw Read Count','orientation','horizontal')

maboxplot(log2(normCounts),'title','Normalized Read Count','orientation','horizontal')

Computing Mean, Dispersion and Fold Change

In order to better characterize the data, we consider the mean and the dispersion of the normalized counts. The variance of read counts is given by the sum of two terms: the variation across samples (raw variance) and the uncertainty of measuring the expression by counting reads (shot noise or Poisson). The raw variance term dominates for highly expressed genes, whereas the shot noise dominates for lowly expressed genes. You can plot the empirical dispersion values against the mean of the normalized counts in a log scale as shown below.

% consider the mean
meanTreated = mean(normCounts(:,treated),2);
meanUntreated = mean(normCounts(:,untreated),2);

% consider the dispersion
dispTreated = std(normCounts(:,treated),0,2) ./ meanTreated;
dispUntreated = std(normCounts(:,untreated),0,2) ./ meanUntreated;

% plot on a log-log scale
hold on;

Given the small number of replicates, it is not surprising to expect that the dispersion values scatter with some variance around the true value. Some of this variance reflects sampling variance and some reflects the true variability among the gene expressions of the samples.

You can look at the difference of the gene expression among two conditions, by calculating the fold change (FC) for each gene, i.e. the ratio between the counts in the treated group over the counts in the untreated group. Generally these ratios are considered in the log2 scale, so that any change is symmetric with respect to zero (e.g. a ratio of 1/2 or 2/1 corresponds to -1 or +1 in the log scale).

% compute the mean and the log2FC
meanBase = (meanTreated + meanUntreated) / 2;
foldChange = meanTreated ./ meanUntreated;
log2FC = log2(foldChange);

% plot mean vs. fold change (MA plot)
mairplot(meanTreated, meanUntreated,'Type','MA','Plotonly',true);
set(get(gca,'Xlabel'),'String','mean of normalized counts')
set(get(gca,'Ylabel'),'String','log2(fold change)')
Warning: Zero values are ignored 

It is possible to annotate the values in the plot with the corresponding gene names, interactively select genes, and export gene lists to the workspace by calling the mairplot function as illustrated below:

Warning: Zero values are ignored 

It is convenient to store the information about the mean value and fold change for each gene in a table. You can then access information about a given gene or a group of genes satisfying specific criteria by indexing the table by gene names.

% create table with statistics about each gene
geneTable = table(meanBase,meanTreated,meanUntreated,foldChange,log2FC);
geneTable.Properties.RowNames = geneCountTable.ID;
% summary

    meanBase: 11609x1 double


            Min          0.21206
            Median        201.24
            Max       2.6789e+05

    meanTreated: 11609x1 double


            Min                 0  
            Median         201.54  
            Max        2.5676e+05  

    meanUntreated: 11609x1 double


            Min                  0   
            Median          199.44   
            Max         2.7903e+05   

    foldChange: 11609x1 double


            Min               0   
            Median      0.99903   
            Max             Inf   

    log2FC: 11609x1 double


            Min            -Inf
            Median    -0.001406
            Max             Inf

% preview
ans =

  10x5 table

                   meanBase    meanTreated    meanUntreated    foldChange      log2FC   
                   ________    ___________    _____________    __________    ___________

    FBgn0000003     0.9475        1.3808         0.51415         2.6857           1.4253
    FBgn0000008     132.69        129.48           135.9        0.95277        -0.069799
    FBgn0000014     15.111        13.384          16.838        0.79488         -0.33119
    FBgn0000015     1.7738       0.42413          3.1234        0.13579          -2.8806
    FBgn0000017     5634.6        5117.4          6151.8        0.83186         -0.26559
    FBgn0000018     514.08        505.48          522.67        0.96711        -0.048243
    FBgn0000024     7.2354        8.7189           5.752         1.5158          0.60009
    FBgn0000028    0.74465        1.4893               0            Inf              Inf
    FBgn0000032     1206.3        1206.2          1206.4        0.99983      -0.00025093
    FBgn0000036    0.21206       0.42413               0            Inf              Inf

% access information about a specific gene
myGene = 'FBgn0261570';

% access information about a given gene list
myGeneSet = {'FBgn0261570','FBgn0261573','FBgn0261575','FBgn0261560'};
ans =

  1x5 table

                   meanBase    meanTreated    meanUntreated    foldChange    log2FC 
                   ________    ___________    _____________    __________    _______

    FBgn0261570     4435.5       4939.1          3931.8          1.2562      0.32907

ans =

  1x2 table

                   meanBase    log2FC 
                   ________    _______

    FBgn0261570     4435.5     0.32907

ans =

  4x5 table

                   meanBase    meanTreated    meanUntreated    foldChange    log2FC 
                   ________    ___________    _____________    __________    _______

    FBgn0261570     4435.5       4939.1          3931.8          1.2562      0.32907
    FBgn0261573     2936.9       2954.8          2919.1          1.0122      0.01753
    FBgn0261575     4.3776       5.6318          3.1234          1.8031      0.85047
    FBgn0261560     2041.1       1494.3            2588         0.57738      -0.7924

Inferring Differential Expression with a Negative Bionomial Model

Determining whether the gene expressions in two conditions are statistically different consists of rejecting the null hypothesis that the two data samples come from distributions with equal means. This analysis assumes the read counts are modeled according to a negative bionomial distribution (as proposed in [3]). The function nbintest performs this type of hypothesis testing with three possible options to specify the type of linkage between the variance and the mean.

By specifying the link between variance and mean as an identity, we assume the variance is equal to the mean, and the counts are modeled by the Poisson distribution [4].

tIdentity = nbintest(counts(:,treated),counts(:,untreated),'VarianceLink','Identity');
h = plotVarianceLink(tIdentity);

% set custom title
h(1).Title.String = 'Variance Link on Treated Samples';
h(2).Title.String = 'Variance Link on Untreated Samples';

Alternatively, by specifying the variance as the sum of the shot noise term (i.e. mean) and a constant multiplied by the squared mean, the counts are modeled according to a distribution described in [5]. The constant term is estimated using all the rows in the data.

tConstant = nbintest(counts(:,treated),counts(:,untreated),'VarianceLink','Constant');
h = plotVarianceLink(tConstant);

% set custom title
h(1).Title.String = 'Variance Link on Treated Samples';
h(2).Title.String = 'Variance Link on Untreated Samples';

Finally, by considering the variance as the sum of the shot noise term (i.e. mean) and a locally regressed non-parametric smooth function of the mean, the counts are modeled according to the distribution proposed in [3].

tLocal = nbintest(counts(:,treated),counts(:,untreated),'VarianceLink','LocalRegression');
h = plotVarianceLink(tLocal);

% set custom title
h(1).Title.String = 'Variance Link on Treated Samples';
h(2).Title.String = 'Variance Link on Untreated Samples';

In order to evaluate which fit is the best for the data in consideration, you can compare the fitting curves in a single plot, as shown below.

h = plotVarianceLink(tLocal,'compare',true);

% set custom title
h(1).Title.String = 'Variance Link on Treated Samples';
h(2).Title.String = 'Variance Link on Untreated Samples';

The output of nbintest includes a vector of P-values. A P-value indicates the probability that a change in expression as strong as the one observed (or even stronger) would occur under the null hypothesis, i.e. the conditions have no effect on gene expression. In the histogram of the P-values we observe an enrichment of low values (due to differentially expressed genes), whereas other values are uniformly spread (due to non-differentially expressed genes). The enrichment of values equal to 1 are due to genes with very low counts.

title('P-value enrichment')

Filter out those genes with relatively low count to observe a more uniform spread of non-significant P-values across the range (0,1]. Note that this does not affect the distribution of significant P-values.

lowCountThreshold = 10;
lowCountGenes = all(counts < lowCountThreshold, 2);
title('P-value enrichment without low count genes')

Multiple Testing and Adjusted P-values

Thresholding P-values to determine what fold changes are more significant than others is not appropriate for this type of data analysis, due to the multiple testing problem. While performing a large number of simultaneous tests, the probability of getting a significant result simply due to chance increases with the number of tests. In order to account for multiple testing, perform a correction (or adjustment) of the P-values so that the probability of observing at least one significant result due to chance remains below the desired significance level.

The Benjamini-Hochberg (BH) adjustment [6] is a statistical method that provides an adjusted P-value answering the following question: what would be the fraction of false positives if all the genes with adjusted P-values below a given threshold were considered significant? Set a threshold of 0.1 for the adjusted P-values, equivalent to consider a 10% false positives as acceptable, and identify the genes that are significantly expressed by considering all the genes with adjusted P-values below this threshold.

% compute the adjusted P-values (BH correction)
padj = mafdr(tLocal.pValue,'BHFDR',true);

% add to the existing table
geneTable.pvalue = tLocal.pValue;
geneTable.padj = padj;

% create a table with significant genes
sig = geneTable.padj < 0.1;
geneTableSig = geneTable(sig,:);
geneTableSig = sortrows(geneTableSig,'padj');
numberSigGenes = size(geneTableSig,1)
numberSigGenes =


Identifying the Most Up-regulated and Down-regulated Genes

You can now identify the most up-regulated or down-regulated genes by considering an absolute fold change above a chosen cutoff. For example, a cutoff of 1 in log2 scale yields the list of genes that are up-regulated with a 2 fold change.

% find up-regulated genes
up = geneTableSig.log2FC > 1;
upGenes = sortrows(geneTableSig(up,:),'log2FC','descend');
numberSigGenesUp = sum(up)

% display the top 10 up-regulated genes
top10GenesUp = upGenes(1:10,:)

% find down-regulated genes
down = geneTableSig.log2FC < -1;
downGenes = sortrows(geneTableSig(down,:),'log2FC','ascend');
numberSigGenesDown = sum(down)

% find top 10 down-regulated genes
top10GenesDown = downGenes(1:10,:)
numberSigGenesUp =


top10GenesUp =

  10x7 table

                   meanBase    meanTreated    meanUntreated    foldChange    log2FC      pvalue         padj   
                   ________    ___________    _____________    __________    ______    __________    __________

    FBgn0030173     3.3979       6.7957               0             Inf         Inf     0.0063115      0.047764
    FBgn0036822     3.1364       6.2729               0             Inf         Inf      0.012203      0.079274
    FBgn0052548      8.158       15.269          1.0476          14.575      3.8654    0.00016945     0.0022662
    FBgn0050495     6.8315       12.635          1.0283          12.287      3.6191     0.0018949      0.017972
    FBgn0063667     20.573       38.042          3.1042          12.255      3.6153    8.5037e-08    2.3845e-06
    FBgn0033764     91.969       167.61          16.324          10.268      3.3601    1.8345e-25    2.9174e-23
    FBgn0037290     85.845       155.46          16.228          9.5801        3.26    3.5583e-23    4.6941e-21
    FBgn0033733     7.4634       13.384          1.5424          8.6773      3.1172     0.0027276      0.024283
    FBgn0037191     7.1766       12.753          1.6003          7.9694      2.9945     0.0047803      0.038193
    FBgn0033943       6.95       12.319           1.581          7.7921       2.962     0.0053635      0.041986

numberSigGenesDown =


top10GenesDown =

  10x7 table

                   meanBase    meanTreated    meanUntreated    foldChange    log2FC       pvalue         padj   
                   ________    ___________    _____________    __________    _______    __________    __________

    FBgn0053498     15.469             0         30.938                0        -Inf    9.8404e-11     4.345e-09
    FBgn0259236     6.8092             0         13.618                0        -Inf    1.5526e-05    0.00027393
    FBgn0052500     4.3703             0         8.7405                0        -Inf    0.00066783     0.0075343
    FBgn0039331     3.6954             0         7.3908                0        -Inf     0.0019558      0.018474
    FBgn0040697      3.419             0         6.8381                0        -Inf     0.0027378      0.024336
    FBgn0034972     2.9145             0         5.8291                0        -Inf     0.0068564       0.05073
    FBgn0040967     2.6382             0         5.2764                0        -Inf     0.0096039      0.065972
    FBgn0031923     2.3715             0         4.7429                0        -Inf      0.016164      0.098762
    FBgn0085359     62.473        2.9786         121.97         0.024421     -5.3557    5.5813e-33    1.5068e-30
    FBgn0004854     7.4674       0.53259         14.402          0.03698     -4.7571    8.1587e-05     0.0012034

A good visualization of the gene expressions and their significance is given by plotting the fold change versus the mean in log scale and coloring the data points according to the adjusted P-values.

ylabel('log2(Fold Change)')
xlabel('log2(Mean of normalized counts)')
title('Fold change by FDR')

You can see here that for weakly expressed genes (i.e. those with low means), the FDR is generally high because low read counts are dominated by Poisson noise and consequently any biological variability is drowned in the uncertainties from the read counting.


[1] Brooks et al. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research 2011. 21:193-202.

[2] Mortazavi et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 2008. 5:621-628.

[3] Anders et al. Differential expression analysis for sequence count data. Genome Biology 2010. 11:R106.

[4] Marioni et al. RNA-Seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008. 18:1509-1517.

[5] Robinson et al. Moderated statistical test for assessing differences in tag abundance. Bioinformatics 2007. 23(21):2881-2887.

[6] Benjamini et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing. 1995. Journal of the Royal Statistical Society, Series B57 (1):289-300.

Was this topic helpful?