Identifying Differentially Expressed Genes from RNA-Seq Data

This example shows how to load RNA-seq data and test for differential expression using a negative binomial model.


RNA-seq is an emerging technology for surveying gene expression and transcriptome content by directly sequencing the mRNA molecules in a sample. RNA-seq can provide gene expression measurements and is regarded as an attractive approach to analyze a transcriptome in an unbiased and comprehensive manner.

In this example, you will use Bioinformatics Toolbox™ and Statistics and Machine Learning Toolbox™ functions to load publicly available transcriptional profiling sequencing data into MATLAB®, compute the digital gene expression, and then identify differentially expressed genes in RNA-seq data from hormone treated prostate cancer cell line samples [1].

The Prostate Cancer Data Set

In the prostate cancer study, the prostate cancer cell line LNCap was treated with androgen/DHT. Mock-treated and androgen-stimulated LNCap cells were sequenced using the Illumina® 1G Genome Analyzer [1]. For the mock-treated cells, there were four lanes totaling ~10 million reads. For the DHT-treated cells, there were three lanes totaling ~7 million reads. All replicates were technical replicates. Samples labeled s1 through s4 are from mock-treated cells. Samples labeled s5, s6, and s8 are from DHT-treated cells. The read sequences are stored in FASTA files. The sequence IDs break down as follows: seq_(unique sequence id)_(number of times this sequence was seen in this lane).

This example assumes that you have:

(1) Downloaded and uncompressed the seven FASTA files (s1.fa, s2.fa, s3.fa, s4.fa, s5.fa, s6.fa and s8.fa) containing the raw, 35bp, unmapped short reads from the author's Web Site.

(2) Produced a SAM-formatted file for each of the seven FASTA files by mapping the short reads to the NCBI version 37 of the human genome using a mapper such as Bowtie [2],

(3) Ordered the SAM-formatted files by reference name first, then by genomic position.

For the published version of this example, 4,388,997 short reads were mapped using the Bowtie aligner [2]. The aligner was instructed to report one best valid alignment. No more than two mismatches were allowed for alignment. Reads with more than one reportable alignment were suppressed, i.e. any read that mapped to multiple locations was discarded. The alignment was output to seven SAM files (s1.sam, s2.sam, s3.sam, s4.sam, s5.sam, s6.sam and s8.sam). Because the input files were FASTA files, all quality values were assumed to be 40 on the Phred quality scale [2]. We then used SAMtools [3] to sort the mapped reads in the seven SAM files, one for each replicate.

Creating an Annotation Object of Target Genes

Download from Ensembl a tab-separated-value (TSV) table with all protein encoding genes to a text file, ensemblmart_genes_hum37.txt. For this example, we are using Ensembl release 64. Using Ensembl's BioMart service, you can select a table with the following attributes: chromosome name, gene biotype, gene name, gene start/end, and strand direction.

Use the provided helper function ensemblmart2gff to convert the downloaded TSV file to a GFF formatted file. Then use GFFAnnotation to load the file into MATLAB.

GFFfilename = ensemblmart2gff('ensemblmart_genes_hum37.txt');
genes = GFFAnnotation(GFFfilename)
genes = 

  GFFAnnotation with properties:

    FieldNames: {1x9 cell}
    NumEntries: 21184

Create a subset with the genes present in chromosomes only (without contigs). The GFFAnnotation object contais 20012 annotated protein-coding genes in the Ensembl database.

chrs = {'1','2','3','4','5','6','7','8','9','10','11','12','13','14',...
genes = getSubset(genes,'reference',chrs)
genes = 

  GFFAnnotation with properties:

    FieldNames: {1x9 cell}
    NumEntries: 20012

Copy the gene information into a structure and display the first entry.

ans = 

     Reference: '1'
         Start: 205111632
          Stop: 205180727
       Feature: 'DSTYK'
        Source: 'protein_coding'
         Score: '0.0'
        Strand: '-'
         Frame: '.'
    Attributes: ''

Importing Mapped Short Read Alignment Data

The size of the sorted SAM files in this data set are in the order of 250-360MB. You can access the mapped reads in s1.sam by creating a BioMap. BioMap has an interface that provides direct access to the mapped short reads stored in the SAM-formatted file, thus minimizing the amount of data that is actually loaded into memory.

bm = BioMap('s1.sam')
bm = 

  BioMap with properties:

    SequenceDictionary: {1x25 cell}
             Reference: [458367x1 File indexed property]
             Signature: [458367x1 File indexed property]
                 Start: [458367x1 File indexed property]
        MappingQuality: [458367x1 File indexed property]
                  Flag: [458367x1 File indexed property]
          MatePosition: [458367x1 File indexed property]
               Quality: [458367x1 File indexed property]
              Sequence: [458367x1 File indexed property]
                Header: [458367x1 File indexed property]
                 NSeqs: 458367
                  Name: ''

Use the getSummary method to obtain a list of the existing references and the actual number of short read mapped to each one. Observe that the order of the references is equivalent to the previously created cell string chrs.

BioMap summary:
                                  Name: ''
                        Container_Type: 'Data is file indexed.'
             Total_Number_of_Sequences: 458367
    Number_of_References_in_Dictionary: 25

                                      Number_of_Sequences    Genomic_Range      
    gi|224589800|ref|NC_000001.10|    39037                    564571  249213991
    gi|224589811|ref|NC_000002.11|    23102                     39107  243177977
    gi|224589815|ref|NC_000003.11|    23788                    578280  197769619
    gi|224589816|ref|NC_000004.11|    16273                     56044  190988830
    gi|224589817|ref|NC_000005.9|     20875                     50342  180698591
    gi|224589818|ref|NC_000006.11|    16743                    277774  170892222
    gi|224589819|ref|NC_000007.13|    17022                    146474  158834423
    gi|224589820|ref|NC_000008.10|    12199                    162668  146284742
    gi|224589821|ref|NC_000009.11|    13988                     21790  141067447
    gi|224589801|ref|NC_000010.10|    15707                    179281  135500747
    gi|224589802|ref|NC_000011.9|     37506                    203411  134375386
    gi|224589803|ref|NC_000012.11|    21714                     79745  133785475
    gi|224589804|ref|NC_000013.10|     6078                  19335895  115091858
    gi|224589805|ref|NC_000014.8|     14644                  19123810  107260517
    gi|224589806|ref|NC_000015.9|     13199                  20145084  102501644
    gi|224589807|ref|NC_000016.9|     15423                     92212   90143169
    gi|224589808|ref|NC_000017.10|    22089                     56680   81014350
    gi|224589809|ref|NC_000018.9|      5986                    111538   77957293
    gi|224589810|ref|NC_000019.9|     17690                     63006   59093541
    gi|224589812|ref|NC_000020.10|    10026                    119233   62906673
    gi|224589813|ref|NC_000021.8|      6119                   9421584   48085597
    gi|224589814|ref|NC_000022.10|     7366                  16150315   51216589
    gi|224589822|ref|NC_000023.10|    12939                   2774622  154563685
    gi|224589823|ref|NC_000024.9|      2819                   2711686   59032821
    gi|17981852|ref|NC_001807.4|      66035                        12      16570

You can access the alignments, and perform operations like getting counts and coverage from bm. For more examples of getting read coverage at the chromosome level, see Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data.

Determining Digital Gene Expression

Next, you will determine the mapped reads associated with each Ensembl gene. Because the strings used in the SAM files to denote the reference names are different to those provided in the annotations, we find a vector with the reference index for each gene:

geneReference =  seqmatch(genes.Reference,chrs,'exact',true);

For each gene, count the mapped reads that overlap any part of the gene. The read counts for each gene are the digital gene expression of that gene. Use the getCounts method of a BioMap to compute the read count within a specified range.

counts = getCounts(bm,genes.Start,genes.Stop,1:genes.NumEntries,geneReference);

Gene expression levels can be best respresented by a table, with each row representing a gene. Create a table with two columns, set the first column to the gene symbols and second column to the counts of the first sample.

filenames = {'s1.sam','s2.sam','s3.sam','s4.sam','s5.sam','s6.sam','s8.sam'};
samples =  {'Mock_1','Mock_2','Mock_3','Mock_4','DHT_1','DHT_2','DHT_3'};

lncap = table(genes.Feature,counts,'VariableNames',{'Gene',samples{1}});

Display the counts for the first ten genes.

ans = 

        Gene        Mock_1
    ____________    ______

    'DSTYK'         21    
    'KCNJ2'          1    
    'DPF3'           2    
    'KRT78'          0    
    'GPR19'          1    
    'SOX9'           8    
    'C17orf63'      13    
    'AL929472.1'     0    
    'INPP5B'        19    
    'NME4'          10    

Determine the number of genes that have counts greater than or equal to 50 in chromosome 1.

lichr1 = geneReference == 1;  % logical index to genes in chromosome 1
sum(lncap.Mock_1>=50 & lichr1)
ans =


Repeat this step for the other six samples (SAM files) in the data set to get their gene counts and copy the information to the previously created table.

for i = 2:7
    bm = BioMap(filenames{i});
    counts = getCounts(bm,genes.Start,genes.Stop,1:genes.NumEntries,geneReference);
	lncap.(samples{i}) = counts;

Inspect the first 10 rows in the table with the counts for all seven samples.

lncap(1:10, :)
ans = 

        Gene        Mock_1    Mock_2    Mock_3    Mock_4    DHT_1    DHT_2
    ____________    ______    ______    ______    ______    _____    _____

    'DSTYK'         21        15        15        24        24       24   
    'KCNJ2'          1         0         2         0         0        2   
    'DPF3'           2         2         2         2         2        1   
    'KRT78'          0         0         0         0         0        0   
    'GPR19'          1         2         1         1         0        0   
    'SOX9'           8        13        19        15        27       22   
    'C17orf63'      13        12        16        24        19       12   
    'AL929472.1'     0         0         0         1         0        0   
    'INPP5B'        19        23        27        24        35       32   
    'NME4'          10        11        14        22        11       20   



The table lncap contains counts for samples from two biological conditions: mock-treated (Aidx) and DHT-treated (Bidx).

Aidx = logical([1 1 1 1 0 0 0]);
Bidx = logical([0 0 0 0 1 1 1]);

You can plot the counts for a chromosome along the chromosome genome coordinate. For example, plot the counts for chromosome 1 for mock-treated sample Mock_1 and DHT-treated sample DHT_1. Add the ideogram for chromosome 1 to the plot using the chromosomeplot function.

ichr1 = find(lichr1);  % linear index to genes in chromosome 1
[~,h] = sort(genes.Start(ichr1));
ichr1 = ichr1(h);      % linear index to genes in chromosome 1 sorted by
                       % genomic position

plot(genes.Start(ichr1), lncap{ichr1,'Mock_1'}, '.-r',...
     genes.Start(ichr1), lncap{ichr1,'DHT_1'}, '.-b');
ylabel('Gene Counts')
title('Gene Counts on Chromosome 1')
fixGenomicPositionLabels(gca)  % formats tick labels and adds datacursors
chromosomeplot('hs_cytoBand.txt', 1, 'AddToPlot', gca)

Inference of Differential Signal in RNA Expression

For RNA-seq experiments, the read counts have been found to be linearly related to the abundance of the target transcripts [4]. The interest lies in comparing the read counts between different biological conditions. Current observations suggest that typical RNA-seq experiments have low background noise, and the gene counts are discrete and could follow the Poisson distribution. While it has been noted that the assumption of the Poisson distribution often predicts smaller variation in count data by ignoring the extra variation due to the actual differences between replicate samples [5]. Anders,(2010) proposed an error model for statistical inference of differential signal in RNA-seq expression data that could address the overdispersion problem [6]. Their approach uses the negative binomial distribution to model the null distribution of the read counts. The mean and variance of the negative binomial distribution are linked by local regression, and these two parameters can be reliably estimated even when the number of replicates is small [6].

In this example, you will apply this statistical model to process the count data and test for differential expression. The details of the algorithm can be found in reference [6]. The model of Anders, (2010) has three sets of parameters that need to be estimated from the data:

1. Library size parameters;

2. Gene abundance parameters under each experimental condition;

3. The smooth functions that model the dependence of the raw variance on the expected mean.

Estimating Library Size Factor

The expectation values of all gene counts from a sample are proportional to the sample's library size. The effective library size can be estimated from the count data.

Compute the geometric mean of the gene counts (rows in lncap) across all samples in the experiment as a pseudo-reference sample.

pseudo_ref_sample = geomean(lncap{:,samples},2);

Each library size parameter is computed as the median of the ratio of the sample's counts to those of the pseudo-reference sample.

nzi = pseudo_ref_sample>0; % ignore genes with zero geometric mean
ratios = bsxfun(@rdivide, lncap{nzi,samples}, pseudo_ref_sample(nzi));
sizeFactors = median(ratios, 1);

The counts can be transformed to a common scale using size factor adjustment.

base_lncap = lncap;
base_lncap{:,samples} = bsxfun(@rdivide,lncap{:,samples},sizeFactors);

Use the boxplot function to inspect the count distribution of the mock-treated and DHT-treated samples and the size factor adjustment.

maboxplot(log2(lncap{:,samples}), 'title','Raw Read Counts',...
                                  'orientation', 'horizontal')
maboxplot(log2(base_lncap{:,samples}), 'title','Size Factor Adjusted Read Counts',...
                                       'orientation', 'horizontal')

Estimate the gene abundance

To estimate the gene abundance for each experimental condition (mock-treated (A) and DHT-treated (B)) you use the average of the counts from the samples transformed to the common scale. (Eq. 6 in [6])

mean_A = mean(base_lncap{:,samples(Aidx)}, 2);
mean_B = mean(base_lncap{:,samples(Bidx)}, 2);

Plot the log2 fold changes against the base means using the mairplot function. A quick exploration reflects ~15 differentially expressed genes (20 fold change or more), though not all of these are significant due to the low number of counts compared to the sample variance.


Estimating Negative Binomial Distribution Parameters

In the model, the variances of the counts of a gene are considered as the sum of a shot noise term and a raw variance term. The shot noise term is the mean counts of the gene, while the raw variance can be predicted from the mean, i.e., genes with a similar expression level have similar variance across the replicates (samples of the same biological condition). A smooth function that models the dependence of the raw variance on the mean is obtained by fitting the sample mean and variance within replicates for each gene using local regression function.

Compute sample variances transformed to the common scale for mock-treated samples. (Eq. 7 in [6])

var_A = var(base_lncap{:,samples(Aidx)}, 0, 2);

Estimate the shot noise term. (Eq. 8 in [6])

z = mean_A * mean(1./sizeFactors(Aidx));

The helper function estimateNBVarFunc returns an anonymous function that maps the mean estimate to an unbiased raw variance estimate. Bias adjustment due to shot noise and multiple replicates is considered in the anonymous function.

raw_var_func_A = estimateNBVarFunc(mean_A,var_A,sizeFactors(Aidx))
raw_var_func_A = 


Use the anonymous function raw_var_func_A to calculate the sample variance by adding the shot noise bias term to the raw variance. (Eq.9 in [6])

var_fit_A =  raw_var_func_A(mean_A) + z;

Plot the sample variance to its regressed value to check the fit of the variance function.

loglog(mean_A, var_A, '*')
hold on
loglog(mean_A, var_fit_A, '.r')
ylabel('Base Variances')
xlabel('Base Means')
title('Dependence of the Variance on the Mean for Mock-Treated Samples')

The fit (red line) follows the single-gene estimates well, even though the spread of the latter is considerable, as one would expect, given that each raw variance value is estimated from only four values (four mock-treaded replicates).

Empirical Cumulative Distribution Functions

As RNA-seq experiments typically have few replicates, the single-gene estimate of the base variance can deviate wildly from the fitted value. To see whether this might be too wild, the cumulative probability for the ratio of single-gene estimate of the base variance to the fitted value is calculated from the chi-square distribution, as explained in reference [6].

Compute the cumulative probabilities of the variance ratios of mock-treated samples.

degrees_of_freedom = sum(Aidx) - 1;
var_ratio = var_A ./ var_fit_A;
pchisq = chi2cdf(degrees_of_freedom * var_ratio, degrees_of_freedom);

Compute the empirical cumulative density functions (ECDF) stratified by base count levels, and show the ECDFs curves. Group the counts into seven levels.

count_levels = [0 3 12 30 65 130 310];
labels = {'0-3','4-12','13-30','31-65','66-130','131-310','> 311'};
grps = sum(bsxfun(@ge,mean_A,count_levels),2); % stratification
hold on
cm = jet(7);
for i = 1:7
   [Y1,X1] = ecdf(pchisq(grps==i));
plot([0,1],[0,1] ,'k', 'linewidth', 2)
ax = gca;
ax.Box = 'on';
xlabel('Chi-squared probability of residual')
title('Residuals ECDF plot for mock-treated samples')

The ECDF curves of count levels greater than 3 and below 130 follows the diagonal well (black line). If the ECDF curves are below the black line, variance is underestimated. If the ECDF curves are above the black line, variance is overestimated [6]. For very low counts (below 3), the deviations become stronger, but at these levels, shot noise dominates. For the high count cases, the variance is overestimated. The reason might be there are not enough genes with high counts. Get the number of genes in each of the count levels.

ans = 


    0-3        8984  
    4-12       3405  
    13-30      3481  
    31-65      2418  
    66-130     1173  
    131-310     428  
    > 311       123  

Increasing the sequence depth, which in turn increases the number of genes with higher counts, improves the variance estimation.

Testing for Differential Expression

Having estimated and verified the mean-variance dependence, you can test for differentially expressed genes between the samples from the mock- and DHT- treated conditions. Define, as test statistic, the total counts in each condition, k_A and k_B:

k_A = sum(lncap{:, samples(Aidx)}, 2);
k_B = sum(lncap{:, samples(Bidx)}, 2);

Parameters of the new negative binomial distributions for count sums k_A can be calculated by Eqs. 12-14 in [6]:

pooled_mean = mean(lncap{:, samples},2);
mean_k_A = pooled_mean * sum(sizeFactors(Aidx));
var_k_A = mean_k_A + raw_var_func_A(pooled_mean) * sum(sizeFactors(Aidx).^2);

Repeat the same process for k_B:

var_B = var(base_lncap{:,samples(Bidx)}, 0, 2);
raw_var_func_B = estimateNBVarFunc(mean_B,var_B, sizeFactors(Bidx));
mean_k_B = pooled_mean *sum(sizeFactors(Bidx));
var_k_B = mean_k_B + raw_var_func_B(pooled_mean) * sum(sizeFactors(Bidx).^2);

Compute the p-values for the statistical significance of the change from DHT-treated condition to mock-treated condition. The helper function computePVal implements the numerical computation of the p-values presented in the reference [6].

res = table(genes.Feature,'VariableNames',{'Gene'});
res.pvals = computePVal(k_B, mean_k_B, var_k_B, k_A, mean_k_A, var_k_A);

You can empirically adjust the p-values from the multiple tests for false discovery rate (FDR) with the Benjamini-Hochberg procedure [7] using the mafdr function.

res.p_fdr = mafdr(res.pvals, 'BHFDR', true);

Determine the fold change estimated from the DHT-treated to the mock-treated condition.

fold_change = mean_B ./ mean_A;

Determine the base 2 logarithm of the fold change.

res.log2_fold_change = log2(fold_change);

Plot the log2 fold changes against the base means, and color those genes with p-values.

scatter(log2(pooled_mean), res.log2_fold_change,3,(res.p_fdr).^(.02),'o')
xlabel('log2 Mean')
ylabel('log2 Fold Change')
hc = colorbar;
hc.YTickLabel = num2str((get(hc,'Ytick').^50)','%6.1g');
title('Fold Change colored by False Discovery Rate (FDR)')

You can identify up- or down- regulated genes for mean base count levels over 3.

up_idx = find(res.p_fdr < 0.01 & res.log2_fold_change >= 2 & pooled_mean > 3 );
ans =


down_idx = find(res.p_fdr < 0.01 & res.log2_fold_change <= -2 & pooled_mean > 3 );
ans =


This analysis identified 375 statistically significant (out of 20,012 genes) that were differentially up- or down- regulated by hormone treatment. You can sort table res by statistical significant and display the top list.

[~,h] = sort(res.p_fdr);
ans = 

      Gene          pvals          p_fdr       log2_fold_change
    _________    ___________    ___________    ________________

    'FKBP5'                0              0     5.0449         
    'NCAPD3'               0              0     5.4914         
    'CENPN'      6.6707e-300    4.4498e-296     4.8519         
    'LIFR'       2.4939e-284    1.2477e-280     4.0734         
    'DHCR24'     2.0847e-249    8.3437e-246     3.1845         
    'ERRFI1'     9.2602e-246    3.0886e-242     4.0914         
    'GLYATL2'    8.5613e-244    2.4475e-240     3.4522         
    'ACSL3'      2.6073e-225    6.5221e-222     3.6953         
    'ATF3'       1.2368e-193      2.75e-190      3.368         
    'MLPH'       2.0119e-185    4.0263e-182     2.5466         
    'STEAP4'     1.7537e-182    3.1905e-179     9.9479         
    'DBI'         3.787e-173    6.3155e-170     2.7759         
    'ABCC4'      8.5321e-166    1.3134e-162     2.8211         
    'KLK2'       2.7911e-163    3.9897e-160     2.9506         
    'SAT1'       1.2922e-161     1.724e-158     2.6687         
    'CAMK2N1'    8.8046e-161    1.1012e-157    -4.2901         
    'JAM3'       4.7333e-151    5.5719e-148     5.7235         
    'MBOAT2'      1.556e-140    1.7299e-137      3.285         
    'RHOU'       1.4157e-138    1.4911e-135     4.0932         
    'NNMT'       5.6484e-138    5.6517e-135     4.3572         


[1] Li, H., et al., "Determination of Tag Density Required for Digital Transcriptome Analysis: Application to an Androgen-Sensitive Prostate Cancer Model", PNAS, 105(51):20179-84, 2008.

[2] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L., "Ultrafast and Memory-efficient Alignment of Short DNA Sequences to the Human Genome", Genome Biology, 10(3):R25, 2009.

[3] Li, H., et al., "The Sequence Alignment/map (SAM) Format and SAMtools", Bioinformatics, 25(16):2078-9, 2009.

[4] Mortazavi, A., et al., "Mapping and quantifying mammalian transcriptomes by RNA-Seq", Nature Methods, 5:621-8, 2008.

[5] Robinson, M.D. and Oshlack, A., "A Scaling Normalization method for differential Expression Analysis of RNA-seq Data", Genome Biology, 11(3):R25, 2010.

[6] Anders, S. and Huber, W., "Differential Expression Analysis for Sequence Count Data", Genome Biology, 11(10):R106, 2010.

[7] Benjamini, Y. and Hochberg, Y., "Controlling the false discovery rate: a practical and powerful approach to multiple testing", Journal of the Royal Statistical Society, 57(1):289-300, 1995.

Was this topic helpful?