image thumbnail

Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data



MATLAB code for the "Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data" webinar.

%% Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data
% This demonstration performs a genome-wide analysis of a transcription
% factor in the _Arabidopsis Thaliana_ (Thale Cress) model organism. 

%  Copyright 2010 The MathWorks, Inc.

% For enhanced performance, it is recommended that you run this demo on a
% 64-bit platform, because the memory footprint is close to 2 Gb. On a
% 32-bit platform, if you receive |"Out of memory"| errors when running
% this demo, try increasing the virtual memory (or swap space) of your
% operating system or try setting the |3GB| switch (32-bit Windows(R) XP
% only). These techniques are described in this
% < document>. 

%% Introduction
% ChIP-Seq is a technology that is used to identify transcription factors
% that interact with specific DNA sites. First chromatin
% immunoprecipitation enriches DNA-protein complexes using an antibody that
% binds to a particular protein of interest. Then, all the resulting
% fragments are processed using high-throughput sequencing. Sequencing
% fragments are mapped back to the reference genome. By inspecting 
% over-represented regions it is possible to mark the genomic location of
% DNA-protein interactions. 

% In this demonstration, short reads are produced by the paired-end
% Illumina(R) platform. Each fragment is reconstructed from two short reads
% successfully mapped, with this the exact length of the fragment can be
% computed. Using paired-end information from sequence reads maximizes the
% accuracy of predicting DNA-protein binding sites.

%% Data Set
% This demo explores the paired-end ChIP-Seq data generated by Wang
% [1] using the Illumina(R) platform. The data set has been
% courteously submitted to the Gene Expression Omnibus repository with
% accession number GSM424618. The unmapped paired-end reads can be obtained
% from the < NCBI FTP
% site>. 

% This demonstration assumes that you:
% (1) downloaded and uncompressed the unmapped paired-end reads
% (|SRR054715_1.fasta.bz2| and |SRR054715_2.fasta.bz2|), 
% (2) produced a SAM formatted file by mapping the short reads to the Thale
% Cress reference genome, using a mapper such as BWA [2], Bowtie, or SSAHA2
% (which is the mapper used by authors of [1]), and,  
% (3) ordered the SAM formatted file by reference name first, then by
% genomic position.

% For the published version of this demo, 8,653,488 paired-end short reads
% are mapped using the BWA mapper [2]. BWA produced a SAM formatted file
% (|aratha.sam|) with 17,306,976 records (8,653,488 x 2). Repetitive hits
% were randomly chosen, and only one hit is reported, but with lower
% mapping quality. The SAM file was ordered using SAMtools [3] before being
% loaded into MATLAB.

% The last part of the demo also assumes that you downloaded the reference
% genome for the Thale Cress model organism (which includes five
% chromosomes). Uncomment the following lines of code to download the
% reference from the NCBI repository:

% getgenbank('NC_003070','FileFormat','fasta','tofile','ach1.fasta');
% getgenbank('NC_003071','FileFormat','fasta','tofile','ach2.fasta');
% getgenbank('NC_003074','FileFormat','fasta','tofile','ach3.fasta');
% getgenbank('NC_003075','FileFormat','fasta','tofile','ach4.fasta');
% getgenbank('NC_003076','FileFormat','fasta','tofile','ach5.fasta');

%% Indexing the SAM File and Creating a MATLAB(R) Interface to Access it
% Use |BioIndexedFile| to index the SAM formatted file. This allows quick
% and efficient access to the short reads from MATLAB without having to
% load the whole content of the file into memory. Display a small number of
% the short reads. Observe that each short read is mapped to one of the
% five reference chromosomes or it appears as unmapped (*).

bif = BioIndexedFile('sam',which('aratha.sam'),'indexedbykeys',false)
i = 1:1000000:17000000;  % some random sequences

% Use the method |getDictionary| to obtain a list of all the reference
% names present in the source file of |BioIndexedFile|.


% To create local alignments and look at the coverage we need to construct
% a |BioMap|. |BioMap| can use the |BioIndexedFile| as the interface to the
% actual data, thus minimizing the amount of data that is actually loaded
% to the workspace. The remainder of this demonstration focuses on the
% analysis of one of the five chromosomes, |Chr1|. Create a |BioMap| to
% access the short reads mapped to the first chromosome.

bm1 = BioMap(bif,'SubsetReference','Chr1');

% Further inspection of |BioMap| can indicate the range of the mapped short
% reads and how many short reads are mapped to the first chromosome. 

x1 = min(getStart(bm1));
x2 = max(getStop(bm1));
fprintf('Chromosome 1 Range: %d-%d\n',x1,x2)
fprintf('Chromosome 1 Number of Reads: %d\n',bm1.NSeqs)

%% Exploring the Coverage at Different Resolutions
% To explore the coverage for the whole range of the chromosome, a binning
% algorithm is required. The |getBaseCoverage| method produces a coverage
% signal based on effective alignments. It also allows you to specify a bin
% width to control the size (or resolution) of the output signal. However
% internal computations are still performed at the base pair (bp)
% resolution. This means that despite setting a large bin size, narrow
% peaks in the coverage signal can still be observed. Once the coverage
% signal is plotted you can program the figure's data cursor to display the
% genomic position when using the tooltip. You can zoom and pan the figure
% to determine the position and height of the ChIP-Seq peaks.

[cov,bin] = getBaseCoverage(bm1,x1,x2,'binWidth',1000,'binType','max');
ylim([0 100])
xlabel('Base position')
title('Coverage in Chromosome 1')

mdc = @(h,e)  {['Position: ',num2str(get(e,'Position')*[1;0])], ['Y: ',num2str(get(e,'Position')*[0;1])]}
datacursormode on

% It is also possible to explore the coverage signal at the bp resolution
% (also referred to as the _pile-up_ profile). Explore one of the large
% peaks observed in the data at position 4598837.

p1 = 4598837-1000;
p2 = 4598837+1000;

xlabel('Base position')
title('Coverage in Chromosome 1')
datacursormode on

%% Identifying and Filtering Regions with Artifacts
% Observe the large peak with coverage depth of 800+ between positions
% 4599029 and 4599145. Investigate how these reads are aligning to the
% reference chromosome. You can retrieve a subset of these reads enough to
% satisfy a coverage depth of 25, since this is sufficient to understand
% what is happening in this region. Use |getIndex| to obtain indices to
% this subset. Then use |getCompactAlignment| to display the corresponding
% multiple alignment of the short-reads.

i = getIndex(bm1,4599029,4599145,'depth',25);
bmx = getSubset(bm1,i,'indexed',false)

% In addition to visually confirming the alignment, you can also explore
% the mapping quality for all the short reads in this region, as this may
% hint to a potential problem. In this case, less than one percent of the
% short reads have a Phred quality of 60, indicating that the mapper most
% likely found multiple hits within the reference genome, hence assigning a
% lower mapping quality.

i = getIndex(bm1,4599029,4599145);
title('Mapping Quality of the reads between 4599029 and 4599145')
xlabel('Phred Quality Score')
ylabel('Number of Reads')

% Most of the large peaks in this data set occur due to satellite repeat
% regions or due to its closeness to the centromere [4], and show
% characteristics similar to the example just explored. You may explore
% other regions with large peaks using the same procedure.

% To prevent these problematic regions, two techniques are used. First,
% given that the provided data set uses paired-end sequencing, by removing
% the reads that are not aligned in a proper pair reduces the number of
% potential aligner errors or ambiguities. You can achieve this by
% exploring the |flag| field of the SAM formatted file, in which the second
% less significant bit is used to indicate if the short read is mapped in
% a proper pair.

i = find(bitget(getFlag(bm1),2));
bm1_filtered = getSubset(bm1,i)

% Second, consider only uniquely mapped reads. You can detect reads that
% are equally mapped to different regions of the reference sequence by
% looking at the mapping quality, because BWA assigns a lower mapping
% quality (less than 60) to this type of short read.

i = find(getMappingQuality(bm1_filtered)==60);
bm1_filtered = getSubset(bm1_filtered,i)

% Visualize again the filtered data set using both, a coarse resolution
% with 1000 bp bins for the whole chromosome, and a fine resolution for a
% small region of 20,000 bp. Most of the large peaks due to artifacts have
% been removed.

[cov,bin] = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','max');
ylim([0 100])
xlabel('Base Position')
title('Coverage in Chromosome 1 after Filtering')
datacursormode on

p1 = 24275801-10000;
p2 = 24275801+10000;

xlabel('Base Position')
title('Coverage in Chromosome 1 after Filtering')
datacursormode on

%% Recovering Sequencing Fragments from the Paired-End Reads
% In Wang's paper [1] it is hypothesized that paired-end sequencing data
% has the potential to increase the accuracy of the identification of
% chromosome binding sites of DNA associated proteins because the fragment
% length can be derived accurately, while when using single-end
% sequencing it is necessary to resort to a statistical approximation of
% the fragment length, and use it indistinctly for all putative binding
% sites. 

% Use the paired-end reads to reconstruct the sequencing fragments. First,
% get the indices for the forward and the reverse reads in each pair. This
% information is captured in the fifth bit of the |flag| field, according
% to the SAM file format.

fow_idx = find(~bitget(getFlag(bm1_filtered),5));
rev_idx = find(bitget(getFlag(bm1_filtered),5));

% SAM-formatted files use the same header strings to identify pair mates.
% By pairing the header strings you can determine how the short reads in
% |BioMap| are paired. To pair the header strings, simply order them in
% ascending order and use the sorting indices (|hf| and |hr|) to link the
% unsorted header strings.

[~,hf] = sort(getHeader(bm1_filtered,fow_idx));
[~,hr] = sort(getHeader(bm1_filtered,rev_idx));
mate_idx = zeros(numel(fow_idx),1);
mate_idx(hf) = rev_idx(hr);

% Use the resulting |fow_idx| and |mate_idx| variables to retrieve pair
% mates. For example, retrieve the paired-end reads for the first 10
% fragments. 

for j = 1:10
  disp(getInfo(bm1_filtered, fow_idx(j)))
  disp(getInfo(bm1_filtered, mate_idx(j)))

% Use the paired-end indices to construct a new |BioMap| with the minimal
% information needed to represent the sequencing fragments. First,
% calculate the insert sizes.

J = getStop(bm1_filtered, fow_idx);
K = getStart(bm1_filtered, mate_idx);
L = K - J - 1;

% Obtain the new signature (or CIGAR string) for each fragment by using the
% short read original signatures separated by the appropriate number of
% skip CIGAR symbols (|N|).

n = numel(L);
cigars = cell(n,1);
for i = 1:n
   cigars{i} = sprintf('%dN' ,L(i));
cigars = strcat( getSignature(bm1_filtered, fow_idx),...
                 getSignature(bm1_filtered, mate_idx));

% Reconstruct the sequences for the fragments by concatenating the
% respective sequences of the paired-end short reads.

seqs = strcat( getSequence(bm1_filtered, fow_idx),...
               getSequence(bm1_filtered, mate_idx));

% Calculate and plot the fragment size distribution.
J = getStart(bm1_filtered,fow_idx);
K = getStop(bm1_filtered,mate_idx);
L = K - J + 1;
title(sprintf('Fragment Size Distribution\n %d Paired-end Fragments Mapped to Chromosome 1',n))
xlabel('Fragment Size')

% Construct a new |BioMap| to represent the sequencing fragments. With
% this, you will be able explore the coverage signals as well as local
% alignments of the fragments.

bm1_fragments = BioMap('Sequence',seqs,'Signature',cigars,'Start',J)

%% Exploring the Coverage Using Fragment Alignments
% Compare the coverage signal obtained by using the reconstructed fragments
% with the coverage signal obtained by using individual paired-end reads.
% Notice that enriched binding sites, represented by peaks, can be better
% discriminated from the background signal. 

cov_reads = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','max');
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,x1,x2,'binWidth',1000,'binType','max');

xlabel('Base position')
title('Coverage Comparison')
legend('Short Reads','Fragments')
datacursormode on

% Perform the same comparison at the bp resolution. In this dataset, Wang
% [1] investigated a basic helix-loop-helix (_bHLH_) transcription
% factor. _bHLH_ proteins typically bind to a consensus sequence called an
% _E-box_ (with a |CANNTG| motif). Use |fastaread| to load the reference
% chromosome, search for the _E-box_ motif in the 3' and 5' directions, and
% then overlay the motif positions on the coverage signals. This example
% works over the region 1-200,000, however the figure limits are narrowed
% to a 3000 bp region in order to better depict the details.

p1 = 1;
p2 = 200000;

cov_reads = getBaseCoverage(bm1_filtered,p1,p2);
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,p1,p2);

chr1 = fastaread('ach1.fasta');
mp1 = regexp(chr1.Sequence(p1:p2),'CA..TG')+3+p1;
mp2 = regexp(chr1.Sequence(p1:p2),'GT..AC')+3+p1;
motifs = [mp1 mp2];

hold on
xlabel('Base position')
title('Coverage Comparison')
legend('Short Reads','Fragments','E-box motif')
datacursormode on
xlim([111000 114000])

% Observe that it is not possible to associate each peak in the coverage
% signals with an _E-box_ motif. This is because the length of the
% sequencing fragments is comparable to the average motif distance,
% blurring peaks that are close. Plot the distribution of the distances
% between the _E-box_ motif sites.

motif_sep = diff(sort(motifs));
title('Distance (bp) between adjacent E-box motifs')
xlabel('Distance (bp)')

%% Finding Significant Peaks in the Coverage Signal
% Use the function |mspeaks| to perform peak detection with Wavelets
% denoising on the coverage signal of the fragment alignments. Filter
% putative ChIP peaks using a height filter to remove peaks that are not
% enriched by the binding process under consideration.

putative_peaks = mspeaks(bin,cov_fragments,'noiseestimator',20,...
hold on
plot([1;1;1]*motifs(motifs>p1 & motifs<p2),[0;max(ylim);NaN],'r')
legend('Coverage from Fragments','Wavelet Denoised Coverage','Putative ChIP peaks','E-box Motifs')
xlabel('Base position')
title('ChIP-Seq Peak Detection')
datacursormode on
xlim([111000 114000])

% Use the |knnsearch| function to find the closest motif to each one of the
% putative peaks. As expected, most of the enriched ChIP peaks are close to
% an _E-box_ motif [1]. This reinforces the importance of performing peak
% detection at the finest resolution possible (bp resolution) when the
% expected density of binding sites is high, as it is in the case of the
% _E-box_ motif. It also demonstrates that for this type of analysis,
% paired-end sequencing should be considered over single-end sequencing [1].

h = knnsearch(motifs',putative_peaks(:,1));
distance = putative_peaks(:,1)-motifs(h(:))';
title('Distance to Closest E-box Motif for Each Detected Peak')
xlabel('Distance (bp)')

%% References
% [1] Wang C., Xu J., Zhang D., Wilson Z.A., and Zhang D. "An effective
%     approach for identification of in vivo protein-DNA binding sites from
%     paired-end ChIP-Seq data", BMC Bioinformatics, 11:81, Feb 9, 2010. 
% [2] Li H. and Durbin R. "Fast and accurate short read alignment with
%     Burrows-Wheeler transform", Bioinformatics, 25, pp 1754-60, 2009.
% [3] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
%     Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data
%     Processing Subgroup "The Sequence Alignment/map (SAM) Format and
%     SAMtools", Bioinformatics, 25, pp 2078-2079, 2009.
% [4] Jothi R, Cuddapah S, Barski A, Cui K, Zhao K. "Genome-wide
%     identification of in vivo protein-DNA binding sites from ChIP-Seq
%     data", Nucleic Acids Research, 36(16), pp 5221-31, Sep 2008.
% [5] Hoofman B.G., and Jones S.J.M. "Genome-wide identification of
%     DNAprotein interactions using chromatin immunoprecipitation coupled
%     with flow cell sequencing", Journal of Endocrinology 201, pp 1-13,
%     2009. 
% [6] Ramsey SA, Knijnenburg TA, Kennedy KA, Zak DE, Gilchrist M, Gold ES,
%     Johnson CD, Lampano AE, Litvak V, Navarro G, Stolyar T, Aderem A,
%     Shmulevich I. "Genome-wide histone acetylation data improve
%     prediction of mammalian transcription factor binding sites",
%     Bioinformatics, 26(17), pp 2071-5, Sep 1, 2010.
% *< Provide feedback for this demo.>*


Contact us