Gibbs Motif Sampler

Finds motifs and the optimum width via Gibbs sampling. No toolboxes required.
1K Downloads
Updated 7 May 2010

View License

MOTIFINFO = GIBBSMOTIFSAMPLER(SEQARRAY,MOTIFVECTOR,ALPHABET,OPTIONS)
Searches for the motifs in a set of sequences via Gibbs
sampling.

seqArray is a cell vector of sequence strings.

motifVector is a vector of uniformly-spaced motif widths you wish to
optimize. The program iterates over these widths, and returns the
one that is the "best", according the technique described in Lawrence et
al, Science, Vol. 262, No. 5131 pp. 208-214, 1993.

alphabet is a string that denotes the alphabet you want to use. Use 'DNA'
to use 'ATCG', or 'Protein' to use 'ARNDCEQGHILKMFPSTWYV'. You can enter
one in of your choice as well.

Options is the options structure, which has some nifty features. The
fields are:

nTrials: The number of trials to perform for each motif alignment. The
more the better, but the longer it takes.

pcMode: A string, either 'Fraction' or 'Addition'. If you choose
'Fraction', a fractional amount of each residue from the background
will be used as the pseudocount. If you choose 'Addition', a static
amount will be used for the pseudocount.

pcVector: This is a scalar or vector quantity. This designates amount
to add to each residue for the pseudocount. If you enter a scalar, that
amount will be used for each residue in the alphabet. If you enter a
vector, the amounts will be dispersed according to the order the
characters are entered in your alphabet. For example, if you want to
use a pseudocount of 1% of the counts of all residue sequences, you
would choose pcMode as 'Fraction', and pcVector = 0.01. If you wanted
to add 25 to each count, you would enter pcMode as 'Addition', and
pcVector = 25.

nPhaseTrials: Increasing this quantity helps reduce the chance of being
caught in a local maximum. The higher it is, the better the search, but
the longer it takes.

nPhaseBand: This is how wide of a search range the sampler uses when
doing phase correction (as described in Lawrence's paper). I suggest a
value of 1/3 to 1/2 the length of your hypothesized motif.

If you omit the Options structure, the program will use these defaults:

nTrials = 500;
pcVector = 0.01;
pcMode = 'Fraction';
relativeTol = 0.01;
nTrials = 500;
nPhaseTrials = 50;
phaseBand = 25;

The program returns a structure called MotifInfo, whose fields are:

MotifSites: This is the vector of motif positions.
BestMotifWidth: The best motif width.
MaxS: The maximum value of S; which is the information per parameter.
SVector: A vector of S values, corresponding to each motif width.
ExecutionTimes: Time elapsed for each tested motif width.

Example:

sequenceArray =

'ATCTCTGAGTCGCTATATCGCTCTCG'
'TCTCTGAGATCTCCGCGCTCTCGAGTGCGATATGC'
'TATCGCGCTATGCGCTATTATATCTCTG'
'CGGCGCTATCGGATATAGCGTA'
'TCTCTCGAGATCGATCGT'
'ATTCTCTATATATCTTATATTCTTATGAGAATCTAG'

MotifInfo =

MotifSites: [6x1 double]
BestMotifWidth: 5
Motif: {6x1 cell}
MaxS: 1.4267
SVector: [3x1 double]
ExecutionTimes: [3x1 double]

MotifInfo.MotifSites = [22 1 1 7 1 32]'
MotifInfo.BestMotifWidth = 5
MotifInfo.Motif =

'TCTCG'
'TCTCT'
'TATCG'
'TATCG'
'TCTCT'
'TCTAG'

MotifInfo.MaxS = 1.4267
MotifInfo.SVector =

1.4267
1.3362
1.1279

(The execution times only really matter for your own computer.)

There is some room for improvement in this program:

1. The code isn't optimized; the execution time could be improve greatly.

2. It is possible to modify the algorithm to search for multiple motifs -
I am not just sure how to go about doing that.

3. It is also possible to determine whether a detected motif is
statistically significant or not - again, I'm just not sure how to do it.

(c) May 6, 2010 Bradley James Ridder, University of South Florida, for Dr.
Xiaoning Qian's Computational Molecular Biology Course, Spring 2010 term.

Cite As

Brad Ridder (2024). Gibbs Motif Sampler (https://www.mathworks.com/matlabcentral/fileexchange/27534-gibbs-motif-sampler), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2008b
Compatible with any release
Platform Compatibility
Windows macOS Linux
Categories
Find more on Genomics and Next Generation Sequencing in Help Center and MATLAB Answers

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!
Version Published Release Notes
1.0.0.0