Code covered by the BSD License  

Highlights from
Gibbs Motif Sampler

Be the first to rate this file! 17 Downloads (last 30 days) File Size: 5.61 KB File ID: #27534

Gibbs Motif Sampler

by Brad Ridder

 

07 May 2010

Finds motifs and the optimum width via Gibbs sampling. No toolboxes required.

| Watch this File

File Information
Description

MOTIFINFO = GIBBSMOTIFSAMPLER(SEQARRAY,MOTIFVECTOR,ALPHABET,OPTIONS)
 Searches for the motifs in a set of sequences via Gibbs
 sampling.
 
 seqArray is a cell vector of sequence strings.
 
 motifVector is a vector of uniformly-spaced motif widths you wish to
 optimize. The program iterates over these widths, and returns the
 one that is the "best", according the technique described in Lawrence et
 al, Science, Vol. 262, No. 5131 pp. 208-214, 1993.
 
 alphabet is a string that denotes the alphabet you want to use. Use 'DNA'
 to use 'ATCG', or 'Protein' to use 'ARNDCEQGHILKMFPSTWYV'. You can enter
 one in of your choice as well.
 
 Options is the options structure, which has some nifty features. The
 fields are:
 
    nTrials: The number of trials to perform for each motif alignment. The
    more the better, but the longer it takes.
 
    pcMode: A string, either 'Fraction' or 'Addition'. If you choose
    'Fraction', a fractional amount of each residue from the background
    will be used as the pseudocount. If you choose 'Addition', a static
    amount will be used for the pseudocount.
 
    pcVector: This is a scalar or vector quantity. This designates amount
    to add to each residue for the pseudocount. If you enter a scalar, that
    amount will be used for each residue in the alphabet. If you enter a
    vector, the amounts will be dispersed according to the order the
    characters are entered in your alphabet. For example, if you want to
    use a pseudocount of 1% of the counts of all residue sequences, you
    would choose pcMode as 'Fraction', and pcVector = 0.01. If you wanted
    to add 25 to each count, you would enter pcMode as 'Addition', and
    pcVector = 25.
 
    nPhaseTrials: Increasing this quantity helps reduce the chance of being
    caught in a local maximum. The higher it is, the better the search, but
    the longer it takes.
 
    nPhaseBand: This is how wide of a search range the sampler uses when
    doing phase correction (as described in Lawrence's paper). I suggest a
    value of 1/3 to 1/2 the length of your hypothesized motif.
 
    If you omit the Options structure, the program will use these defaults:
 
      nTrials = 500;
      pcVector = 0.01;
      pcMode = 'Fraction';
      relativeTol = 0.01;
      nTrials = 500;
      nPhaseTrials = 50;
      phaseBand = 25;
 
 The program returns a structure called MotifInfo, whose fields are:
 
    MotifSites: This is the vector of motif positions.
    BestMotifWidth: The best motif width.
    MaxS: The maximum value of S; which is the information per parameter.
    SVector: A vector of S values, corresponding to each motif width.
    ExecutionTimes: Time elapsed for each tested motif width.
 
 Example:
 
  sequenceArray =
 
      'ATCTCTGAGTCGCTATATCGCTCTCG'
      'TCTCTGAGATCTCCGCGCTCTCGAGTGCGATATGC'
      'TATCGCGCTATGCGCTATTATATCTCTG'
      'CGGCGCTATCGGATATAGCGTA'
      'TCTCTCGAGATCGATCGT'
      'ATTCTCTATATATCTTATATTCTTATGAGAATCTAG'
 
  MotifInfo =
 
          MotifSites: [6x1 double]
      BestMotifWidth: 5
               Motif: {6x1 cell}
                MaxS: 1.4267
             SVector: [3x1 double]
      ExecutionTimes: [3x1 double]
 
  MotifInfo.MotifSites = [22 1 1 7 1 32]'
  MotifInfo.BestMotifWidth = 5
  MotifInfo.Motif =
 
      'TCTCG'
      'TCTCT'
      'TATCG'
      'TATCG'
      'TCTCT'
      'TCTAG'
 
 MotifInfo.MaxS = 1.4267
 MotifInfo.SVector =
 
         1.4267
         1.3362
         1.1279
 
 (The execution times only really matter for your own computer.)
 
 There is some room for improvement in this program:
 
 1. The code isn't optimized; the execution time could be improve greatly.
 
 2. It is possible to modify the algorithm to search for multiple motifs -
 I am not just sure how to go about doing that.
 
 3. It is also possible to determine whether a detected motif is
 statistically significant or not - again, I'm just not sure how to do it.
 
 (c) May 6, 2010 Bradley James Ridder, University of South Florida, for Dr.
 Xiaoning Qian's Computational Molecular Biology Course, Spring 2010 term.

MATLAB release MATLAB 7.7 (R2008b)
Tags for This File  
Everyone's Tags
Tags I've Applied
Add New Tags Please login to tag files.
Please login to add a comment or rating.
Tag Activity for this File
Tag Applied By Date/Time
gibbs Brad Ridder 10 May 2010 13:01:54
sampling Brad Ridder 10 May 2010 13:01:54
markov Brad Ridder 10 May 2010 13:01:54
metropolishastings Brad Ridder 10 May 2010 13:01:54
protein Brad Ridder 10 May 2010 13:01:54
dna Brad Ridder 10 May 2010 13:01:54
nucleotide Brad Ridder 10 May 2010 13:01:54
motif Brad Ridder 10 May 2010 13:01:54
montecarlo Brad Ridder 10 May 2010 13:01:54
motif search Brad Ridder 10 May 2010 13:01:54
optimization Brad Ridder 10 May 2010 13:01:54
combinatoric Brad Ridder 10 May 2010 13:01:54
stochastic Brad Ridder 10 May 2010 13:01:54
biotech Brad Ridder 10 May 2010 13:01:54
sequence Brad Ridder 10 May 2010 13:01:54
statistics Brad Ridder 10 May 2010 13:01:54

Contact us at files@mathworks.com