Code covered by the BSD License  

Highlights from
Find an ungapped pattern window from a set of protein sequences

image thumbnail

Find an ungapped pattern window from a set of protein sequences

by

 

05 Dec 2011 (Updated )

This program is to find an ungapped pattern window of certain width from a set of protein sequences

gibbs_script_4_1.m
% Program: Find an ungapped pattern window from a set of protein sequences
% Authors: Xugang Ye and Stephen Altschul
% @NCBI/NIH
% Summer 2011
% Email: yex3@mail.nih.gov
%{
  This demo is to instruct how to use the mex function find_patternwindow_v4_1 to
  find an ungapped pattern window of centain length from a set of protein sequences.

  Syntax:
  [win_score, pos] = find_patternwindow_v4_1(win_len, seqs_array, n_s, n, letter_order, l, a, al_pha, m, n_iter, n_persist, n_shift, p_sampling2)
  Outputs:
                       win_score: scalar (int) --- The Bayesian log-odds ratio score of the returned pattern window 
                       pos: l by n_s (int) --- The positions of the returned pattern window in the sequences
  Inputs:
                       win_len: scalar (int) --- the length of the pattern window, suggested value is between 5 and 20
                       seqs_array: n_s by n (char) --- the set of input sequences
                       n_s: scalar (int) --- Number of input sequences
                       n: scalar (int) --- the maximum length of the input sequences
                       letter_order: 1 by l (char) --- the ordered letter dictionary (there are 20 amino acids)
                       l: scalar (int) --- the number of letters (it's 20 because there are 20 amino acids)
                       a: 1 by m (double) --- the mixture weights of the Dirichlet mixture prior
                       al_pha: m by l (double)  --- the pseudo-count parameters of the Dirichlet mixture prior
                       m: scalar (int) --- Number of Dirichlet components in the Dirichlet mixture prior
                       n_iter: scalar (int) --- Maximum number of Gibbs iterations
                       n_persist: scalar (int) --- Maximum number of persisting iterations when the objective value does not improve
                       n_shift: scalar (int) --- Maximum window shifts, suggested value is 4
                       p_sampling2: scalar (double) --- the probability of type 2 sampling, suggested value is 1.0/(n_s + 1)
%}

% The first thing is to read the sequences. The matlab function fastaread
% in the bioinformatics toolbox can be used
% Input the sequences
%seqs_fasta = fastaread('../data/C2_pten_seqs.txt');
seqs_fasta = fastaread('../data/StephenData1.txt');
n_s = length(seqs_fasta), % The numbder of sequences

% When sequences are read into the data object seqs_fasta, the next is to
% generate the sequence array to store the sequences. If a sequence's
% length is less than the maximum length, just use "-" to make up.
% Below is a simple way to generate the sequence array

% Find the maximum length of the sequences and store the number into n
n = 0;
for i = 1:n_s
    if n < length(seqs_fasta(i).Sequence)
        n = length(seqs_fasta(i).Sequence);
    end
end
n,
s0 = [];
for j = 1:n
    s0 = [s0 '-'];
end
s0,
seqs_array = [];
for i = 1:n_s
    seq_i = s0;
    n_i = length(seqs_fasta(i).Sequence);
    seq_i(1:n_i) = seqs_fasta(i).Sequence;
    seqs_array = [seqs_array;seq_i];
end
seqs_array, % Display the sequence array in command window
showalignment(seqs_array); % Display the sequence array using the bioinformatics toolbox function showalignment

% Set the prior parameters
letter_order = 'ARNDCQEGHILKMFPSTWYV';
load_DM_1; % => a, al_pha, which are hyper parameters
[m,l] = size(al_pha);m,% m is the number of components of DM prior
                       % l is the number of letters (= 20)
% Note that the Dirichlet mixture prior used here is consistent with the
% order of letters: 'ARNDCQEGHILKMFPSTWYV'
% If a different Dirichlet mixture prior is used, its order of letters must be specified 
                       
win_len = 15, % In this demo, 15 is chosen, but different numbers between 5 and 20 can be tried

% The following parameters are set as default
n_iter = 10000,
n_persist = 2000,
n_shift = 4,
p_sampling2 = 1.0/(n_s + 1),
rand('state', 0);
% -------------------------------------------

run_time = cputime;
% Function call
[win_score, pos] = find_patternwindow_v4_1(win_len, seqs_array, n_s, n, letter_order, l, a, al_pha, m, n_iter, n_persist, n_shift, p_sampling2),
run_time = cputime - run_time,
pattern_win = [];
for i = 1:n_s
    pattern_win = [pattern_win;seqs_array( i, pos(i):pos(i)+win_len-1 )];
end
pattern_win, % Display the returned pattern in command window 
showalignment(pattern_win); % Display the returned pattern using the bioinformatics toolbox function showalignment

Contact us