% Program: Find an ungapped pattern window from a set of protein sequences
% Authors: Xugang Ye and Stephen Altschul
% @NCBI/NIH
% Summer 2011
% Email: yex3@mail.nih.gov
%{
This demo is to instruct how to use the mex function find_patternwindow_v4_1 to
find an ungapped pattern window of centain length from a set of protein sequences.
Syntax:
[win_score, pos] = find_patternwindow_v4_1(win_len, seqs_array, n_s, n, letter_order, l, a, al_pha, m, n_iter, n_persist, n_shift, p_sampling2)
Outputs:
win_score: scalar (int) --- The Bayesian log-odds ratio score of the returned pattern window
pos: l by n_s (int) --- The positions of the returned pattern window in the sequences
Inputs:
win_len: scalar (int) --- the length of the pattern window, suggested value is between 5 and 20
seqs_array: n_s by n (char) --- the set of input sequences
n_s: scalar (int) --- Number of input sequences
n: scalar (int) --- the maximum length of the input sequences
letter_order: 1 by l (char) --- the ordered letter dictionary (there are 20 amino acids)
l: scalar (int) --- the number of letters (it's 20 because there are 20 amino acids)
a: 1 by m (double) --- the mixture weights of the Dirichlet mixture prior
al_pha: m by l (double) --- the pseudo-count parameters of the Dirichlet mixture prior
m: scalar (int) --- Number of Dirichlet components in the Dirichlet mixture prior
n_iter: scalar (int) --- Maximum number of Gibbs iterations
n_persist: scalar (int) --- Maximum number of persisting iterations when the objective value does not improve
n_shift: scalar (int) --- Maximum window shifts, suggested value is 4
p_sampling2: scalar (double) --- the probability of type 2 sampling, suggested value is 1.0/(n_s + 1)
%}
% The first thing is to read the sequences. The matlab function fastaread
% in the bioinformatics toolbox can be used
% Input the sequences
%seqs_fasta = fastaread('../data/C2_pten_seqs.txt');
seqs_fasta = fastaread('../data/StephenData1.txt');
n_s = length(seqs_fasta), % The numbder of sequences
% When sequences are read into the data object seqs_fasta, the next is to
% generate the sequence array to store the sequences. If a sequence's
% length is less than the maximum length, just use "-" to make up.
% Below is a simple way to generate the sequence array
% Find the maximum length of the sequences and store the number into n
n = 0;
for i = 1:n_s
if n < length(seqs_fasta(i).Sequence)
n = length(seqs_fasta(i).Sequence);
end
end
n,
s0 = [];
for j = 1:n
s0 = [s0 '-'];
end
s0,
seqs_array = [];
for i = 1:n_s
seq_i = s0;
n_i = length(seqs_fasta(i).Sequence);
seq_i(1:n_i) = seqs_fasta(i).Sequence;
seqs_array = [seqs_array;seq_i];
end
seqs_array, % Display the sequence array in command window
showalignment(seqs_array); % Display the sequence array using the bioinformatics toolbox function showalignment
% Set the prior parameters
letter_order = 'ARNDCQEGHILKMFPSTWYV';
load_DM_1; % => a, al_pha, which are hyper parameters
[m,l] = size(al_pha);m,% m is the number of components of DM prior
% l is the number of letters (= 20)
% Note that the Dirichlet mixture prior used here is consistent with the
% order of letters: 'ARNDCQEGHILKMFPSTWYV'
% If a different Dirichlet mixture prior is used, its order of letters must be specified
win_len = 15, % In this demo, 15 is chosen, but different numbers between 5 and 20 can be tried
% The following parameters are set as default
n_iter = 10000,
n_persist = 2000,
n_shift = 4,
p_sampling2 = 1.0/(n_s + 1),
rand('state', 0);
% -------------------------------------------
run_time = cputime;
% Function call
[win_score, pos] = find_patternwindow_v4_1(win_len, seqs_array, n_s, n, letter_order, l, a, al_pha, m, n_iter, n_persist, n_shift, p_sampling2),
run_time = cputime - run_time,
pattern_win = [];
for i = 1:n_s
pattern_win = [pattern_win;seqs_array( i, pos(i):pos(i)+win_len-1 )];
end
pattern_win, % Display the returned pattern in command window
showalignment(pattern_win); % Display the returned pattern using the bioinformatics toolbox function showalignment