Find an ungapped pattern window from a set of protein sequences


05 Dec 2011 (Updated)

This program finds an ungapped pattern window of a specified width in a set of protein sequences.

This program is a bioinformatics tool that helps biologists find patterns in a set of protein sequences. The method is the first to take full advantage of Dirichlet mixture models. It starts from a random pattern and iteratively updates the pattern to improve its Bayesian log-odds ratio score. When the score can no longer be improved significantly, the algorithm terminates and returns a pattern window of the pre-specified width. The resulting pattern can serve as a starting point for a later, refined alignment that introduces gaps. We are developing a more advanced version that can introduce gaps into the pattern. We believe the current ungapped version is already helpful for identifying conserved regions of protein sequences, and it can save manual work in pattern discovery.
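As a rough illustration of what an ungapped window of fixed width means as a data structure, the sketch below counts residues per window column, given one start position per sequence. It is not taken from find_patternwindow_v4_1.c: the function names, the flat counts layout, and the choice to skip unknown letters are all assumptions made for this example. The letter order is the one the author reports using in the comments. A column score such as the log-odds ratio would be computed from counts like these.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_AMINO 20

/* Residue order the author reports using (see the comments on this page). */
static const char AMINO_ORDER[] = "ARNDCQEGHILKMFPSTWYV";

/* Hypothetical helper: index of an upper-case residue letter, or -1 if unknown. */
static int residue_index(char c)
{
    const char *p;
    if (c == '\0') return -1;
    p = strchr(AMINO_ORDER, c);
    return p ? (int)(p - AMINO_ORDER) : -1;
}

/*
 * Hypothetical helper: count residues in each column of an ungapped window.
 *   seqs[i]   sequence i as a NUL-terminated string
 *   starts[i] 0-based start of the window in sequence i
 *   width     window width (the same for every sequence, hence "ungapped")
 *   counts    caller-provided array of width * N_AMINO ints,
 *             indexed as counts[col * N_AMINO + residue]
 * Returns 0 on success, or -1 if a window runs past the end of a sequence
 * (counts is then only partially filled). Letters outside AMINO_ORDER,
 * such as 'X' or lower-case letters, are skipped.
 */
int count_window_columns(const char *const *seqs, const size_t *starts,
                         size_t n_seqs, size_t width, int *counts)
{
    size_t i, col;
    assert(seqs && starts && counts);
    memset(counts, 0, width * N_AMINO * sizeof(int));
    for (i = 0; i < n_seqs; i++) {
        if (starts[i] + width > strlen(seqs[i])) return -1;
        for (col = 0; col < width; col++) {
            int a = residue_index(seqs[i][starts[i] + col]);
            if (a >= 0) counts[col * N_AMINO + a]++;
        }
    }
    return 0;
}
```

An iterative search would repeatedly change one sequence's start position, recount, and keep the change if the score improves; that loop is left out here because the exact update rule is not described on this page.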

To use the C program, first compile it with mex on Linux/Unix, then run the demo script gibbs_script_4_1.m. You can modify the demo script to suit your needs.

Required Products: Bioinformatics Toolbox, Statistics Toolbox
MATLAB release: MATLAB 7.8 (R2009a)
Other requirements: mex compiler on Linux/Unix; for a better view of the results, run win32 and then run MATLAB
08 Jan 2013 Tammy Tatley

It's very useful for my research at Fox Chase Cancer Center! I hope more people in this field will know this to help their research.

Very useful code. Thanks a lot!

08 Dec 2011 Quan Zhang

I am happy to find this software. It is very useful.

08 Dec 2011 Jerry

Great. I have been looking for such a tool for a long time. Thank you for sharing, Xugang.

06 Dec 2011 cathy

Wonderful! I have been looking for this for a long time. Thank you. It's very useful.

Looks very nice! The demo looks beautiful. - P.L.

06 Dec 2011 Xugang Ye

Dear Professor Prandtl,

I used a 20-component Dirichlet mixture prior that is provided by UCSC, here is the website:

The default prior in this program is called "recode4.20comp". Note that the order of the amino acid letters is different: UCSC uses "ACDEFGHIKLMNPQRSTVWY", but I use "ARNDCQEGHILKMFPSTWYV". Make sure the prior and the letter order are consistent.
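For anyone swapping in a different prior, here is a minimal sketch of the reordering step in C. It assumes each mixture component's parameters are available as a vector of 20 values in the prior file's letter order; the function name reorder_by_alphabet is made up for this example and is not part of the program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy values stored in the letter order `from`
 * into the letter order `to`.
 *   src[i] belongs to letter from[i]; on return, dst[j] belongs to to[j].
 * Returns 0 on success, or -1 if the two alphabets differ in length or
 * a letter of `to` does not occur in `from`.
 * Duplicate letters are not checked for.
 */
int reorder_by_alphabet(const char *from, const char *to,
                        const double *src, double *dst)
{
    size_t n, j;
    assert(from && to && src && dst);
    n = strlen(to);
    if (strlen(from) != n) return -1;
    for (j = 0; j < n; j++) {
        const char *p = strchr(from, to[j]);  /* where to[j] sits in `from` */
        if (p == NULL) return -1;
        dst[j] = src[p - from];
    }
    return 0;
}
```

Calling it with from = "ACDEFGHIKLMNPQRSTVWY" and to = "ARNDCQEGHILKMFPSTWYV", once per mixture component, would bring a prior given in the first order into the order this program expects.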


I am a faculty member at the University of Sheffield, U.K. I found your program very interesting. I have a question about the prior you chose. Can I choose different priors?

- P.L.

Fantastic! It is exactly what I am looking for! Thank you!

JHMI (Johns Hopkins Medical Institutions)

Hi, Judy,

Yes. By the way, could you let me know which lab you work for? Thanks.


Hi, Xugang,

Thanks. Should I put my sequence data, in FASTA format, into the subfolder called "data"?


Hi, Judy,

Thanks for your interest in using the code. First, upload the folder to your server if you have not done so. Then enter the subfolder codes. The .c file you mentioned is find_patternwindow_v4_1.c,
a computing function written in C. Instead of compiling it directly with gcc, you need to compile it with mex so that the function can be called from your MATLAB scripts.
Type "mex -setup"; you may be given several options. On Linux/Unix, which I recommend, just choose the first option, which uses the gcc mex compiler. When you are asked whether to overwrite the file, answer yes. Then simply type

mex find_patternwindow_v4_1.c

You will find that an executable file


is created. You can then call the MATLAB function
find_patternwindow_v4_1().

Feel free to ask any questions.


Hello, Xugang,

I am trying to use your code. How do I compile the .c file? Thanks.


Fantastic work! I am a researcher at the National Institutes of Health, Bethesda, MD. I found this program very useful for my research in sequence-based domain detection.

Corrected a typo (gapps -> gaps) in the description

