This program is a bioinformatics tool that helps biologists find patterns in a set of protein sequences. The method is the first to fully exploit the advantages of Dirichlet mixture models. It starts from a random pattern and iteratively improves the Bayesian log-odds ratio score as the pattern is updated. When the score can no longer be significantly improved, the algorithm terminates and returns a pattern window of pre-specified length. The resulting pattern can serve as a starting point for a later, refined alignment that introduces gaps. We are developing a more advanced version that can introduce gaps into the pattern. We believe the current ungapped version is already very helpful for identifying conserved regions of protein sequences, and that it can save manual work in pattern discovery.
To use the C program, first compile it with mex on Linux/Unix, then run the demo script gibbs_script_4_1.m. You can modify the demo script to suit your needs.
It's very useful for my research at Fox Chase Cancer Center! I hope more people in this field will know this to help their research.
Very useful code. Thanks a lot!
I am happy to find this software. It is very useful.
Great. I have been looking for such a tool for a long time. Thank you for sharing, Xugang
Wonderful~ I have been looking for this for a long time. Thank you. It's very useful.
Looks very nice! The demo looks beautiful. - P.L.
Dear Professor Prandtl,
I used a 20-component Dirichlet mixture prior provided by UCSC. Here is the website:
The default prior in this program is called "recode4.20comp". Note that the order of the amino acid letters differs: UCSC uses "ACDEFGHIKLMNPQRSTVWY", whereas I use "ARNDCQEGHILKMFPSTWYV". If you choose a different prior, make sure its letter order is consistent with the one the program uses.
I am a faculty member at the University of Sheffield, U.K. I find your program very interesting. I have a question about the prior you chose. Can I choose different priors?
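For anyone swapping in another prior, here is a minimal C sketch of the reordering step. It is not part of the package: the names build_permutation and reorder_component are hypothetical, and it assumes each component's 20 parameters have already been read into an array in the UCSC letter order. Only the two letter orders are taken from the reply above.

```c
#include <stddef.h>
#include <string.h>

#define N_AA 20

/* Letter orders as quoted in the reply above. */
static const char UCSC_ORDER[]    = "ACDEFGHIKLMNPQRSTVWY";
static const char PROGRAM_ORDER[] = "ARNDCQEGHILKMFPSTWYV";

/* Hypothetical helper: fill perm so that perm[i] is the position in `from`
 * of the letter found at to[i]. Returns 0 on success, or -1 if either
 * string is not 20 letters long or the two do not contain the same letters. */
int build_permutation(const char *from, const char *to, int perm[N_AA])
{
    int seen[N_AA] = {0};

    if (strlen(from) != N_AA || strlen(to) != N_AA)
        return -1;

    for (int i = 0; i < N_AA; ++i) {
        const char *hit = strchr(from, to[i]);
        if (hit == NULL)
            return -1;              /* letter missing from `from` */
        int idx = (int)(hit - from);
        if (seen[idx])
            return -1;              /* letter repeated in `to` */
        seen[idx] = 1;
        perm[i] = idx;
    }
    return 0;
}

/* Hypothetical helper: copy one 20-element parameter vector from the
 * `from` order (src) into the `to` order (dst) using perm. */
void reorder_component(const double *src, double *dst, const int perm[N_AA])
{
    for (int i = 0; i < N_AA; ++i)
        dst[i] = src[perm[i]];
}
```

Call build_permutation(UCSC_ORDER, PROGRAM_ORDER, perm) once, then apply reorder_component to each of the 20 components of the prior. Checking the return value catches a prior file whose alphabet does not match before any scores are computed with misaligned columns.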
Fantastic! It is exactly what I am looking for! Thank you!
JHMI (Johns Hopkins Medical Institutions)
Yes. By the way, could you let me know which lab you work for? Thanks.
Thanks. Should I put my sequence data in FASTA format into the subfolder called "data"?
Thanks for your interest in using the code. First, upload the folder to your server if you have not done so, then enter the sub-folder "codes". The .c file you mentioned is find_patternwindow_v4_1.c.
That is a computing function written in C. Rather than using the usual gcc compiler directly, you need to compile it with mex so that the function can be called from your MATLAB scripts.
Type "mex -setup"; you may then be given several options. On Linux/Unix, which I recommend, just choose the first option, which uses the gcc mex compiler. When you are asked whether to overwrite the file mexopts.sh, answer yes. The next step is simply to type
You will find that an executable file
is created. You then have the MATLAB function
find_patternwindow_v4_1() to use.
Feel free to ask any questions.
I am trying to use your code. How do I compile the .c file? Thanks.
Fantastic work! I am a researcher at the National Institutes of Health, Bethesda, MD. I find this program very useful for my research in sequence-based domain detection.
Corrected a typo (gapps -> gaps) in the description