Bioinformatics Toolbox 3.4
Using HMMs for Profile Analysis of a Protein Family
Sequence comparison is a key tool in bioinformatics. The objective of a HMM profile is to statistically model patterns in biological sequences by identifying combinations of matches, in-dels, and gaps in the alignment of a query sequence to a profile model. HMM profile analysis can be used for multiple sequence alignment, for a database search, to analyze sequence composition and pattern segmentation, and to predict protein structure and locate genes by predicting open reading frames. This demonstration shows how HMM profiles are used to characterize protein families.
Contents
Accessing PFAM Databases
Starting with an already built HMM of a protein family, retrieve the model for the well-known 7-fold transmembrane receptor from the Sanger Institute database. The PFAM key number is PF00002. Also retrieve the pre-aligned sequences used to train this model. More information about the PFAM database can be found at http://www.sanger.ac.uk/. If the Sanger site is not accessible, mirror sites may also be accessed.
hmm_7tm = gethmmprof(2) seed_seqs = gethmmalignment(2,'type','seed');
Warning: Consensus annotation not implemented in this parser,
consensus annotation characters will be ignored.
hmm_7tm =
Name: '7tm_2'
PfamAccessionNumber: 'PF00002.16'
ModelDescription: '7 transmembrane receptor (Secretin family)'
ModelLength: 293
Alphabet: 'AA'
MatchEmission: [293x20 double]
InsertEmission: [293x20 double]
NullEmission: [1x20 double]
BeginX: [294x1 double]
MatchX: [292x4 double]
InsertX: [292x2 double]
DeleteX: [292x2 double]
FlankingInsertX: [2x2 double]
LoopX: [2x2 double]
NullX: [2x1 double]
Models and alignments can also be stored and parsed in later directly from the files.
hmm_7tm = pfamhmmread('pf00002.ls'); %<-- HMMER 2.0 formatted file seed_seqs = fastaread('pf00002.fa'); %<-- FASTA formatted file
Display the names and contents of the first three loaded sequences using the seqdisp command.
seqdisp(seed_seqs([1 2 3]),'row',80)
ans = >Q9YHC6_RANRI/126-382 1 FGAIKTGYTI GHSLS.LISL TAAMIILCIF RKLHCTR.NY IHMHLFMSFI MRAIAVFIK. DIVLF ESG.. .........E 81 SDHCHVGS.. .......... .......... ..VGCKAAMV FFQYCIMANF FWLLVEGLYL HNLLV IS... ..FFSE.KKY 161 FWWYILIGWG APSVFITAWS LAR....... ........VY FED....... ..TGC.WDT. .IESH LW... .WIIKTPILV 241 SILVNFILFI CIIRILVQKL H......SPD VGRNENSQY. .......... .......... ..... ..TRL AKSTLLLIPL 321 FGVHYIM..F AFFPDNFK.. ..VEVKLVFE LILGSF.... QGFVVAVLYC FLNGEV >VIPR1_RAT/140-397 1 YNTVKTGYTI GYSLS.LASL LVAMAILSLF RKLHCTR.NY IHMHLFMSFI LRATAVFIK. DMALF NSG.. .........E 81 IDHCSEAS.. .......... .......... ..VGCKAAVV FFQYCVMANF FWLLVEGLYL YTLLA VS... ..FFSE.RKY 161 FWGYILIGWG VPSVFITIWT VVR....... ........IY FED....... ..FGC.WDTI .INSS LW... .WIIKAPILL 241 SILVNFVLFI CIIRILVQKL R......PPD IGKNDSSPY. .......... .......... ..... ..SRL AKSTLLLIPL 321 FGIHYVM..F AFFPDNFK.. ..AQVKMVFE LVVGSF.... QGFVVAILYC FLNGEV >VIPR_CARAU/100-359 1 FRSVKIGYTI GHSVS.LISL TTAIVILCMS RKLHCTR.NY IHMHLFVSFI LKAIAVFVK. DAVLY DVIQ. .........E 81 SDNCSTAS.. .......... .......... ..VGCKAVIV FFQYCIMASF FWLLVEGLYL HALLA VS... ..FFSE.RKY 161 FWWYILIGWG GPTIFIMAWS FAK....... ........AY FND....... ..VGC.WDII ENSDL FW... .WIIKTPILA 241 SILMNFILFI CIIRILRQKI N......CPD IGRNESNQY. .......... .......... ..... ..SRL AKSTLLLIPL 321 FGINFII..F AFIPENIK.. ..TELRLVFD LILGSF.... QGFVVAVLYC FLNGEV
More information regarding how to store the profile HMM information in a MATLAB® structure is found in the help for hmmprofstruct.
Profile HMM Alignment
To test the profile HMM alignment tool, first erase the periods in sequences used to format the downloaded aligned sequences. Doing this removes the alignment information from the sequences.
seqs = strrep({seed_seqs.Sequence},'.','');
names = {seed_seqs.Header};
Now align all the proteins to the HMM profile.
fprintf('Aligning sequences ') scores = zeros(numel(seqs),1); aligned_seqs = cell(numel(seqs),1); for sn=1:numel(seqs) fprintf('.') [scores(sn),aligned_seqs{sn}]=hmmprofalign(hmm_7tm,seqs{sn}); end fprintf('\n')
Aligning sequences ................................
Next, send the results to the Help Browser to better explore the new multiple alignment. Columns marked with * at the bottom indicate when the model was in a "match" or "delete" state. You can also explore the alignment from the command window; hmmprofmerge with one output argument places the aligned sequences into a char array.
hmmprofmerge(aligned_seqs,names,scores) str = hmmprofmerge(aligned_seqs); str(1:10,1:80)
ans = FGAIKTGYTIGHSLS.LISLTAAMIILCIFRKLHCTR.NYIHMHLFMSFIMRAIAVFIK.DIVLFESG-------- ---E YNTVKTGYTIGYSLS.LASLLVAMAILSLFRKLHCTR.NYIHMHLFMSFILRATAVFIK.DMALFNSG-------- ---E FRSVKIGYTIGHSVS.LISLTTAIVILCMSRKLHCTR.NYIHMHLFVSFILKAIAVFVK.DAVLYDVIQ------- ---E YILVKAIYTLGYSVS.LMSLATGSIILCLFRKLHCTR.NYIHLNLFLSFILRAISVLVK.DDVLYSSS-------- ---G YLSVKALYTVGYSTS.LVTLTTAMVILCRFRKLHCTR.NFIHMNLFVSFMLRAISVFIK.DWILYAEQ-------- ---D LLKLKVMYTVGYSSS.LVMLLVALGILCAFRRLHCTR.NYIHMHLFLSFILRALSNFIK.DAVLFSSD-------- ---D LSTLKQLYTAGYATS.LISLITAVIIFTCFRKFHCTR.NYIHINLFVSFILRATAVFIK.DAVLFSDE-------- ---T FSTVKIIYTTGHSIS.IVALCVAIAILVALRRLHCPR.NYIHTQLFATFILKASAVFLK.DAAIFQGD-------- ---S FERLYVMYTVGYSIS.FGSLAVAILIIGYFRRLHCTR.NYIHMHLFVSFMLRATSIFVK.DRVVHAHIGVKELESL IMQD FDRLGMIYTVGYSVS.LASLTVAVLILAYFRRLHCTR.NYIHMHLFLSFMLRAVSIFVK.DAVLYSGATLDEAERL T-EE
Looking for Evidence with Sequence Comparison
Having a profile HHM which describes this family has several advantages over plain sequence comparison. Suppose that you have a new oligonucleotide that you want to relate to the 7-transmembrane receptor family. For this example, get a nucleotide sequence from NCBI and translate it to an amino acid sequence.
accs='NM_175642'; nts=getgenbank(accs,'sequenceonly',true);
If you don't have a live web connection you can load the data from the GenBank® formatted file: (uncomment the following lines)
% GBOUT = genbankread('nm175642.txt'); % nts = GBOUT.Sequence; NM175642=nt2aa(nts,'frame',1); NM175642(NM175642 == '*') = []; seqdisp(NM175642,'row',80)
ans = 1 GSGRWIPPPP PLLFFFPFFP SAFRLHALFF FQSILIFEAE PGGKMGNGDW DRRGVSPSLL LGRV PVLVSI KSLPELSEWK 81 AKHRMKAVRN LLIYIFSTYL LVMFGFNAAQ DFWCSTLVKG VIYGSYSVSE MFPKNFTNCT WTLE NPDPTK YSIYLKFSKK 161 DLSCSNFSLL AYQFDHFSHE KIKDLLRKNH SIMQLCSSKN AFVFLQYDKN FIQIRRVFPT DFPG LQKKVE EDQKSFFEFL 241 VLNKVSPSQF GCHVLCTWLE SCLKSENGRT ESCGIMYTKC TCPQHLGEWG IDDQSLVLLN NVVL PLNEQT EGCLTQELQT 321 TQVCNLTREA KRPPKEEFGM MGDHTIKSQR PRSVHEKRVP QEQADAAKFM AQTGESGVEE WSQW SACSVT CGQGSQVRTR 401 TCVSPYGTHC SGPLRESRVC NNTALCPVHG VWEEWSPWSL CSFTCGRGQR TRTRSCTPPQ YGGR PCEGPE THHKPCNIAL 481 CPVDGQWQEW SSWSHCSVTC SNGTQQRSRQ CTAAAHGGSE CRGPWAESRE CYNPECTANG QWNQ WGHWSG CSKSCDGGWE 561 RRMRTCQGAA VTGQQCEGTG EEVRRCSEQR CPAPYEICPE DYLISMVWKR TPAGDLAFNQ CPLN ATGTTS RRCSLSLHGV 641 ASWEQPSFAR CISNEYRHLQ HSIKEHLAKG QRMLAGDGMS QVTKTLLDLT QRKNFYAGDL LVSV EILRNV TDTFKRASYI 721 PASDGVQNFF QIVSNLLDEE NKEKWEDAQQ IYPGSIELMQ VIEDFIHIVG MGMMDFQNSY LMTG NVVASI QKLPAASVLT 801 DINFPMKGRK GMVDWARNSE DRVVIPKSIF TPVSSKELDE SSVFVLGAVL YKNLDLILPT LRNY TVVNSK VIVVTIRPEP 881 KTTDSFLEIE LAHLANGTLN PYCVLWDDSK SNESLGTWST QGCKTVLTDA SHTKCLCDRL STFA ILAQQP REIVMESSGT 961 PSVTLIVGSG LSCLALITLA VVYAALWRYI RSERSIILIN FCLSIISSNI LILVGQTQTH NKSI CTTTTA FLHFFFLASF 1041 CWVLTEAWQS YMAVTGKIRT RLIRKRFLCL GWGLPALVVA TSVGFTRTKG YGTDHYCWLS LEGG LLYAFV GPAAAVVLVN 1121 MVIGILVFNK LVSRDGILDK KLKHRAGQMS EPHSGLTLKC AKCGVVSTTA LSATTASNAM ASLW SSCVVL PLLALTWMSA 1201 VLAMTDKRSI LFQILFAVFD SLQGFVIVMV HCILRREVQD AFRCRLRNCQ DPINADSSSS FPNG HAQIMT DFEKDVDIAC 1281 RSVLHKDIGP CRAATITGTL SRISLNDDEE EKGTNPEGLS YSTLPGNVIS KVIIQQPTGL HMPM SMNELS NPCLKKENTE 1361 LRRTVYLCTD DNLRGADMDI VHPQERMMES DYIVMPRSSV STQPSMKEES KMNIGMETLP HERL LHYKVN PEFNMNPPVM 1441 DQFNMNLDQH LAPQEHMQNL PFEPRTAVKN FMASELDDNV GLSRSETGST ISMSSLERRK SRYS DLDFEK VMHTRKRHME 1521 LFQELNQKFQ TLDRFRDIPN TSSMENPAPN KNPWDTFKPP SEYQHYTTIN VLDTEAKDTL ELRP AEWEKC LNLPLDVQEG 1601 DFQTEVGNQK ELRRHSQIHH DLGSLTFPPG QCDFPLSGPF LCQHQCFLMA ISHWSGWRET AKCT VLTTPC CKYPWNGFVR 1681 SLINLKQGFH VVTASCGLVF RKLHHEAQCI YLCSFSLQSV WPLLHFLLYN IKAKFLSLNE CLLS YILHCF KCNKVIISLL 1761 YEYISHLYYC SFLKALSSFL CCSCVNINVV WCKTFVYISV FCILFWICTE IYIFIKFSFS I
Try to relate it to the first sequence in the 7tm_2 multiple alignment using dot plots. Is there any evidence that they are related?
Q9YHC6 = seqs{1};
seqdotplot(Q9YHC6,NM175642,7,2)
ylabel(names{1}, 'interpreter', 'none'); xlabel('NM 175642');
set(gca,'Position',[0.07,0.02,0.90,0.70]);
Warning: Match matrix has more points than available screen pixels.
Scaling image by factors of 2 in X and 3 in Y
Dot plots only take exact matches into account. The Smith-Waterman algorithm can make use of scoring matrices. Scoring matrices can capture the probability of substitution of symbols. The sequences in this example are known to be only distantly related, so BLOSUM30 is a good choice for the scoring matrix.
sc_aa_affine = swalign(NM175642,Q9YHC6,'ScoringMatrix','blosum30',... 'gapopen',5,'extendgap',3,'showscore',true)
sc_aa_affine = 82.2000
Is either of these two examples enough evidence to affirm that these sequences are related? One way to test this is to randomly create a fake sequence with the same distribution of amino acids and see how it aligns to the family. Notice that the scores from the local alignment are not significantly lower than those for the real protein (75.2 vs 82.2).
rand('state',0); % initialize the state of the random number generator fakeSeq = randseq(1000,'FROMSTRUCTURE',aacount(Q9YHC6)); sc_fk_affine = swalign(fakeSeq,Q9YHC6,'ScoringMatrix','blosum30',... 'gapopen',5,'extendgap',3,'showscore',true)
sc_fk_affine = 75.2000
Align both sequences to the family using the profile HMM and observe the difference, now the score of aligning the target sequence to the family profile is significantly larger than aligning the fake sequence.
sc_aa_hmm = hmmprofalign(hmm_7tm,NM175642) sc_fk_hmm = hmmprofalign(hmm_7tm,fakeSeq)
sc_aa_hmm = 336.8018 sc_fk_hmm = -168.2931
Exploring Profile HMM Alignment Options
You can compare the recent alignments graphically using the 'showscore' option to the hmmprofalign function.
Display NM_175642 aligned to the 7tm_2 family.
aa = NM175642; hmmprofalign(hmm_7tm,aa,'showscore',true); title('log-odds score for best path: NM 175642');
Display the "fake" sequence aligned to the 7tm_2 family.
hmmprofalign(hmm_7tm,fakeSeq,'showscore',true); title('log-odds score for best path: fake sequence');
Display NM_175642 globally aligned to the 7tm_2 family.
[sc_aa_hmm,align,ptrs] = hmmprofalign(hmm_7tm,aa); seq = aa(min(ptrs):max(ptrs)); hmmprofalign(hmm_7tm,seq,'showscore',true); title('log-odds score for best path: NM 175642 aligned globally');
Align tandemly repeated domains.
repeats = randseq(1000,'FROMSTRUCTURE',aacount(aa)); %artificial example repeats(201:200+length(aa(876:1154))) = aa(876:1154); repeats(501:500+length(aa(876:1154))) = aa(876:1154); repeats(701:700+length(aa(876:1154))) = aa(876:1154); hmmprofalign(hmm_7tm,repeats,'showscore',true); title('log-odds score for best path: NM 175642 tandem repeats');
Searching for Fragment Domains
There are two options for searching for fragment domains: (1) download the respective profile HMM for fragment alignments from the PFAM database,
hmm_7tm_f = gethmmprof(2,'mode','fs')
Warning: Consensus annotation not implemented in this parser,
consensus annotation characters will be ignored.
hmm_7tm_f =
Name: '7tm_2'
PfamAccessionNumber: 'PF00002.16'
ModelDescription: '7 transmembrane receptor (Secretin family)'
ModelLength: 293
Alphabet: 'AA'
MatchEmission: [293x20 double]
InsertEmission: [293x20 double]
NullEmission: [1x20 double]
BeginX: [294x1 double]
MatchX: [292x4 double]
InsertX: [292x2 double]
DeleteX: [292x2 double]
FlankingInsertX: [2x2 double]
LoopX: [2x2 double]
NullX: [2x1 double]
or (2) manually activate the B->M and M->E transition probabilities:
hmm_7tm_f = hmm_7tm; hmm_7tm_f.BeginX(3:end)=.002; hmm_7tm_f.MatchX(1:end-1,4)=.002;
Create a random sequence, or fragment model, with a small insertion of the NM_175642 protein:
fragment = randseq(1000,'FROMSTRUCTURE',aacount(aa));
fragment(501:550) = aa(901:950);
Try aligning both models, the global and fragment model:
hmmprofalign(hmm_7tm,fragment,'showscore',true); title('log-odds score for best path: PF00002 global '); hmmprofalign(hmm_7tm_f,fragment,'showscore',true); title('log-odds score for best path: PF00002 fragment domains');
Exploring the Profile HMMs
The function showhmmprof is an interactive tool to explore the profile HMM. Try right and left mouse clicks over the model figures. There are three plots for each model: (1) the symbol emission probabilities in the Match states, (2) the symbol emission probabilities in the Insert states, and (3) the Transition probabilities.
showhmmprof(hmm_7tm,'scale','logodds')
An alternative method to explore a profile HMM is by creating a sequence logo from the multiple alignment. A sequence logo displays the frequency of bases found at each position within a given region, usually for a binding site. Using the hmm_7tm sequences, consider the portion of the Parathyroid hormone/parathyroid hormone-related peptide receptor (precursor) found at the n-terminus of the PTRR_Human sequence. seqlogo allows a quick visual comparison of how well this region is conserved across the 7tm family.
seqlogo(str,'startat',1,'endat',20,'alphabet','AA')
Profile Estimation
Finally, profile HMMs can also be estimated from a multiple alignment, as new sequences related to the family are found, it is possible to re-estimate the model parameters.
hmm_7tm_new = hmmprofestimate(hmm_7tm,str)
hmm_7tm_new =
Name: '7tm_2'
PfamAccessionNumber: 'PF00002.15'
ModelDescription: '7 transmembrane receptor (Secretin family)'
ModelLength: 293
Alphabet: 'AA'
MatchEmission: [293x20 double]
InsertEmission: [293x20 double]
NullEmission: [1x20 double]
BeginX: [294x1 double]
MatchX: [292x4 double]
InsertX: [292x2 double]
DeleteX: [292x2 double]
FlankingInsertX: [2x2 double]
LoopX: [2x2 double]
NullX: [2x1 double]
In case your sequences are not pre-aligned, you can also utilize multialign before estimating a new HMM profile. It is possible to refine the HMM profile by re-aligning the sequences to the model and re-estimating the model iteratively until you converge to a locally optimal model.
aligned_seqs = multialign(seqs); hmm_7tm_ma = hmmprofestimate(hmmprofstruct(270),aligned_seqs) showhmmprof(hmm_7tm_ma,'scale','logodds') close; close; % close insertion emission prob. and transition prob.
hmm_7tm_ma =
ModelLength: 270
Alphabet: 'AA'
MatchEmission: [270x20 double]
InsertEmission: [270x20 double]
NullEmission: [1x20 double]
BeginX: [271x1 double]
MatchX: [269x4 double]
InsertX: [269x2 double]
DeleteX: [269x2 double]
FlankingInsertX: [2x2 double]
LoopX: [2x2 double]
NullX: [2x1 double]
Align all sequences to the new model and show them in the Help Browser.
fprintf('Aligning sequences ') scores = zeros(numel(seqs),1); aligned_seqs = cell(numel(seqs),1); for sn=1:numel(seqs) fprintf('.') [scores(sn),aligned_seqs{sn}]=hmmprofalign(hmm_7tm_ma,seqs{sn} ); end fprintf('\n') hmmprofmerge(aligned_seqs,names,scores) str = hmmprofmerge(aligned_seqs); str(1:10,1:80)
Aligning sequences ................................ ans = FGAIKTGYTIGHSLSLISLTAAMIILCIF.RKLHCTRNYIHMHLFMSFIMRAIAVFIKDIVLFES--.GESDHCH. .... YNTVKTGYTIGYSLSLASLLVAMAILSLF.RKLHCTRNYIHMHLFMSFILRATAVFIKDMALFNS--.GEIDHCS. .... FRSVKIGYTIGHSVSLISLTTAIVILCMS.RKLHCTRNYIHMHLFVSFILKAIAVFVKDAVLYDVIQ.-ESDNCS. .... YILVKAIYTLGYSVSLMSLATGSIILCLF.RKLHCTRNYIHLNLFLSFILRAISVLVKDDVLYSS--.SGTLHCP. .... YLSVKALYTVGYSTSLVTLTTAMVILCRF.RKLHCTRNFIHMNLFVSFMLRAISVFIKDWILYAE--.QDSSHCF. .... LLKLKVMYTVGYSSSLVMLLVALGILCAF.RRLHCTRNYIHMHLFLSFILRALSNFIKDAVLFSS--.DDAIHCD. .... LSTLKQLYTAGYATSLISLITAVIIFTCF.RKFHCTRNYIHINLFVSFILRATAVFIKDAVLFSD--.ETQNHCL. .... FSTVKIIYTTGHSISIVALCVAIAILVAL.RRLHCPRNYIHTQLFATFILKASAVFLKDAAIFQG--.DSTDHCS. .... FERLYVMYTVGYSISFGSLAVAILIIGYF.RRLHCTRNYIHMHLFVSFMLRATSIFVKDRVVHAHIGvKELESLI. .... FDRLGMIYTVGYSVSLASLTVAVLILAYF.RRLHCTRNYIHMHLFLSFMLRAVSIFVKDAVLYSGATlDEAERLT. ....
Store