Skip to Main Content Skip to Search
Home |   Select Country  Choose Country  |  Contact Us  |  Cart Store 
Create Account | Log In
Products & Services Solutions Academia Support User Community Company
spacer spacer spacer spacer spacer spacer

 

Bioinformatics Toolbox 3.4

Using HMMs for Profile Analysis of a Protein Family

Sequence comparison is a key tool in bioinformatics. The objective of a HMM profile is to statistically model patterns in biological sequences by identifying combinations of matches, in-dels, and gaps in the alignment of a query sequence to a profile model. HMM profile analysis can be used for multiple sequence alignment, for a database search, to analyze sequence composition and pattern segmentation, and to predict protein structure and locate genes by predicting open reading frames. This demonstration shows how HMM profiles are used to characterize protein families.

Contents

Accessing PFAM Databases

Starting with an already built HMM of a protein family, retrieve the model for the well-known 7-fold transmembrane receptor from the Sanger Institute database. The PFAM key number is PF00002. Also retrieve the pre-aligned sequences used to train this model. More information about the PFAM database can be found at http://www.sanger.ac.uk/. If the Sanger site is not accessible, mirror sites may also be accessed.

hmm_7tm = gethmmprof(2)
seed_seqs = gethmmalignment(2,'type','seed');
Warning: Consensus annotation not implemented in this parser,
consensus annotation characters will be ignored.

hmm_7tm =

                   Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.16'
       ModelDescription: '7 transmembrane receptor (Secretin family)'
            ModelLength: 293
               Alphabet: 'AA'
          MatchEmission: [293x20 double]
         InsertEmission: [293x20 double]
           NullEmission: [1x20 double]
                 BeginX: [294x1 double]
                 MatchX: [292x4 double]
                InsertX: [292x2 double]
                DeleteX: [292x2 double]
        FlankingInsertX: [2x2 double]
                  LoopX: [2x2 double]
                  NullX: [2x1 double]

Models and alignments can also be stored and parsed in later directly from the files.

hmm_7tm = pfamhmmread('pf00002.ls');     %<-- HMMER 2.0 formatted file
seed_seqs = fastaread('pf00002.fa');     %<-- FASTA formatted file

Display the names and contents of the first three loaded sequences using the seqdisp command.

seqdisp(seed_seqs([1 2 3]),'row',80)
ans =

>Q9YHC6_RANRI/126-382
  1  FGAIKTGYTI GHSLS.LISL TAAMIILCIF RKLHCTR.NY IHMHLFMSFI MRAIAVFIK. DIVLF
ESG.. .........E
 81  SDHCHVGS.. .......... .......... ..VGCKAAMV FFQYCIMANF FWLLVEGLYL HNLLV
IS... ..FFSE.KKY
161  FWWYILIGWG APSVFITAWS LAR....... ........VY FED....... ..TGC.WDT. .IESH
LW... .WIIKTPILV
241  SILVNFILFI CIIRILVQKL H......SPD VGRNENSQY. .......... .......... .....
..TRL AKSTLLLIPL
321  FGVHYIM..F AFFPDNFK.. ..VEVKLVFE LILGSF.... QGFVVAVLYC FLNGEV

>VIPR1_RAT/140-397
  1  YNTVKTGYTI GYSLS.LASL LVAMAILSLF RKLHCTR.NY IHMHLFMSFI LRATAVFIK. DMALF
NSG.. .........E
 81  IDHCSEAS.. .......... .......... ..VGCKAAVV FFQYCVMANF FWLLVEGLYL YTLLA
VS... ..FFSE.RKY
161  FWGYILIGWG VPSVFITIWT VVR....... ........IY FED....... ..FGC.WDTI .INSS
LW... .WIIKAPILL
241  SILVNFVLFI CIIRILVQKL R......PPD IGKNDSSPY. .......... .......... .....
..SRL AKSTLLLIPL
321  FGIHYVM..F AFFPDNFK.. ..AQVKMVFE LVVGSF.... QGFVVAILYC FLNGEV

>VIPR_CARAU/100-359
  1  FRSVKIGYTI GHSVS.LISL TTAIVILCMS RKLHCTR.NY IHMHLFVSFI LKAIAVFVK. DAVLY
DVIQ. .........E
 81  SDNCSTAS.. .......... .......... ..VGCKAVIV FFQYCIMASF FWLLVEGLYL HALLA
VS... ..FFSE.RKY
161  FWWYILIGWG GPTIFIMAWS FAK....... ........AY FND....... ..VGC.WDII ENSDL
FW... .WIIKTPILA
241  SILMNFILFI CIIRILRQKI N......CPD IGRNESNQY. .......... .......... .....
..SRL AKSTLLLIPL
321  FGINFII..F AFIPENIK.. ..TELRLVFD LILGSF.... QGFVVAVLYC FLNGEV

More information regarding how to store the profile HMM information in a MATLAB® structure is found in the help for hmmprofstruct.

Profile HMM Alignment

To test the profile HMM alignment tool, first erase the periods in sequences used to format the downloaded aligned sequences. Doing this removes the alignment information from the sequences.

seqs = strrep({seed_seqs.Sequence},'.','');
names = {seed_seqs.Header};

Now align all the proteins to the HMM profile.

fprintf('Aligning sequences ')
scores = zeros(numel(seqs),1);
aligned_seqs = cell(numel(seqs),1);
for sn=1:numel(seqs)
    fprintf('.')
    [scores(sn),aligned_seqs{sn}]=hmmprofalign(hmm_7tm,seqs{sn});
end
fprintf('\n')
Aligning sequences ................................

Next, send the results to the Help Browser to better explore the new multiple alignment. Columns marked with * at the bottom indicate when the model was in a "match" or "delete" state. You can also explore the alignment from the command window; hmmprofmerge with one output argument places the aligned sequences into a char array.

hmmprofmerge(aligned_seqs,names,scores)
str = hmmprofmerge(aligned_seqs);
str(1:10,1:80)
ans =

FGAIKTGYTIGHSLS.LISLTAAMIILCIFRKLHCTR.NYIHMHLFMSFIMRAIAVFIK.DIVLFESG--------
---E
YNTVKTGYTIGYSLS.LASLLVAMAILSLFRKLHCTR.NYIHMHLFMSFILRATAVFIK.DMALFNSG--------
---E
FRSVKIGYTIGHSVS.LISLTTAIVILCMSRKLHCTR.NYIHMHLFVSFILKAIAVFVK.DAVLYDVIQ-------
---E
YILVKAIYTLGYSVS.LMSLATGSIILCLFRKLHCTR.NYIHLNLFLSFILRAISVLVK.DDVLYSSS--------
---G
YLSVKALYTVGYSTS.LVTLTTAMVILCRFRKLHCTR.NFIHMNLFVSFMLRAISVFIK.DWILYAEQ--------
---D
LLKLKVMYTVGYSSS.LVMLLVALGILCAFRRLHCTR.NYIHMHLFLSFILRALSNFIK.DAVLFSSD--------
---D
LSTLKQLYTAGYATS.LISLITAVIIFTCFRKFHCTR.NYIHINLFVSFILRATAVFIK.DAVLFSDE--------
---T
FSTVKIIYTTGHSIS.IVALCVAIAILVALRRLHCPR.NYIHTQLFATFILKASAVFLK.DAAIFQGD--------
---S
FERLYVMYTVGYSIS.FGSLAVAILIIGYFRRLHCTR.NYIHMHLFVSFMLRATSIFVK.DRVVHAHIGVKELESL
IMQD
FDRLGMIYTVGYSVS.LASLTVAVLILAYFRRLHCTR.NYIHMHLFLSFMLRAVSIFVK.DAVLYSGATLDEAERL
T-EE

Looking for Evidence with Sequence Comparison

Having a profile HHM which describes this family has several advantages over plain sequence comparison. Suppose that you have a new oligonucleotide that you want to relate to the 7-transmembrane receptor family. For this example, get a nucleotide sequence from NCBI and translate it to an amino acid sequence.

accs='NM_175642';
nts=getgenbank(accs,'sequenceonly',true);

If you don't have a live web connection you can load the data from the GenBank® formatted file: (uncomment the following lines)

% GBOUT = genbankread('nm175642.txt');
% nts = GBOUT.Sequence;

NM175642=nt2aa(nts,'frame',1);
NM175642(NM175642 == '*') = [];
seqdisp(NM175642,'row',80)
ans =

   1  GSGRWIPPPP PLLFFFPFFP SAFRLHALFF FQSILIFEAE PGGKMGNGDW DRRGVSPSLL LGRV
PVLVSI KSLPELSEWK
  81  AKHRMKAVRN LLIYIFSTYL LVMFGFNAAQ DFWCSTLVKG VIYGSYSVSE MFPKNFTNCT WTLE
NPDPTK YSIYLKFSKK
 161  DLSCSNFSLL AYQFDHFSHE KIKDLLRKNH SIMQLCSSKN AFVFLQYDKN FIQIRRVFPT DFPG
LQKKVE EDQKSFFEFL
 241  VLNKVSPSQF GCHVLCTWLE SCLKSENGRT ESCGIMYTKC TCPQHLGEWG IDDQSLVLLN NVVL
PLNEQT EGCLTQELQT
 321  TQVCNLTREA KRPPKEEFGM MGDHTIKSQR PRSVHEKRVP QEQADAAKFM AQTGESGVEE WSQW
SACSVT CGQGSQVRTR
 401  TCVSPYGTHC SGPLRESRVC NNTALCPVHG VWEEWSPWSL CSFTCGRGQR TRTRSCTPPQ YGGR
PCEGPE THHKPCNIAL
 481  CPVDGQWQEW SSWSHCSVTC SNGTQQRSRQ CTAAAHGGSE CRGPWAESRE CYNPECTANG QWNQ
WGHWSG CSKSCDGGWE
 561  RRMRTCQGAA VTGQQCEGTG EEVRRCSEQR CPAPYEICPE DYLISMVWKR TPAGDLAFNQ CPLN
ATGTTS RRCSLSLHGV
 641  ASWEQPSFAR CISNEYRHLQ HSIKEHLAKG QRMLAGDGMS QVTKTLLDLT QRKNFYAGDL LVSV
EILRNV TDTFKRASYI
 721  PASDGVQNFF QIVSNLLDEE NKEKWEDAQQ IYPGSIELMQ VIEDFIHIVG MGMMDFQNSY LMTG
NVVASI QKLPAASVLT
 801  DINFPMKGRK GMVDWARNSE DRVVIPKSIF TPVSSKELDE SSVFVLGAVL YKNLDLILPT LRNY
TVVNSK VIVVTIRPEP
 881  KTTDSFLEIE LAHLANGTLN PYCVLWDDSK SNESLGTWST QGCKTVLTDA SHTKCLCDRL STFA
ILAQQP REIVMESSGT
 961  PSVTLIVGSG LSCLALITLA VVYAALWRYI RSERSIILIN FCLSIISSNI LILVGQTQTH NKSI
CTTTTA FLHFFFLASF
1041  CWVLTEAWQS YMAVTGKIRT RLIRKRFLCL GWGLPALVVA TSVGFTRTKG YGTDHYCWLS LEGG
LLYAFV GPAAAVVLVN
1121  MVIGILVFNK LVSRDGILDK KLKHRAGQMS EPHSGLTLKC AKCGVVSTTA LSATTASNAM ASLW
SSCVVL PLLALTWMSA
1201  VLAMTDKRSI LFQILFAVFD SLQGFVIVMV HCILRREVQD AFRCRLRNCQ DPINADSSSS FPNG
HAQIMT DFEKDVDIAC
1281  RSVLHKDIGP CRAATITGTL SRISLNDDEE EKGTNPEGLS YSTLPGNVIS KVIIQQPTGL HMPM
SMNELS NPCLKKENTE
1361  LRRTVYLCTD DNLRGADMDI VHPQERMMES DYIVMPRSSV STQPSMKEES KMNIGMETLP HERL
LHYKVN PEFNMNPPVM
1441  DQFNMNLDQH LAPQEHMQNL PFEPRTAVKN FMASELDDNV GLSRSETGST ISMSSLERRK SRYS
DLDFEK VMHTRKRHME
1521  LFQELNQKFQ TLDRFRDIPN TSSMENPAPN KNPWDTFKPP SEYQHYTTIN VLDTEAKDTL ELRP
AEWEKC LNLPLDVQEG
1601  DFQTEVGNQK ELRRHSQIHH DLGSLTFPPG QCDFPLSGPF LCQHQCFLMA ISHWSGWRET AKCT
VLTTPC CKYPWNGFVR
1681  SLINLKQGFH VVTASCGLVF RKLHHEAQCI YLCSFSLQSV WPLLHFLLYN IKAKFLSLNE CLLS
YILHCF KCNKVIISLL
1761  YEYISHLYYC SFLKALSSFL CCSCVNINVV WCKTFVYISV FCILFWICTE IYIFIKFSFS I

Try to relate it to the first sequence in the 7tm_2 multiple alignment using dot plots. Is there any evidence that they are related?

Q9YHC6 = seqs{1};
seqdotplot(Q9YHC6,NM175642,7,2)
ylabel(names{1}, 'interpreter', 'none'); xlabel('NM 175642');
set(gca,'Position',[0.07,0.02,0.90,0.70]);
Warning: Match matrix has more points than available screen pixels.
         Scaling image by factors of 2 in X and 3 in Y

Dot plots only take exact matches into account. The Smith-Waterman algorithm can make use of scoring matrices. Scoring matrices can capture the probability of substitution of symbols. The sequences in this example are known to be only distantly related, so BLOSUM30 is a good choice for the scoring matrix.

sc_aa_affine = swalign(NM175642,Q9YHC6,'ScoringMatrix','blosum30',...
                       'gapopen',5,'extendgap',3,'showscore',true)
sc_aa_affine =

   82.2000

Is either of these two examples enough evidence to affirm that these sequences are related? One way to test this is to randomly create a fake sequence with the same distribution of amino acids and see how it aligns to the family. Notice that the scores from the local alignment are not significantly lower than those for the real protein (75.2 vs 82.2).

rand('state',0);  % initialize the state of the random number generator
fakeSeq = randseq(1000,'FROMSTRUCTURE',aacount(Q9YHC6));
sc_fk_affine = swalign(fakeSeq,Q9YHC6,'ScoringMatrix','blosum30',...
                       'gapopen',5,'extendgap',3,'showscore',true)
sc_fk_affine =

   75.2000

Align both sequences to the family using the profile HMM and observe the difference, now the score of aligning the target sequence to the family profile is significantly larger than aligning the fake sequence.

sc_aa_hmm = hmmprofalign(hmm_7tm,NM175642)
sc_fk_hmm = hmmprofalign(hmm_7tm,fakeSeq)
sc_aa_hmm =

  336.8018


sc_fk_hmm =

 -168.2931

Exploring Profile HMM Alignment Options

You can compare the recent alignments graphically using the 'showscore' option to the hmmprofalign function.

Display NM_175642 aligned to the 7tm_2 family.

aa = NM175642;
hmmprofalign(hmm_7tm,aa,'showscore',true);
title('log-odds score for best path: NM 175642');

Display the "fake" sequence aligned to the 7tm_2 family.

hmmprofalign(hmm_7tm,fakeSeq,'showscore',true);
title('log-odds score for best path: fake sequence');

Display NM_175642 globally aligned to the 7tm_2 family.

[sc_aa_hmm,align,ptrs] = hmmprofalign(hmm_7tm,aa);
seq = aa(min(ptrs):max(ptrs));
hmmprofalign(hmm_7tm,seq,'showscore',true);
title('log-odds score for best path: NM 175642 aligned globally');

Align tandemly repeated domains.

repeats = randseq(1000,'FROMSTRUCTURE',aacount(aa)); %artificial example
repeats(201:200+length(aa(876:1154))) = aa(876:1154);
repeats(501:500+length(aa(876:1154))) = aa(876:1154);
repeats(701:700+length(aa(876:1154))) = aa(876:1154);
hmmprofalign(hmm_7tm,repeats,'showscore',true);
title('log-odds score for best path: NM 175642 tandem repeats');

Searching for Fragment Domains

There are two options for searching for fragment domains: (1) download the respective profile HMM for fragment alignments from the PFAM database,

hmm_7tm_f = gethmmprof(2,'mode','fs')
Warning: Consensus annotation not implemented in this parser,
consensus annotation characters will be ignored.

hmm_7tm_f =

                   Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.16'
       ModelDescription: '7 transmembrane receptor (Secretin family)'
            ModelLength: 293
               Alphabet: 'AA'
          MatchEmission: [293x20 double]
         InsertEmission: [293x20 double]
           NullEmission: [1x20 double]
                 BeginX: [294x1 double]
                 MatchX: [292x4 double]
                InsertX: [292x2 double]
                DeleteX: [292x2 double]
        FlankingInsertX: [2x2 double]
                  LoopX: [2x2 double]
                  NullX: [2x1 double]

or (2) manually activate the B->M and M->E transition probabilities:

hmm_7tm_f = hmm_7tm;
hmm_7tm_f.BeginX(3:end)=.002;
hmm_7tm_f.MatchX(1:end-1,4)=.002;

Create a random sequence, or fragment model, with a small insertion of the NM_175642 protein:

fragment = randseq(1000,'FROMSTRUCTURE',aacount(aa));
fragment(501:550) = aa(901:950);

Try aligning both models, the global and fragment model:

hmmprofalign(hmm_7tm,fragment,'showscore',true);
title('log-odds score for best path: PF00002 global ');
hmmprofalign(hmm_7tm_f,fragment,'showscore',true);
title('log-odds score for best path: PF00002 fragment domains');

Exploring the Profile HMMs

The function showhmmprof is an interactive tool to explore the profile HMM. Try right and left mouse clicks over the model figures. There are three plots for each model: (1) the symbol emission probabilities in the Match states, (2) the symbol emission probabilities in the Insert states, and (3) the Transition probabilities.

showhmmprof(hmm_7tm,'scale','logodds')

An alternative method to explore a profile HMM is by creating a sequence logo from the multiple alignment. A sequence logo displays the frequency of bases found at each position within a given region, usually for a binding site. Using the hmm_7tm sequences, consider the portion of the Parathyroid hormone/parathyroid hormone-related peptide receptor (precursor) found at the n-terminus of the PTRR_Human sequence. seqlogo allows a quick visual comparison of how well this region is conserved across the 7tm family.

seqlogo(str,'startat',1,'endat',20,'alphabet','AA')

Profile Estimation

Finally, profile HMMs can also be estimated from a multiple alignment, as new sequences related to the family are found, it is possible to re-estimate the model parameters.

hmm_7tm_new = hmmprofestimate(hmm_7tm,str)
hmm_7tm_new =

                   Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.15'
       ModelDescription: '7 transmembrane receptor (Secretin family)'
            ModelLength: 293
               Alphabet: 'AA'
          MatchEmission: [293x20 double]
         InsertEmission: [293x20 double]
           NullEmission: [1x20 double]
                 BeginX: [294x1 double]
                 MatchX: [292x4 double]
                InsertX: [292x2 double]
                DeleteX: [292x2 double]
        FlankingInsertX: [2x2 double]
                  LoopX: [2x2 double]
                  NullX: [2x1 double]

In case your sequences are not pre-aligned, you can also utilize multialign before estimating a new HMM profile. It is possible to refine the HMM profile by re-aligning the sequences to the model and re-estimating the model iteratively until you converge to a locally optimal model.

aligned_seqs  = multialign(seqs);
hmm_7tm_ma = hmmprofestimate(hmmprofstruct(270),aligned_seqs)
showhmmprof(hmm_7tm_ma,'scale','logodds')
close; close; % close insertion emission prob. and transition prob.
hmm_7tm_ma =

        ModelLength: 270
           Alphabet: 'AA'
      MatchEmission: [270x20 double]
     InsertEmission: [270x20 double]
       NullEmission: [1x20 double]
             BeginX: [271x1 double]
             MatchX: [269x4 double]
            InsertX: [269x2 double]
            DeleteX: [269x2 double]
    FlankingInsertX: [2x2 double]
              LoopX: [2x2 double]
              NullX: [2x1 double]

Align all sequences to the new model and show them in the Help Browser.

fprintf('Aligning sequences ')
scores = zeros(numel(seqs),1);
aligned_seqs = cell(numel(seqs),1);
for sn=1:numel(seqs)
    fprintf('.')
    [scores(sn),aligned_seqs{sn}]=hmmprofalign(hmm_7tm_ma,seqs{sn}
);
end
fprintf('\n')

hmmprofmerge(aligned_seqs,names,scores)
str = hmmprofmerge(aligned_seqs);
str(1:10,1:80)
Aligning sequences ................................

ans =

FGAIKTGYTIGHSLSLISLTAAMIILCIF.RKLHCTRNYIHMHLFMSFIMRAIAVFIKDIVLFES--.GESDHCH.
....
YNTVKTGYTIGYSLSLASLLVAMAILSLF.RKLHCTRNYIHMHLFMSFILRATAVFIKDMALFNS--.GEIDHCS.
....
FRSVKIGYTIGHSVSLISLTTAIVILCMS.RKLHCTRNYIHMHLFVSFILKAIAVFVKDAVLYDVIQ.-ESDNCS.
....
YILVKAIYTLGYSVSLMSLATGSIILCLF.RKLHCTRNYIHLNLFLSFILRAISVLVKDDVLYSS--.SGTLHCP.
....
YLSVKALYTVGYSTSLVTLTTAMVILCRF.RKLHCTRNFIHMNLFVSFMLRAISVFIKDWILYAE--.QDSSHCF.
....
LLKLKVMYTVGYSSSLVMLLVALGILCAF.RRLHCTRNYIHMHLFLSFILRALSNFIKDAVLFSS--.DDAIHCD.
....
LSTLKQLYTAGYATSLISLITAVIIFTCF.RKFHCTRNYIHINLFVSFILRATAVFIKDAVLFSD--.ETQNHCL.
....
FSTVKIIYTTGHSISIVALCVAIAILVAL.RRLHCPRNYIHTQLFATFILKASAVFLKDAAIFQG--.DSTDHCS.
....
FERLYVMYTVGYSISFGSLAVAILIIGYF.RRLHCTRNYIHMHLFVSFMLRATSIFVKDRVVHAHIGvKELESLI.
....
FDRLGMIYTVGYSVSLASLTVAVLILAYF.RRLHCTRNYIHMHLFLSFMLRAVSIFVKDAVLYSGATlDEAERLT.
....

Suggest an enhancement for this demo.

Contact sales
Free technical kit
Trial software
E-mail this page

Get Pricing and
Licensing Options