trainHMMEntityModel

Train HMM-based model for named entity recognition (NER)

Since R2023a

    Description

    Use the trainHMMEntityModel function to train a model for named entity recognition (NER) that is based on a hidden Markov model (HMM).

    The addEntityDetails function automatically detects person names, locations, organizations, and other named entities in text. If you want to train a custom model that predicts different tags, or train a model using your own data, then you can use the trainHMMEntityModel function.

    mdl = trainHMMEntityModel(tbl) trains an HMM-based model for named entity recognition using the token and entity information in the specified table.

    mdl = trainHMMEntityModel(tokens,entities) trains the model using the specified tokens and corresponding labels.

    mdl = trainHMMEntityModel(___,NonEntity=name) also specifies the class name to assign to tokens that are not named entities.

    Examples

    Read the example entity data from exampleEntities.csv into a table.

    tbl = readtable("exampleEntities.csv",TextType="string");

    View the first few rows of the table. The table has two columns, Token and Entity, that contain the tokens and their entity labels, respectively.

    head(tbl)
                 Token                 Entity   
        ________________________    ____________
    
        "Analyze"                   "non-entity"
        "text"                      "non-entity"
        "in"                        "non-entity"
        "MATLAB"                    "product"   
        "using"                     "non-entity"
        "Text Analytics Toolbox"    "product"   
        "."                         "non-entity"
        "Engineers"                 "non-entity"
    

    Train an HMM-based NER model using the trainHMMEntityModel function.

    mdl = trainHMMEntityModel(tbl)
    mdl = 
      hmmEntityModel with properties:
    
        Entities: [3×1 categorical]
    
    

    View the entities of the model.

    mdl.Entities
    ans = 3×1 categorical
         organization 
         product 
         non-entity 
    
    

    To add entity details to documents using the trained hmmEntityModel object, use the addEntityDetails function and set the Model option to the trained NER model.

    Create a tokenized document containing text data.

    str = "MathWorks develops MATLAB and Simulink.";
    document = tokenizedDocument(str);

    Add entity details using the trained hmmEntityModel object and view the updated token details using the tokenDetails function. The Entity column contains the predicted entities.

    document = addEntityDetails(document,Model=mdl);
    details = tokenDetails(document)
    details=6×8 table
           Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language      PartOfSpeech          Entity   
        ___________    ______________    ______________    __________    ___________    ________    _________________    ____________
    
        "MathWorks"          1                 1               1         letters           en       proper-noun          organization
        "develops"           1                 1               1         letters           en       verb                 non-entity  
        "MATLAB"             1                 1               1         letters           en       proper-noun          product     
        "and"                1                 1               1         letters           en       coord-conjunction    non-entity  
        "Simulink"           1                 1               1         letters           en       proper-noun          product     
        "."                  1                 1               1         punctuation       en       punctuation          non-entity  
    
    

    Extract the tokens that are named entities.

    idx = details.Entity ~= "non-entity";
    details(idx,["Token" "Entity"])
    ans=3×2 table
           Token          Entity   
        ___________    ____________
    
        "MathWorks"    organization
        "MATLAB"       product     
        "Simulink"     product     
    
    

    Input Arguments

    Table of tokens and corresponding labels, specified as a table with these variables:

    • Token — Tokens, specified as string scalars or 1-by-1 cell arrays containing a character vector.

    • Entity — Entity labels, specified as categorical scalars, string scalars, or 1-by-1 cell arrays containing a character vector.

    You must specify pairs of tokens and entities in context. The algorithm does not support lists of independent token-entity pairs. For example, you can specify this table.

    Token                    Entity
    "William Shakespeare"    person
    "was"                    non-entity
    "born"                   non-entity
    "in"                     non-entity
    "Stratford-upon-Avon"    location
    "."                      non-entity

    To specify entities that span multiple tokens, use one of these approaches:

    • Whitespace-delimited tokens — Specify each multi-token entity as a single token with a single entity value. For example, specify the token "William Shakespeare" and the entity person.

    • IOB2 labeling scheme — For each entity, use the prefix "B-" (beginning) to denote the first token in each entity, and use the prefix "I-" (inside) to denote subsequent tokens in multi-token entities. Specify which entity corresponds to the "O" (outside) tag using the name argument. For example, specify the successive tokens "William" and "Shakespeare", and the corresponding entities B-person and I-person. For more information, see Inside, Outside, Beginning (IOB) Labeling Schemes.

    If you use the IOB2 labeling scheme, then all tokens in the input must use this scheme.

    Data Types: table
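
    As a minimal sketch of the IOB2 approach described above, the following code builds a small training table by hand and passes it to trainHMMEntityModel. The sentence and the choice of "O" as the outside label are illustrative assumptions. Whether a single sentence gives a useful model is not something this sketch tests, and in practice you would supply many labeled sentences.

```matlab
% Tokens of one sentence, in reading order (illustrative data).
Token = ["William"; "Shakespeare"; "was"; "born"; "in"; "Stratford-upon-Avon"; "."];

% IOB2 labels: "B-" marks the first token of an entity, "I-" marks
% subsequent tokens, and "O" (assumed here) marks non-entity tokens.
Entity = ["B-person"; "I-person"; "O"; "O"; "O"; "B-location"; "O"];

% The table variables must be named Token and Entity.
tbl = table(Token,Entity);

% Tell the function which label plays the role of the outside tag.
mdl = trainHMMEntityModel(tbl,NonEntity="O");
```

    Because every token in the table carries an IOB2 label, the input satisfies the requirement that the scheme is used consistently.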

    List of tokens, specified as a tokenizedDocument scalar, string array, or a cell array of character vectors.

    You must specify tokens in context. The algorithm does not support lists of independent token-entity pairs. For example, you can specify the array of tokens ["William Shakespeare" "was" "born" "in" "Stratford-upon-Avon" "."].

    List of named entities, specified as a categorical array, string array, or a cell array of character vectors.

    To specify entities that span multiple tokens, use one of these approaches:

    • Whitespace-delimited tokens — Specify each multi-token entity as a single token with a single entity value. For example, specify the token "William Shakespeare" and the entity person.

    • IOB2 labeling scheme — For each entity, use the prefix "B-" (beginning) to denote the first token in each entity, and use the prefix "I-" (inside) to denote subsequent tokens in multi-token entities. Specify which entity corresponds to the "O" (outside) tag using the name argument. For example, specify the successive tokens "William" and "Shakespeare", and the corresponding entities B-person and I-person. For more information, see Inside, Outside, Beginning (IOB) Labeling Schemes.

    If you use the IOB2 labeling scheme, then all tokens in the input must use this scheme.

    The software automatically removes leading and trailing spaces from the entities. The entities must contain at least one nonwhitespace character.

    Data Types: char | string | cell | categorical
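
    The two-argument syntax can be sketched as follows, using the whitespace-delimited approach and the same non-entity label that appears in the example data earlier on this page. The data is illustrative, and the sketch assumes that "non-entity" is treated as the non-entity class when NonEntity is not specified, as in the earlier example.

```matlab
% Tokens in context, with the multi-token name kept as a single token.
tokens = ["William Shakespeare" "was" "born" "in" "Stratford-upon-Avon" "."];

% One label per token, in the same order as the tokens.
entities = categorical(["person" "non-entity" "non-entity" ...
    "non-entity" "location" "non-entity"]);

% The two inputs must describe the same number of tokens.
assert(numel(tokens) == numel(entities))

mdl = trainHMMEntityModel(tokens,entities);

% List the entity classes that the model learned.
mdl.Entities
```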

    Name to assign to tokens that are not named entities, specified as a string scalar, character vector, or a cell array of character vectors.

    The software automatically removes leading and trailing spaces from the name. The name must contain at least one nonwhitespace character.

    Data Types: char | string | cell

    Output Arguments

    NER model, returned as an hmmEntityModel object.

    Algorithms

    Inside, Outside, Beginning (IOB) Labeling Schemes

    The inside, outside (IO) labeling scheme tags each token either with "O" or with an entity tag that has the prefix "I-". The tag "O" (outside) denotes non-entities. For each token in an entity, the tag is prefixed with "I-" (inside), which denotes that the token is part of an entity.

    A limitation of the IO labeling scheme is that it does not specify entity boundaries between adjacent entities of the same type. The inside, outside, beginning (IOB) labeling scheme, also known as the beginning, inside, outside (BIO) labeling scheme, addresses this limitation by introducing a "beginning" prefix.

    There are two variants of the IOB labeling scheme: IOB1 and IOB2.

    IOB2 Labeling Scheme

    For each token in an entity, the tag is prefixed with one of these values:

    • "B-" (beginning) — The token is a single token entity or the first token of a multi-token entity.

    • "I-" (inside) — The token is a subsequent token of a multi-token entity.

    For a list of entity tags Entity, the IOB2 labeling scheme helps identify boundaries between adjacent entities of the same type by using this logic:

    • If Entity(i) has prefix "B-" and Entity(i+1) is "O" or has prefix "B-", then Token(i) is a single-token entity.

    • If Entity(i) has prefix "B-", Entity(i+1), ..., Entity(N) have prefix "I-", and Entity(N+1) is "O" or has prefix "B-", then the phrase Token(i:N) is a multi-token entity.
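
    The boundary logic above can be sketched in plain MATLAB code. This is an illustration of the rules, not the implementation that trainHMMEntityModel uses. The tokens and tags are made-up data, and the sketch treats the end of the list like an "O" tag and does not check that an "I-" tag has the same entity type as the preceding "B-" tag.

```matlab
% Illustrative tokens and IOB2 tags. "Anne" and "Mary" are adjacent
% entities of the same type, which IOB2 can keep separate.
tokens = ["William" "Shakespeare" "met" "Anne" "Mary" "."];
tags   = ["B-person" "I-person" "O" "B-person" "B-person" "O"];

phrases = strings(0,1);   % text of each recovered entity
types   = strings(0,1);   % entity type with the prefix removed

i = 1;
while i <= numel(tags)
    if startsWith(tags(i),"B-")
        % Extend the span while the next tag has the prefix "I-".
        N = i;
        while N < numel(tags) && startsWith(tags(N+1),"I-")
            N = N + 1;
        end
        phrases(end+1,1) = join(tokens(i:N)," ");
        types(end+1,1)   = extractAfter(tags(i),2);
        i = N + 1;
    else
        % "O" tag, or a stray "I-" tag that this sketch ignores.
        i = i + 1;
    end
end

entityTable = table(phrases,types)
```

    For this input, the loop recovers three person entities: "William Shakespeare", "Anne", and "Mary".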

    IOB1 Labeling Scheme

    The IOB1 labeling scheme does not use the prefix "B-" when an entity token follows an "O" tag. In this case, an entity token that is the first token in a list or follows a non-entity token is implicitly the first token of an entity. That is, if Entity(i) has prefix "I-" and i is equal to 1 or Entity(i-1) is "O", then Token(i) is a single-token entity or the first token of a multi-token entity.
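
    Because tagged data sometimes arrives in IOB1 form, a conversion to IOB2 can be sketched by applying the rule above. The tags are made-up data and "O" is assumed to be the outside tag. The check for a change of entity type is an additional assumption that goes beyond the rule stated here.

```matlab
% Illustrative IOB1 tags. "B-" appears only where an entity directly
% follows another entity of the same type.
iob1 = ["I-person" "I-person" "O" "I-location" "B-location" "O"];
iob2 = iob1;
outside = "O";   % assumed outside tag

for i = 1:numel(iob1)
    if startsWith(iob1(i),"I-")
        isFirst      = i == 1;
        afterOutside = ~isFirst && iob1(i-1) == outside;
        % Assumption: a change of entity type also starts a new entity.
        typeChanged  = ~isFirst && ~afterOutside && ...
            extractAfter(iob1(i-1),2) ~= extractAfter(iob1(i),2);
        if isFirst || afterOutside || typeChanged
            % Make the entity start explicit, as IOB2 requires.
            iob2(i) = "B-" + extractAfter(iob1(i),2);
        end
    end
end

iob2
```

    For this input, the first and fourth tags change to "B-person" and "B-location", and the other tags stay as they are.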

    Version History

    Introduced in R2023a