trainHMMEntityModel
Syntax
Description
Use the trainHMMEntityModel function to train a model for named entity recognition (NER) that is based on a hidden Markov model (HMM).
The addEntityDetails function automatically detects person names, locations, organizations, and other named entities in text. If you want to train a custom model that predicts different tags, or train a model using your own data, then you can use the trainHMMEntityModel function.
Examples
Train HMM-based NER Model
Read the example entity data from exampleEntities.csv
into a table.
tbl = readtable("exampleEntities.csv",TextType="string");
View the first few rows of the table. The table has two columns, Token and Entity, that contain the tokens and their entity labels, respectively.
head(tbl)
Token                       Entity
________________________    ____________
"Analyze"                   "non-entity"
"text"                      "non-entity"
"in"                        "non-entity"
"MATLAB"                    "product"
"using"                     "non-entity"
"Text Analytics Toolbox"    "product"
"."                         "non-entity"
"Engineers"                 "non-entity"
Train an HMM-based NER model using the trainHMMEntityModel
function.
mdl = trainHMMEntityModel(tbl)
mdl =
  hmmEntityModel with properties:
    Entities: [3×1 categorical]
View the entities of the model.
mdl.Entities
ans = 3×1 categorical
organization
product
non-entity
To add entity details to documents using the trained hmmEntityModel
object, use the addEntityDetails
function and set the Model
option to the trained NER model.
Create a tokenized document containing text data.
str = "MathWorks develops MATLAB and Simulink.";
document = tokenizedDocument(str);
Add entity details using the trained hmmEntityModel
object and view the updated token details using the tokenDetails
function. The Entity
column contains the predicted entities.
document = addEntityDetails(document,Model=mdl);
details = tokenDetails(document)
details=6×8 table
Token DocumentNumber SentenceNumber LineNumber Type Language PartOfSpeech Entity
___________ ______________ ______________ __________ ___________ ________ _________________ ____________
"MathWorks" 1 1 1 letters en proper-noun organization
"develops" 1 1 1 letters en verb non-entity
"MATLAB" 1 1 1 letters en proper-noun product
"and" 1 1 1 letters en coord-conjunction non-entity
"Simulink" 1 1 1 letters en proper-noun product
"." 1 1 1 punctuation en punctuation non-entity
Extract the tokens that are named entities.
idx = details.Entity ~= "non-entity";
details(idx,["Token" "Entity"])
ans=3×2 table
Token Entity
___________ ____________
"MathWorks" organization
"MATLAB" product
"Simulink" product
Input Arguments
tbl
— Table of tokens and corresponding labels
table
Table of tokens and corresponding labels, specified as a table with these variables:
Token: Tokens, specified as string scalars or 1-by-1 cell arrays containing a character vector.
Entity: Entity labels, specified as categorical scalars, string scalars, or 1-by-1 cell arrays containing a character vector.
You must specify pairs of tokens and entities in context. The algorithm does not support lists of independent token-entity pairs. For example, you can specify this table.
Token | Entity |
---|---|
"William Shakespeare" | person |
"was" | non-entity |
"born" | non-entity |
"in" | non-entity |
"Stratford-upon-Avon" | location |
"." | non-entity |
To specify entities that span multiple tokens, use one of these approaches:
Whitespace delimited tokens: Specify multi-token entities as a single token with a single entity value. For example, specify the token "William Shakespeare" and the entity person.
IOB2 labeling scheme: For each entity, use the prefix "B-" (beginning) to denote the first token in each entity, and use the prefix "I-" (inside) to denote subsequent tokens in multi-token entities. Specify which entity corresponds to the "O" (outside) tag using the name argument. For example, specify the successive tokens "William" and "Shakespeare", and the corresponding entities B-person and I-person. For more information, see Inside, Outside, Beginning (IOB) Labeling Schemes.
If you use the IOB2 labeling scheme, then all tokens in the input must use this scheme.
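As a minimal sketch, the in-context table described above could be built and passed to the training function as follows. The data is the single illustrative sentence from this section, which is far too little to produce a useful model, and the call has not been run here, so no output is shown.

```matlab
% Tokens in sentence order, with one entity label per token.
% The multi-token entity "William Shakespeare" is given as a single
% whitespace-delimited token.
Token = ["William Shakespeare"; "was"; "born"; "in"; "Stratford-upon-Avon"; "."];
Entity = categorical(["person"; "non-entity"; "non-entity"; ...
    "non-entity"; "location"; "non-entity"]);

% The variable names must be Token and Entity.
tbl = table(Token,Entity);

% Train the model (requires Text Analytics Toolbox).
mdl = trainHMMEntityModel(tbl);
```

Keeping the rows in their original sentence order matters, because the model is trained on tokens in context.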
Data Types: table
tokens
— List of tokens
tokenizedDocument
scalar | string array | cell array of character vectors
List of tokens, specified as a tokenizedDocument
scalar, string array, or a cell array of character
vectors.
You must specify tokens in context. The algorithm does not support lists of independent token-entity pairs. For example, you can specify the array of tokens ["William Shakespeare" "was" "born" "in" "Stratford-upon-Avon" "."].
entities
— List of named entities
categorical array | string array | cell array of character vectors
List of named entities, specified as a categorical array, string array, or a cell array of character vectors.
To specify entities that span multiple tokens, use one of these approaches:
Whitespace delimited tokens: Specify multi-token entities as a single token with a single entity value. For example, specify the token "William Shakespeare" and the entity person.
IOB2 labeling scheme: For each entity, use the prefix "B-" (beginning) to denote the first token in each entity, and use the prefix "I-" (inside) to denote subsequent tokens in multi-token entities. Specify which entity corresponds to the "O" (outside) tag using the name argument. For example, specify the successive tokens "William" and "Shakespeare", and the corresponding entities B-person and I-person. For more information, see Inside, Outside, Beginning (IOB) Labeling Schemes.
If you use the IOB2 labeling scheme, then all tokens in the input must use this scheme.
The software automatically removes leading and trailing spaces from the entities. The entities must contain at least one nonwhitespace character.
Data Types: char | string | cell | categorical
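The tokens and entities arguments can be illustrated with a pair of parallel arrays. This sketch assumes a two-input calling form, trainHMMEntityModel(tokens,entities), inferred from the argument descriptions on this page. The sentence is illustrative only and the call has not been run here.

```matlab
% Parallel arrays: entities(k) is the label of tokens(k).
% Tokens are listed in sentence order so that the model sees them in context.
tokens = ["William Shakespeare" "was" "born" "in" "Stratford-upon-Avon" "."];
entities = ["person" "non-entity" "non-entity" "non-entity" "location" "non-entity"];

% Assumed calling form; requires Text Analytics Toolbox.
mdl = trainHMMEntityModel(tokens,entities);
```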
name
— Name to assign tokens that are not named entities
"non-entity"
(default) | string scalar | character vector | cell array containing single character vector
Name to assign tokens that are not named entities, specified as a string scalar, character vector, or a cell array containing a single character vector.
The software automatically removes leading and trailing spaces from the entities. The entities must contain at least one nonwhitespace character.
Data Types: char | string | cell
Output Arguments
mdl
— NER model
hmmEntityModel
NER model, returned as an hmmEntityModel
object.
Algorithms
Inside, Outside, Beginning (IOB) Labeling Schemes
The inside, outside (IO) labeling scheme tags each token either with "O" or with an entity label prefixed with "I-". The tag "O" (outside) denotes non-entities. For each token in an entity, the tag is prefixed with "I-" (inside), which denotes that the token is part of an entity.
A limitation of the IO labeling scheme is that it does not specify entity boundaries between adjacent entities of the same type. The inside, outside, beginning (IOB) labeling scheme, also known as the beginning, inside, outside (BIO) labeling scheme, addresses this limitation by introducing a "beginning" prefix.
There are two variants of the IOB labeling scheme: IOB1 and IOB2.
In the IOB2 labeling scheme, for each token in an entity, the tag is prefixed with one of these values:
"B-" (beginning): The token is a single token entity or the first token of a multi-token entity.
"I-" (inside): The token is a subsequent token of a multi-token entity.
For a list of entity tags Entity, the IOB labeling scheme helps identify boundaries between adjacent entities of the same type by using this logic:
If Entity(i) has prefix "B-" and Entity(i+1) is "O" or has prefix "B-", then Token(i) is a single entity.
If Entity(i) has prefix "B-", Entity(i+1), ..., Entity(N) have prefix "I-", and Entity(N+1) is "O" or has prefix "B-", then the phrase Token(i:N) is a multi-token entity.
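The boundary logic above can be sketched as a short loop in plain MATLAB. This is an illustration of the rule, not the toolbox implementation. The tokens, tags, and variable names (phrases, types) are hypothetical.

```matlab
% Illustrative tokens with IOB2 tags. Three product entities are adjacent.
Token  = ["Install" "MATLAB" "Simulink" "Text" "Analytics" "Toolbox" "."];
Entity = ["O" "B-product" "B-product" "B-product" "I-product" "I-product" "O"];

phrases = strings(0,1);   % recovered entity text
types   = strings(0,1);   % recovered entity type
i = 1;
while i <= numel(Entity)
    if startsWith(Entity(i),"B-")
        % Extend the span while the next tag has prefix "I-".
        N = i;
        while N < numel(Entity) && startsWith(Entity(N+1),"I-")
            N = N + 1;
        end
        phrases(end+1,1) = join(Token(i:N)," ");
        types(end+1,1)   = extractAfter(Entity(i),2);   % drop the "B-" prefix
        i = N + 1;
    else
        i = i + 1;
    end
end

% phrases is ["MATLAB"; "Simulink"; "Text Analytics Toolbox"]
result = table(phrases,types,VariableNames=["Phrase" "Entity"])
```

Because each entity starts with "B-", the three adjacent product entities are separated correctly, which the IO scheme alone cannot do.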
The IOB1 labeling scheme does not use the prefix "B-" when an entity token follows an "O" tag. In this case, an entity token that is the first token in a list or follows a non-entity token implies that the entity token is the first token of an entity. That is, if Entity(i) has prefix "I-" and i is equal to 1 or Entity(i-1) is "O", then Token(i) is a single token entity or the first token of a multi-token entity.
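Because the training function expects IOB2 tags when a prefix scheme is used, data labeled with IOB1 may need converting first. The following is a minimal sketch of such a conversion in plain MATLAB, using hypothetical tags. In addition to the rule stated above, it treats a change of entity type between consecutive "I-" tags as the start of a new entity. That part is an assumption based on the usual IOB1 convention and is not stated on this page.

```matlab
% Illustrative IOB1 tags (hypothetical data).
iob1 = ["I-person" "I-person" "O" "I-product" "B-product" "O"];

iob2 = iob1;
for i = 1:numel(iob1)
    if startsWith(iob1(i),"I-")
        % An "I-" tag starts an entity if it is first in the list, follows
        % an "O" tag, or (assumed) follows an entity of a different type.
        % Short-circuit evaluation avoids indexing iob1(0) and avoids
        % calling extractAfter on "O".
        startsEntity = i == 1 || iob1(i-1) == "O" || ...
            extractAfter(iob1(i-1),2) ~= extractAfter(iob1(i),2);
        if startsEntity
            iob2(i) = "B-" + extractAfter(iob1(i),2);
        end
    end
end

% iob2 is ["B-person" "I-person" "O" "B-product" "B-product" "O"]
iob2
```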
Version History
Introduced in R2023a