Replace text in words of documents using regular expression
Replace words that begin with "s", end with "e", and have at least one character between them. To match whole words, use "^" to match the start of a word and "$" to match the end of the word.
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"])

documents =
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

expression = "^s(\w+)e$";
replace = "thing";
newDocuments = regexprep(documents,expression,replace)

newDocuments =
  2x1 tokenizedDocument:

    6 tokens: an example of a short thing
    4 tokens: a second short thing
If you do not use "^" and "$", then you can match substrings of the words. Replace all vowels with "_".
expression = "[aeiou]";
replace = "_";
newDocuments = regexprep(documents,expression,replace)

newDocuments =
  2x1 tokenizedDocument:

    6 tokens: _n _x_mpl_ _f _ sh_rt s_nt_nc_
    4 tokens: _ s_c_nd sh_rt s_nt_nc_
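To see the effect of the anchors in isolation, the following sketch applies the anchored and unanchored forms of a pattern to a plain string array of words using the base MATLAB regexprep function. It assumes that the tokenizedDocument version applies the pattern to each word in the same way. The sample words are illustrative and do not come from the examples above.

```matlab
% Illustrative words. "absentee" contains a substring that starts
% with "s" and ends with "e", but the whole word does not.
words = ["short" "sentence" "absentee"];

% Anchored: the whole word must start with "s" and end with "e".
anchored = regexprep(words,"^s(\w+)e$","thing");
% anchored is ["short" "thing" "absentee"]

% Unanchored: any substring of a word can match.
unanchored = regexprep(words,"s(\w+)e","thing");
% unanchored is ["short" "thing" "abthing"]
```

Only the unanchored pattern changes "absentee", because the match "sentee" sits in the middle of the word.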
Replace variations of the word "walk" with "ascend" by capturing the letters that follow "walk" and reusing them in the replacement text with "$1".
documents = tokenizedDocument([
    "I walk"
    "they walked"
    "we are walking"])

documents =
  3x1 tokenizedDocument:

    2 tokens: I walk
    2 tokens: they walked
    3 tokens: we are walking

expression = "walk(\w*)";
replace = "ascend$1";
newDocuments = regexprep(documents,expression,replace)

newDocuments =
  3x1 tokenizedDocument:

    2 tokens: I ascend
    2 tokens: they ascended
    3 tokens: we are ascending
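The choice of "\w*" rather than "\w+" in the capture group matters here, because "\w*" also matches zero characters and so covers the bare word "walk". The sketch below illustrates the difference with the base MATLAB regexprep function on a string array of words. It assumes the tokenizedDocument version treats each word the same way.

```matlab
% The three word forms from the example, as a plain string array.
words = ["walk" "walked" "walking"];

% "\w*" allows an empty capture, so "walk" itself is replaced.
zeroOrMore = regexprep(words,"walk(\w*)","ascend$1");
% zeroOrMore is ["ascend" "ascended" "ascending"]

% "\w+" requires at least one trailing letter, so "walk" is left as is.
oneOrMore = regexprep(words,"walk(\w+)","ascend$1");
% oneOrMore is ["walk" "ascended" "ascending"]
```

In both cases "$1" inserts whatever the parentheses captured, which is the empty text when "\w*" matches nothing.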
documents— Input documents
Input documents, specified as a tokenizedDocument array.
replace— Replacement text
Replacement text, specified as a character vector, a cell array of character vectors, or a string array, as follows:
If replace is a single character vector and expression is a cell array of character vectors, then regexprep uses the same replacement text for each expression.
If replace is a cell array of N character vectors and expression is a single character vector, then regexprep attempts N matches and replacements.
If both replace and expression are cell arrays of character vectors, then they must contain the same number of elements. regexprep pairs each replace element with its corresponding element in expression.
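As a minimal sketch of two of these combinations, the following code uses the base MATLAB regexprep function on a character vector. The input text and the replacement words are illustrative, and the sketch assumes that the tokenizedDocument version follows the same pairing rules.

```matlab
% Illustrative input text.
str = 'we walk and they run';

% One replacement, several expressions: the same text is used for each.
sameText = regexprep(str,{'walk','run'},'move');
% sameText is 'we move and they move'

% Cell arrays of equal length: each expression is paired with the
% replacement at the same position.
paired = regexprep(str,{'walk','run'},{'ascend','sprint'});
% paired is 'we ascend and they sprint'
```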
The replacement text can include regular characters, special characters (such as tabs or new lines), or replacement operators, as shown in the following tables.
$0: Portion of the input text that is currently a match
$`: Portion of the input text that precedes the current match
$': Portion of the input text that follows the current match
${cmd}: Output returned when MATLAB executes the command, cmd
Any character with special meaning in regular expressions
that you want to match literally (for example, use