Build Pattern Expressions
Since R2020b
Patterns are a tool to aid in searching for and modifying text. Similar to regular expressions, a pattern defines rules for matching text. Patterns can be used with text-searching functions like contains, matches, and extract to specify which portions of text these functions act on. You can build a pattern expression in a way similar to how you would build a mathematical expression, using pattern functions, operators, and literal text. Because building pattern expressions is open ended, patterns can become quite complicated. Building patterns in steps and using functions like maskedPattern and namedPattern can help organize complicated patterns.
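As a minimal sketch of how one pattern behaves across these functions (the sample text is made up for illustration): contains reports whether any part of the text matches, matches requires the entire text to match, and extract returns the matching portions.

```matlab
% Illustrative text; not part of the original examples.
txt = "Order 66 shipped";
pat = digitsPattern;               % one or more consecutive digits

hasDigits   = contains(txt,pat);   % true: "66" appears somewhere in txt
isAllDigits = matches(txt,pat);    % false: the whole text is not digits
found       = extract(txt,pat);    % "66"
```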
Building Simple Patterns
The simplest pattern is built from a single pattern function. For example, lettersPattern matches any letter characters. There are many pattern functions for matching different types of characters and other features of text. A list of these functions can be found on the pattern reference page.
txt = "abc123def";
pat = lettersPattern;
extract(txt,pat)
ans = 2x1 string
"abc"
"def"
Patterns combine with other patterns and literal text by using the plus (+) operator. This operator appends patterns and text together in the order they are defined in the pattern expression. The combined patterns only match text in the same order. In this example, "YYYY/MM/DD" is not a match because the pattern requires the four-letter group to come last.
txt = "Dates can be expressed as MM/DD/YYYY, DD/MM/YYYY, or YYYY/MM/DD";
pat = lettersPattern(2) + "/" + lettersPattern(2) + "/" + lettersPattern(4);
extract(txt,pat)
ans = 2x1 string
"MM/DD/YYYY"
"DD/MM/YYYY"
Patterns used with the or (|) operator specify that only one of the two specified patterns needs to match a section of text. If neither pattern is able to match, then the pattern expression fails to match.
txt = "123abc";
pat = lettersPattern|digitsPattern;
extract(txt,pat)
ans = 2x1 string
"123"
"abc"
Some pattern functions take patterns as their input and modify them in some way. For example, optionalPattern makes a specified pattern match if possible, but the pattern is not required for a successful match.
txt = ["123abc" "abc"];
pat = optionalPattern(digitsPattern) + lettersPattern;
extract(txt,pat)
ans = 1x2 string
"123abc" "abc"
Boundary Patterns
Boundary patterns are a special type of pattern that do not match characters but rather match the boundaries between a designated character type and other characters or the start or end of that piece of text. For example, digitBoundary matches the boundaries between digit characters and nondigit characters and between digit characters and the start or end of the text. It does not match digit characters themselves. Boundary patterns are useful as delimiters for functions like split.
txt = "123abc";
pat = digitBoundary;
split(txt,pat)
ans = 3x1 string
""
"123"
"abc"
Boundary patterns are special amongst patterns because they can be negated using the not (~) operator. When negated in this way, boundary patterns match before or after characters that did not satisfy the requirements above. For example, ~digitBoundary matches the boundary between:
characters that are both digits
characters that are both nondigits
a nondigit character and the start or end of a piece of text
Use replace to mark the locations matched by ~digitBoundary with a "|" character.
txt = "123abc";
pat = ~digitBoundary;
replace(txt,pat,"|")
ans = "1|2|3a|b|c|"
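To make the contrast concrete, the following sketch marks the locations matched by digitBoundary and by ~digitBoundary in the same text. The variable names are illustrative only; the first result is inferred from the split example above, where the text divided at the same two locations.

```matlab
txt = "123abc";

% digitBoundary matches at the start of the text (before "1") and
% between "3" and "a", the same locations where split divided the text.
atBoundaries = replace(txt,digitBoundary,"|");      % "|123|abc"

% ~digitBoundary matches the remaining positions between characters,
% plus the end of the text (after the nondigit "c").
notAtBoundaries = replace(txt,~digitBoundary,"|");  % "1|2|3a|b|c|"
```

Taken together, the two patterns cover every position in the text exactly once.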
Building Complicated Patterns in Steps
Sometimes a simple pattern is not sufficient to solve a problem and a more complicated pattern is needed. As a pattern expression grows it can become difficult to understand what it is matching. One way to simplify building a complicated pattern is building each part of the pattern separately and then combining the parts together into a single pattern expression.
For instance, email addresses use the form local_part@domain.TLD. Each of the three identifiers — local_part, domain, and TLD — must be a combination of digits, letters and underscore characters. To build the full pattern, start by defining a pattern for the identifiers. Build a pattern that matches one letter or digit character or one underscore character.
identCharacters = alphanumericsPattern(1) | "_";
Now, use asManyOfPattern to match one or more consecutive instances of identCharacters.
identifier = asManyOfPattern(identCharacters,1);
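A quick check with made-up test strings suggests how identifier behaves: it needs at least one character from identCharacters and rejects anything else, such as a period.

```matlab
identCharacters = alphanumericsPattern(1) | "_";
identifier = asManyOfPattern(identCharacters,1);

% Hypothetical test strings chosen for illustration.
candidates = ["user_1" "a.b" ""];
isIdentifier = matches(candidates,identifier);   % [true false false]
```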
Next, build a pattern that matches an email containing multiple identifiers.
emailPattern = identifier + "@" + identifier + "." + identifier;
Test the pattern by seeing how well it matches the following example emails.
exampleEmails = ["janedoe@mathworks.com"
    "abe.lincoln@whitehouse.gov"
    "alberteinstein@physics.university.edu"];
matches(exampleEmails,emailPattern)
ans = 3x1 logical array
1
0
0
The pattern fails to match several of the example emails even though all the emails are valid. Both the local_part and domain can be made of a series of identifiers that are separated by periods. Use the identifier pattern to build a pattern that is capable of matching a series of identifiers. asManyOfPattern matches as many consecutive appearances of the specified pattern as possible, but if there are none the rest of the pattern is still able to match successfully.
identifierSeries = asManyOfPattern(identifier + ".") + identifier;
Use this pattern to build a new emailPattern that can match all of the example emails.
emailPattern = identifierSeries + "@" + identifierSeries + "." + identifier;
matches(exampleEmails,emailPattern)
ans = 3x1 logical array
1
1
1
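It can also be worth confirming that the pattern rejects text that is not shaped like an email address. The sketch below rebuilds the pattern so that it runs on its own; the invalid inputs are made up for illustration.

```matlab
identifier = asManyOfPattern(alphanumericsPattern(1) | "_",1);
identifierSeries = asManyOfPattern(identifier + ".") + identifier;
emailPattern = identifierSeries + "@" + identifierSeries + "." + identifier;

% Hypothetical inputs: the first has no "@", the second has no ".TLD" part.
notEmails = ["no_at_sign.com" "a@b"];
isEmail = matches(notEmails,emailPattern);   % [false false]
```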
Organizing Pattern Display
Complex patterns can sometimes be difficult to read and interpret, especially by those you share them with who are unfamiliar with the pattern's structure. For example, when displayed, emailPattern is long and difficult to read.
emailPattern
emailPattern = pattern
Matching:
asManyOfPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1) + ".") + asManyOfPattern(alphanumericsPattern(1) | "_",1) + "@" + asManyOfPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1) + ".") + asManyOfPattern(alphanumericsPattern(1) | "_",1) + "." + asManyOfPattern(alphanumericsPattern(1) | "_",1)
Part of the issue with the display is that there are many repetitions of the identifier pattern. If the exact details of this pattern are not important to users of the pattern, then the display of the identifier pattern can be concealed using maskedPattern. This function creates a new pattern where the display of identifier is masked and the variable name, "identifier", is displayed instead. Alternatively, you can specify a different name to be displayed. The details of patterns that are masked in this way can be accessed by clicking "Show all details" in the displayed pattern.
identifier = maskedPattern(identifier);
identifierSeries = asManyOfPattern(identifier + ".") + identifier
identifierSeries = pattern
Matching:
asManyOfPattern(identifier + ".") + identifier
Use details to show more information
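maskedPattern also accepts a second argument giving the name to display. In the minimal sketch below, the mask name "ident" is an arbitrary choice; the comparisons suggest that masking changes only the display and not what the pattern matches.

```matlab
identifier = asManyOfPattern(alphanumericsPattern(1) | "_",1);

% "ident" is an illustrative display name, not one used elsewhere in this example.
maskedIdentifier = maskedPattern(identifier,"ident");

% Both patterns accept and reject the same text.
sameForValid   = matches("user_1",identifier) == matches("user_1",maskedIdentifier);
sameForInvalid = matches("a.b",identifier) == matches("a.b",maskedIdentifier);
```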
Patterns can be further organized using the namedPattern function. namedPattern designates a pattern as a named pattern that changes how the pattern is displayed when combined with other patterns. Email addresses have several important portions, local_part@domain.TLD, which each have their own matching rules. Create a named pattern for each section.
localPart = namedPattern(identifierSeries,"local_part");
Named patterns can be nested to further delineate parts of a pattern. To nest a named pattern, build a pattern using named patterns and then designate that pattern as a named pattern. For example, domain.TLD can be divided into the domain, subdomains, and the top level domain (TLD). Create named patterns for each part of domain.TLD.
subdomain = namedPattern(identifierSeries,"subdomain");
domainName = namedPattern(identifier,"domainName");
tld = namedPattern(identifier,"TLD");
Nest the named patterns for the components of domain underneath a single named pattern domain.
domain = optionalPattern(subdomain + ".") + ...
    domainName + "." + ...
    tld;
domain = namedPattern(domain);
Combine the patterns together into a single named pattern, emailPattern. In the display of emailPattern you can see each named pattern and what it matches, as well as the information on any nested named patterns.
emailPattern = localPart + "@" + domain
emailPattern = pattern
Matching:
local_part + "@" + domain
Using named patterns:
local_part : asManyOfPattern(identifier + ".") + identifier
domain : optionalPattern(subdomain + ".") + domainName + "." + TLD
subdomain : asManyOfPattern(identifier + ".") + identifier
domainName: identifier
TLD : identifier
Use details to show more information
You can access named patterns and nested named patterns by dot-indexing into a pattern. For example, you can access the nested named pattern subdomain by dot-indexing from emailPattern into domain and then dot-indexing again into subdomain.
emailPattern.domain.subdomain
ans = pattern
Matching:
asManyOfPattern(identifier + ".") + identifier
Use details to show more information
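Because dot-indexing returns a pattern, the result should be usable with functions such as matches. The sketch below rebuilds the named patterns from this section so that it runs on its own; the test strings are made up for illustration.

```matlab
identifier = maskedPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1),"identifier");
identifierSeries = asManyOfPattern(identifier + ".") + identifier;
localPart  = namedPattern(identifierSeries,"local_part");
subdomain  = namedPattern(identifierSeries,"subdomain");
domainName = namedPattern(identifier,"domainName");
tld        = namedPattern(identifier,"TLD");
domain = namedPattern(optionalPattern(subdomain + ".") + domainName + "." + tld,"domain");
emailPattern = localPart + "@" + domain;

% Pull out nested named patterns and test them on their own.
subdomainOK = matches("physics.dept",emailPattern.domain.subdomain);   % true
tldOK       = matches("edu",emailPattern.domain.TLD);                  % true
```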
Dot-assignment can be used to change named patterns without needing to rewrite the rest of the pattern expression.
emailPattern.domain = "mathworks.com"
emailPattern = pattern
Matching:
local_part + "@" + domain
Using named patterns:
local_part: asManyOfPattern(identifier + ".") + identifier
domain : "mathworks.com"
Use details to show more information
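After the assignment, the pattern should match only addresses at the new domain. The following self-contained sketch uses a simplified domain pattern (the nested subdomain, domainName, and TLD parts are omitted for brevity) and compares matching before and after the assignment.

```matlab
identifier = asManyOfPattern(alphanumericsPattern(1) | "_",1);
identifierSeries = asManyOfPattern(identifier + ".") + identifier;
localPart = namedPattern(identifierSeries,"local_part");
% Simplified stand-in for the nested domain pattern built earlier.
domain = namedPattern(identifierSeries + "." + identifier,"domain");
emailPattern = localPart + "@" + domain;

emails = ["janedoe@mathworks.com" "abe.lincoln@whitehouse.gov"];
beforeAssign = matches(emails,emailPattern);   % [true true]

% Replace the named pattern "domain" with literal text.
emailPattern.domain = "mathworks.com";
afterAssign = matches(emails,emailPattern);    % [true false]
```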
Copyright 2020 The MathWorks, Inc.
See Also
pattern | string | regexp | contains | replace | extract