strmatch

Match pattern in character string

Use only in the MuPAD Notebook Interface.

This functionality does not run in MATLAB.

Syntax

strmatch(text, pattern, <Index>, <ReturnMatches>, <All>)

Description

strmatch(text, pattern) checks whether text matches the regular expression pattern.

strmatch performs regular expression matching on strings, via the ICU library. The pattern can contain wildcards forming a Perl-compatible regular expression. In these expressions, most characters represent themselves. For example, "a" matches "a". For the list of exceptions, see Algorithms.

The library stringlib provides more functions for handling strings. For details, see Operations on Strings.

Examples

Example 1

Most characters simply match themselves:

s := "Hamburg": strmatch(s, "Hamburg")

strmatch typically matches substrings:

strmatch(s, "Ham"), strmatch(s, "burg")

strmatch("Ham", "Hamburg")

delete s:

Example 2

A dot (.) is a placeholder for any character except "\n":

strmatch("abcd", "a.c"), strmatch("ab\ncd", "ab.")

To match an actual dot, use "\\.":

strmatch("abcd", "a\\.c"),
strmatch("a.cd", "a\\.c")

A dot, like all special characters, has its special role only in the second argument of strmatch:

strmatch("a.c", "abc")

With the s modifier, you can use a dot to match newlines:

strmatch("abcd", "(?s)a.c"), strmatch("ab\ncd", "(?s)ab.")

A dot matches only a single character:

strmatch("abcd", "a.d"), strmatch("abcd", "a.b")

Example 3

By default, strmatch only checks for a match and returns a Boolean value:

strmatch("aaaba", "a"), strmatch("aaaba", "c")

To return the first place where a match occurs, use Index:

strmatch("aaaba", "a", Index), 
strmatch("aaaba", "c", Index)

To return the matched substrings, use ReturnMatches. This option is helpful when you match complicated expressions.

strmatch("aaaba", "a", ReturnMatches), 
strmatch("aaaba", "c", ReturnMatches)

To find more than one match, use All:

strmatch("aaaba", "a", All), 
strmatch("aaaba", "c", All)

This expression has several matches because a dot matches any character:

strmatch("aaaba", "a.", All)

All implies ReturnMatches unless you also use Index:

strmatch("aaaba", "a", All, Index)

Combine all three options:

strmatch("aaaba", "a", All, Index, ReturnMatches)

Example 4

By default, strmatch matches substrings. To look only for matches at the beginning and end of the string, use the caret (^) and dollar ($) characters, respectively:

strmatch("abcd", "a"),
strmatch("abcd", "c"),
strmatch("abcd", "d"),
strmatch("abcd", "abcd")

strmatch("abcd", "^a"),
strmatch("abcd", "^c"),
strmatch("abcd", "^d"),
strmatch("abcd", "^abcd")

strmatch("abcd", "a$"),
strmatch("abcd", "c$"),
strmatch("abcd", "d$"),
strmatch("abcd", "abcd$")

strmatch("abcd", "^a$"),
strmatch("abcd", "^c$"),
strmatch("abcd", "^d$"),
strmatch("abcd", "^abcd$")

Using the m modifier, you can change the meaning from the beginning or end of a string to the beginning or end of a line:

s := "ab\ncd":
strmatch(s, "b$"),
strmatch(s, "(?m)b$")

Example 5

Specify alternative patterns to match by using the vertical bar (|):

strmatch("abcd", "abc|xyz")

strmatch("abcd", "a|f|j")

strmatch treats all characters between the vertical bars as one of the alternative patterns. To limit the extent of alternatives, use parentheses:

strmatch("abcd", "ab(c|xy)z"),
strmatch("abcd", "ab(c|xy)(z|d)")

When you use the ReturnMatches option, strmatch returns the substrings matched by each pair of parentheses:

strmatch("abcd", "ab(c|xy)(z|d)", ReturnMatches)

With alternatives, strmatch can find several matches:

strmatch("abracadabra", "a(b|c|d)", All)

To group alternatives without returning matches, use (?:...):

strmatch("abracadabra", "a(?:b|c|d)", All)

To match the characters "|", "(", and ")" literally, precede each of them with \\ in the pattern:

strmatch("ab(c)d", "\\((c|d)\\)", ReturnMatches)

Example 6

Use a question mark (?) to indicate that a subexpression (a single character or a group of characters in parentheses or brackets) is optional:

strmatch("abcd", "abc?d"),
strmatch("abd", "abc?d")

Use an asterisk (*) to indicate that a subexpression can be repeated an arbitrary number of times, including zero:

strmatch("abcd", "a.*d"),
strmatch("abcd", "a.*c")

Use a plus sign (+) to indicate that a subexpression can be repeated an arbitrary number of times, excluding zero:

strmatch("abcd", "a.+d"),
strmatch("abcd", "a.+b")

When you use the asterisk or the plus sign in the pattern to match, strmatch finds the first match going from left to right, and then returns the longest substring that satisfies the matching pattern:

strmatch("abracadabra", "a.*a", ReturnMatches)

By appending another question mark, you can switch the asterisk and plus sign to "non-greedy" matching:

strmatch("abracadabra", "a.*?a", ReturnMatches)

This does not return the shortest match (which would have been "aca" or "ada"). The call returns the first match looking from left to right from the starting position.

Example 7

Use curly braces to specify a number of repetitions of a subexpression:

strmatch("abracadabra", "(a(b|c|d)){2}"),
strmatch("abracadabra", "(a(b|c|d)){3}"),
strmatch("abracadabra", "(a(b|c|d)){4}")

These repetitions must be adjacent:

strmatch("abracadabra", "(abr){2}")

To get nonadjacent repetitions, use ".*". This combination means "anything without newlines".

strmatch("abracadabra", "(abr.*){2}")

Example 8

To indicate a range of possible repetitions, use a comma inside curly braces. For example, select the expressions representing binary numbers with three to five digits:

select(["11001", "1100111", "11", "11021"], strmatch, "^((0|1){3,5})$")

Here {3,5} specifies the range. You can omit the second number to remove the upper bound. For example, {3,} indicates that there must be three or more repetitions.

The following regular expression checks whether there is an "a" followed by at least three letters "b" followed by a "c" somewhere in the input string:

strmatch("abcd", "ab{3,}c"),
strmatch("abbbcd", "ab{3,}c"),
strmatch("abcdabbbc", "ab{3,}c")

By default, a repetition matches as many characters as possible, so strmatch returns the longest substring that matches from the first starting position. Append a question mark to the repetition to make it match as few characters as possible instead:

strmatch("abcdabcdabcd", "a.{2,8}d", ReturnMatches),
strmatch("abcdabcdabcd", "a.{2,8}?d", ReturnMatches)

Example 9

Characters enclosed in brackets ([ ]) form a "character class", which matches any of the characters in the class. This behavior is similar to an alternation between these characters.

strmatch("abc", "ab[cde]"),
strmatch("abd", "ab[cde]"),
strmatch("aba", "ab[cde]")

Inside character classes, the set of special characters is completely different. Dots, asterisks, plus and dollar signs, parentheses, and curly braces match themselves, but a character class starting with a caret is "negated" and matches any character not listed:

strmatch("abcd", "[^ab]", All)

If a caret is not the first character in a class, then it represents itself:

strmatch("x^2", "[*^]2")

If a dash (-) is not the first character in a class (apart from the caret), then it specifies a range of characters. Thus, to find a number with at least five digits, you can specify the pattern as follows:

strmatch("x = 123456...", "[0-9]{5,}", ReturnMatches)

The exact meaning of a range depends on the language settings of your computer. Technically, it depends on the "collating", which can be different for the same language on different versions of the same operating system. For example, "[a-z]" can match only lowercase ASCII characters on one computer, while on the second one it also matches the uppercase characters from A through Y, and on the third one includes the uppercase characters from B through Z. For this reason, the best practice is using the named character classes instead:

strmatch("some words", "[[:word:]]+", All)
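Named classes can also be combined with literal characters or negated, as described in Algorithms. The following calls are a minimal sketch of both forms. The input strings are made up for illustration, and no results are shown because the calls have not been checked against a running MuPAD session:

```mupad
// hexadecimal digits plus g and G, that is, digits in base 17
strmatch("1A3g", "^[[:xdigit:]gG]+$"),
// any nondigit, or the digit zero
strmatch("a0b1", "[0[:^digit:]]+", ReturnMatches)
```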

Example 10

Some character classes have a short form, such as "\\x", where x is w, W, s, S, d, or D. The uppercase letters mean the negation of the lowercase letters.

strmatch("abcd", "\\w"),
strmatch("abcd", "\\W"),
strmatch("abcd", "\\d"),
strmatch("abcd", "\\D")

Here, negation means that a character does not match whatever is negated:

strmatch("abcd 1", "\\w"),
strmatch("abcd 1", "\\W"),
strmatch("abcd 1", "\\d"),
strmatch("abcd 1", "\\D")

Use "\\b" to look for words starting with an a. The pattern "\\b" is a zero-width expression matching the place between a "word" and the spaces surrounding it (or the beginning and end of string).

strmatch("abc cbd cba (aa) b", "\\ba\\w*", All)

You can also use "\\b" to match the end of a word:

strmatch("abc cbd cba (aa) b", "\\w*a\\b", All)

Example 11

You can change the behavior of strmatch with modifier flags. For example, the i modifier enables case-insensitive matching. (The precise effects of case-insensitive matching depend on your language settings. For example, most English-language systems do not treat the German umlauts ä and Ä as the same letter up to case.) To enable case-insensitive matching for the whole expression, prefix it with "(?i)":

strmatch("ABC", "(?i)ab")

To limit the effect of the modifier to some part of the expression, use "(?i:...)":

strmatch("ABC", "(?i:a)b"),
strmatch("abc", "(?i:a)b"),
strmatch("Abc", "(?i:a)b")
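Two further modifier forms from the Algorithms section can be sketched in the same way: a minus sign switches a modifier off inside a group, and the x modifier lets the pattern contain spaces and comments. The strings below are illustrative assumptions, not examples from the original reference:

```mupad
// i is on for the whole pattern but switched off for the group containing B
strmatch("aB", "(?i)a(?-i:B)"),
strmatch("ab", "(?i)a(?-i:B)");

// with x, spaces in the pattern are ignored and # starts a comment
strmatch("1192.23", "(?x) \\d+ \\. \\d\\d  # digits, a dot, two digits")
```

If the modifiers behave as described, the first call should report a match and the second should not, because the lowercase b cannot match the case-sensitive B.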

Example 12

strmatch with ReturnMatches or All (without Index) returns the matched substrings. You can also return parts of those substrings. For example, extract all function names from this expression. To identify function names, note that every function name is followed by an opening parenthesis, optionally with white space in between.

s := "f(sin (x) + abc + def(x))":
strmatch(s, "\\b\\w+\\s*\\(", All)

To extract the function names themselves, use this command:

map(strmatch(s, "\\b(\\w+)\\s*\\(", All), op, 2)

Regular expressions can contain zero-width assertions. These assertions check that something does or does not follow, without including it in the match or moving the conceptual pointer past it. A more efficient approach is therefore to wrap the trailing part of the expression in "(?=...)":

strmatch(s, "\\b\\w+(?=\\s*\\()", All)
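A negative look-ahead, "(?!...)", inverts this test. As a rough sketch that reuses the string from this example, the following call looks for names that are not followed by an opening parenthesis. The second "\\b" is an addition made here so that the engine cannot satisfy the assertion by giving back the last letter of a function name:

```mupad
s := "f(sin (x) + abc + def(x))":
// a whole word that is not followed by optional white space and "("
strmatch(s, "\\b\\w+\\b(?!\\s*\\()", All)
```

The intent is to pick up x and abc while skipping f, sin, and def.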

Example 13

Regular expressions can also make zero-width assertions with respect to the preceding text. Such assertions must have a fixed width. For example, extract the amount of money mentioned in this string:

s := "In March 2005, we've spent $1192.23 on light.":
strmatch(s, "(?<=\\$)\\d+(?:\\.\\d\\d)?", All)
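The negative form, "(?<!...)", matches only where the corresponding look-behind does not. A minimal sketch with a made-up string: find runs of digits that are not part of a dollar amount. The class also excludes positions in the middle of a number, which would otherwise qualify because they are preceded by a digit rather than by a dollar sign:

```mupad
// digits preceded by neither a dollar sign nor another digit
strmatch("price: $42 for 17 items", "(?<![\\$\\d])\\d+", All)
```

Under these assumptions, only 17 should qualify.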

Example 14

To detect the positions of the matches in the input string, use the Index option. The returned list contains two numbers: the beginning and end of the match.

strmatch("abc", "b", Index)

If no match is found, strmatch returns FALSE:

strmatch("abc", "d", Index)

If you use both Index and ReturnMatches, then strmatch returns indices followed by the matched subexpressions:

strmatch("abc", "b.", ReturnMatches, Index)

Example 15

If you use All, then the return value is a set:

strmatch("abc", ".", All),
strmatch("abc", ".", Index, All),
strmatch("abc", ".", ReturnMatches, All)

Parameters

text, pattern

character strings

Options

Index

Return the position of the match. If there are no matches, strmatch returns FALSE. Otherwise, it returns the position of the match as a list of two integers, [i, j], such that text[i..j] is the matched substring.

ReturnMatches

Return the matched substrings. If the regular expression contains groups (subexpressions in parentheses), then strmatch returns lists containing the matched substring and the strings matched by the groups, in order of opening parentheses.

All

Return all matches that strmatch can find. By default, strmatch returns only the first match. If you do not use Index, then the All option also implies ReturnMatches.

Return Values

Without options, TRUE or FALSE is returned. With Index, a list of two nonnegative integers or FALSE is returned. With option ReturnMatches, a string or a list of strings is returned, depending on whether the pattern contains groups. With both Index and ReturnMatches, a list starting with the indices of the match, followed by the string or strings of ReturnMatches, is returned. With option All, a set is returned.
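Because the Index result is either FALSE or a list [i, j], code that uses it typically tests for FALSE first. The following is a minimal sketch of that pattern. The variable names txt and idx are chosen here for illustration only:

```mupad
txt := "x = 123456...":
idx := strmatch(txt, "[0-9]{5,}", Index):
if idx = FALSE then
  print("no match")
else
  // text[i..j] is the matched substring, as described for the Index option
  txt[idx[1]..idx[2]]
end_if;
delete txt, idx:
```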

Overloaded By

pattern, text

Algorithms

  • A dot (.) matches any character, except "\n". With the s modifier, a dot matches any character. See Example 2.

  • A caret (^) matches the beginning of a line. A dollar ($) matches the end of a line. Typically, ^ and $ mark the beginning and end of the string, but with the m modifier they also can appear after or before a "\n". See Example 4.

  • A pattern enclosed in parentheses (()) is considered "grouped."

  • A vertical bar (|) between two characters or groups (sub-regexes) lets you specify alternative matching patterns. Any one of the alternatives matching is sufficient. See Example 5.

  • A sub-regex followed by a number n enclosed in {} must match exactly n times.

    A sub-regex followed by {n,} must match at least n times.

    A sub-regex followed by {n,m} must match at least n and at most m times.

    In any other context, { and } are treated as normal characters.

    See Example 7 and Example 8.

  • Following a sub-regex, a question mark (?) works as {0,1}, making the sub-regex optional.

    A plus sign (+) in this context works as {1,} and allows an arbitrary positive number of repetitions.

    An asterisk after an expression is equivalent to {0,} and allows an arbitrary number of repetitions, including zero.

    See Example 6.

  • By default, {n,} and its three shorthand forms match as many characters as possible. By following them with another question mark (for example, "a(b[cd]){2,}?bd", "(0|1)*?12"), you can specify that strmatch must return the lowest number of characters consistent with the remainder of the pattern.

  • While a backslash (which must be typed as "\\") escapes any special character (including itself), it makes some characters following it special. See Example 10.

    • "\\w" matches a "word" character (alphanumeric or underline).

    • "\\W" matches a character not matched by "\\w".

    • "\\s" matches a white-space character (space, or tabulator, or, if the s modifier is active, also an end-of-line character).

    • "\\S" matches a character not matched by "\\s".

    • "\\d" matches a digit.

    • "\\D" matches a nondigit.

    • "\\b" matches the place between a word character and a nonword character, for example, the place where a word starts or ends.

    • "\\B" is also zero-width, but matches those places where "\\b" does not.

    • "\\A" and "\\Z" match at the beginning and end of the string, respectively. "\\Z" ignores a "\n" at the end of the string; "\\z" behaves like "\\Z", but does not ignore a trailing "\n".

    • "\\X" matches a grapheme cluster. For example, the letter ā is a grapheme cluster: it consists of a and the combining macron ̄. "\\X" lets you access ā as one entity.

  • Characters enclosed between [ and ] form a character class. See Example 9.

    A character class starting with ^ is negated, and matches all the characters not listed. The symbol ^ at any other place in the character class has no special meaning.

    Inside a character class, the special characters, except for a hyphen, do not have any special meaning. If a hyphen (-) is not the first character, then it creates a range of characters. The language settings of your operating system (technically speaking, the current locale) affect how strmatch interprets this range. Most likely, "[0-9]" represents any digit in every language setting.

    To specify character classes independent of language settings, use named access to POSIX character classes:

    • "[[:digit:]]" for any digit.

    • "[[:alpha:]]" for alphabetic characters (the language settings define what counts as a letter).

    • "[[:alnum:]]" for alphanumerical characters.

    • "[[:word:]]" for alphanumerical characters plus the underline (_).

    • "[[:punct:]]" for punctuation characters, such as a dot or a comma.

    • "[[:ascii:]]" for characters in the ASCII range (decimal codes 32 through 127).

    • "[[:blank:]]" for horizontal spaces, such as [ \t].

    • "[[:space:]]" for spaces, including end-of-line.

    • "[[:cntrl:]]" for control characters, such as newlines. Note that you cannot type most control characters in MuPAD®, but they can occur in strings read from files.

    • "[[:graph:]]" for the class of alphanumeric or punctuation characters, that is, characters with visual graphical representation.

    • "[[:print:]]" is equivalent to "[ [:graph:]]". It adds the space character to the graph class.

    • "[[:lower:]]" and "[[:upper:]]" for the characters that your language settings consider lowercase and uppercase letters. For example, a German system is more likely to know about ä being a lowercase letter than a U.S. system.

    • "[[:xdigit:]]" matches hexadecimal digits. It is equivalent to [0123456789aAbBcCdDeEfF].

    You can combine these classes with one another or add characters from one class to another class. For example, you can match septendecimal digits with "[[:xdigit:]gG]".

    You can negate POSIX character classes using a caret. For example, "[[:^digit:]]" matches nondigits. This is equivalent to "[^[:digit:]]", but "[0[:^digit:]]" to allow any nondigit or zero is more difficult to express otherwise.

  • Groups starting with (? have special meanings:

    • Groups starting with (?: behave like other groups, but do not create output matches for the ReturnMatches option.

    • "(?#text)" is a comment and effectively ignored.

    • Groups starting with (?X:, where X is one of i, m, s, x, locally apply modifiers:

      • i causes all pattern matching to be case-insensitive (as defined by the system's locale).

      • m causes a "multiline" match, where ^ and $ match after/before "\n" characters in the string.

      • s makes the dot match newlines.

      • x allows Perl-style comments in the pattern. In this case, strmatch ignores spaces in most contexts. The # characters start comments that extend to the end of the line.

      When using these options in an outer group, you can disable them by preceding them with a minus sign, as in "(?-i:aB)".

    • The string "(?X)", where X is one of the characters listed above, switches the corresponding setting on up to the end of the enclosing group.

    • (?= starts a positive zero-width lookahead assertion. This is a zero-width item (and therefore does not add something to the output) that matches if its contents match at the current position. See Example 12.

    • (?! starts a zero-width negative look-ahead assertion. It behaves almost identically to (?= except it matches if and only if (?= does not.

    • (?<= starts a positive zero-width look-behind assertion, which is like (?=, but looking in the other direction. Look-behind assertions must have a fixed width. See Example 13.

    • (?<! starts a negative zero-width look-behind assertion, which matches if and only if a (?<= at the same place does not match.
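The zero-width escapes listed above that have no worked example can be sketched as follows. The input strings are assumptions made for illustration. Going by the descriptions above, "\\Z" should accept the trailing newline while "\\z" should not:

```mupad
// b with a word character on both sides, that is, strictly inside a word
strmatch("abc cbd cba", "\\Bb\\B", All, Index);

// end-of-string anchors applied to a string with a trailing newline
strmatch("ab\n", "b\\Z"),
strmatch("ab\n", "b\\z")
```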
