Dynamic Regular Expressions

Introduction

In a dynamic expression, you can make the pattern that you want regexp to match dependent on the content of the input string. In this way, you can more closely match varying input patterns in the string being parsed. You can also use dynamic expressions in replacement strings for use with the regexprep function. This gives you the ability to adapt the replacement text to the parsed input.

You can include any number of dynamic expressions in the match_expr or replace_expr arguments of these commands:

regexp(string, match_expr)
regexpi(string, match_expr)
regexprep(string, match_expr, replace_expr)

As an example of a dynamic expression, the following regexprep command correctly replaces the term internationalization with its abbreviated form, i18n. However, to use it on a different term such as globalization, you have to use a different replacement expression:

match_expr = '(^\w)(\w*)(\w$)';

replace_expr1 = '$118$3';
regexprep('internationalization', match_expr, replace_expr1)
ans =
    i18n
replace_expr2 = '$111$3';
regexprep('globalization', match_expr, replace_expr2)
ans =
    g11n

Using a dynamic expression ${num2str(length($2))} enables you to base the replacement expression on the input string so that you do not have to change the expression each time. This example uses the dynamic replacement syntax ${cmd}.

match_expr = '(^\w)(\w*)(\w$)';
replace_expr = '$1${num2str(length($2))}$3';

regexprep('internationalization', match_expr, replace_expr)
ans =
    i18n
regexprep('globalization', match_expr, replace_expr)
ans =
    g11n

When parsed, a dynamic expression must correspond to a complete, valid regular expression. In addition, dynamic match expressions that use the backslash escape character (\) require two backslashes: one for the initial parsing of the expression, and one for the complete match. The parentheses that enclose dynamic expressions do not create a capturing group.

There are three forms of dynamic expressions that you can use in match expressions, and one form for replacement expressions, as described in the following sections

Dynamic Match Expressions — (??expr)

The (??expr) operator parses expression expr, and inserts the results back into the match expression. MATLAB® then evaluates the modified match expression.

Here is an example of the type of expression that you can use with this operator:

str = {'5XXXXX', '8XXXXXXXX', '1X'};
regexp(str, '^(\d+)(??X{$1})$', 'match', 'once');

The purpose of this particular command is to locate a series of X characters in each of the strings stored in the input cell array. Note however that the number of Xs varies in each string. If the count did not vary, you could use the expression X{n} to indicate that you want to match n of these characters. But, a constant value of n does not work in this case.

The solution used here is to capture the leading count number (e.g., the 5 in the first string of the cell array) in a token, and then to use that count in a dynamic expression. The dynamic expression in this example is (??X{$1}), where $1 is the value captured by the token \d+. The operator {$1} makes a quantifier of that token value. Because the expression is dynamic, the same pattern works on all three of the input strings in the cell array. With the first input string, regexp looks for five X characters; with the second, it looks for eight, and with the third, it looks for just one:

regexp(str, '^(\d+)(??X{$1})$', 'match', 'once')
ans = 
    '5XXXXX'    '8XXXXXXXX'    '1X'

Commands That Modify the Match Expression — (??@cmd)

MATLAB uses the (??@cmd) operator to include the results of a MATLAB command in the match expression. This command must return a string that can be used within the match expression.

For example, use the dynamic expression (??@flilplr($1)) to locate a palindrome string, "Never Odd or Even", that has been embedded into a larger string.

First, create the input string. Make sure that all letters are lowercase, and remove all nonword characters.

str = lower(...
  'Find the palindrome Never Odd or Even in this string');

str = regexprep(str, '\W*', '')
str =
   findthepalindromeneveroddoreveninthisstring

Locate the palindrome within the string using the dynamic expression:

palstr = regexp(str, '(.{3,}).?(??@fliplr($1))', 'match')
str =
   'neveroddoreven'

The dynamic expression reverses the order of the letters that make up the string, and then attempts to match as much of the reversed-order string as possible. This requires a dynamic expression because the value for $1 relies on the value of the token (.{3,}).

Dynamic expressions in MATLAB have access to the currently active workspace. This means that you can change any of the functions or variables used in a dynamic expression just by changing variables in the workspace. Repeat the last command of the example above, but this time define the function to be called within the expression using a function handle stored in the base workspace:

fun = @fliplr;

palstr = regexp(str, '(.{3,}).?(??@fun($1))', 'match')
palstr =
   'neveroddoreven'

Commands That Serve a Functional Purpose — (?@cmd)

The (?@cmd) operator specifies a MATLAB command that regexp or regexprep is to run while parsing the overall match expression. Unlike the other dynamic expressions in MATLAB, this operator does not alter the contents of the expression it is used in. Instead, you can use this functionality to get MATLAB to report just what steps it is taking as it parses the contents of one of your regular expressions. This functionality can be useful in diagnosing your regular expressions.

The following example parses a word for zero or more characters followed by two identical characters followed again by zero or more characters:

regexp('mississippi', '\w*(\w)\1\w*', 'match')
ans = 
    'mississippi'

To track the exact steps that MATLAB takes in determining the match, the example inserts a short script (?@disp($1)) in the expression to display the characters that finally constitute the match. Because the example uses greedy quantifiers, MATLAB attempts to match as much of the string as possible. So, even though MATLAB finds a match toward the beginning of the string, it continues to look for more matches until it arrives at the very end of the string. From there, it backs up through the letters i then p and the next p, stopping at that point because the match is finally satisfied:

regexp('mississippi', '\w*(\w)(?@disp($1))\1\w*', 'match')
i
p
p

ans = 
    'mississippi'

Now try the same example again, this time making the first quantifier lazy (*?). Again, MATLAB makes the same match:

regexp('mississippi', '\w*?(\w)\1\w*', 'match')
ans = 
    'mississippi'

But by inserting a dynamic script, you can see that this time, MATLAB has matched the string quite differently. In this case, MATLAB uses the very first match it can find, and does not even consider the rest of the string:

regexp('mississippi', '\w*?(\w)(?@disp($1))\1\w*', 'match')
m
i
s

ans = 
    'mississippi'

To demonstrate how versatile this type of dynamic expression can be, consider the next example that progressively assembles a cell array as MATLAB iteratively parses the input string. The (?!) operator found at the end of the expression is actually an empty lookahead operator, and forces a failure at each iteration. This forced failure is necessary if you want to trace the steps that MATLAB is taking to resolve the expression.

MATLAB makes a number of passes through the input string, each time trying another combination of letters to see if a fit better than last match can be found. On any passes in which no matches are found, the test results in an empty string. The dynamic script (?@if(~isempty($&))) serves to omit these strings from the matches cell array:

matches = {};
expr = ['(Euler\s)?(Cauchy\s)?(Boole)?(?@if(~isempty($&)),' ...
   'matches{end+1}=$&;end)(?!)'];

regexp('Euler Cauchy Boole', expr);

matches
matches = 
    'Euler Cauchy Boole'    'Euler Cauchy '    'Euler '    
'Cauchy Boole'    'Cauchy '    'Boole'

The operators $& (or the equivalent $0), $`, and $' refer to that part of the input string that is currently a match, all characters that precede the current match, and all characters to follow the current match, respectively. These operators are sometimes useful when working with dynamic expressions, particularly those that employ the (?@cmd) operator.

This example parses the input string looking for the letter g. At each iteration through the string, regexp compares the current character with g, and not finding it, advances to the next character. The example tracks the progress of scan through the string by marking the current location being parsed with a ^ character.

(The $` and operators capture that part of the string that precedes and follows the current parsing location. You need two single-quotation marks ($'') to express the sequence when it appears within a string.)

str = 'abcdefghij';
expr = '(?@disp(sprintf(''starting match: [%s^%s]'',$`,$'')))g';

regexp(str, expr, 'once');
starting match: [^abcdefghij]
starting match: [a^bcdefghij]
starting match: [ab^cdefghij]
starting match: [abc^defghij]
starting match: [abcd^efghij]
starting match: [abcde^fghij]
starting match: [abcdef^ghij]

Commands in Replacement Expressions — ${cmd}

The ${cmd} operator modifies the contents of a regular expression replacement string, making this string adaptable to parameters in the input string that might vary from one use to the next. As with the other dynamic expressions used in MATLAB, you can include any number of these expressions within the overall replacement expression.

In the regexprep call shown here, the replacement string is '${convertMe($1,$2)}'. In this case, the entire replacement string is a dynamic expression:

regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}');

The dynamic expression tells MATLAB to execute a function named convertMe using the two tokens (\d+\.?\d*) and (\w+), derived from the string being matched, as input arguments in the call to convertMe. The replacement string requires a dynamic expression because the values of $1 and $2 are generated at runtime.

The following example defines the file named convertMe that converts measurements from imperial units to metric.

function valout  = convertMe(valin, units)
switch(units)
    case 'inches'
        fun = @(in)in .* 2.54;
        uout = 'centimeters';
    case 'miles'
        fun = @(mi)mi .* 1.6093;
        uout = 'kilometers';
    case 'pounds'
        fun = @(lb)lb .* 0.4536;
        uout = 'kilograms';
    case 'pints'
        fun = @(pt)pt .* 0.4731;
        uout = 'litres';
    case 'ounces'
        fun = @(oz)oz .* 28.35;
        uout = 'grams';
end
val = fun(str2num(valin));
valout = [num2str(val) ' ' uout];
end

At the command line, call the convertMe function from regexprep, passing in values for the quantity to be converted and name of the imperial unit:

regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans =
   This highway is 201.1625 kilometers long
regexprep('This pitcher holds 2.5 pints of water', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans =
   This pitcher holds 1.1828 litres of water
regexprep('This stone weighs about 10 pounds', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans =
   This stone weighs about 4.536 kilograms

As with the (??@ ) operator discussed in an earlier section, the ${ } operator has access to variables in the currently active workspace. The following regexprep command uses the array A defined in the base workspace:

A = magic(3)
A =
     8     1     6
     3     5     7
     4     9     2
regexprep('The columns of matrix _nam are _val', ...
          {'_nam', '_val'}, ...
          {'A', '${sprintf(''%d%d%d '', A)}'})
ans =
The columns of matrix A are 834 159 672

See Also

| |

More About

Was this topic helpful?