
regexp - Match regular expression (case sensitive)

Syntax

regexp(string,expr)
[matchstart,matchend,tokenindices,matchstring,tokenstring,tokenname,splitstring] = regexp(string,expr)
[selected_outputs] = regexp(string,expr,outselect)
regexp(string,expr,options)

Description

regexp(string,expr) parses the input string locating those parts of string that match the character pattern(s) specified by regular expression, expr. This syntax returns the starting index of each match. If no matches are found, regexp returns an empty array.

[matchstart,matchend,tokenindices,matchstring,tokenstring,tokenname,splitstring] = regexp(string,expr) returns from one to seven output values, depending on the number of output variables you specify.

[selected_outputs] = regexp(string,expr,outselect) returns from one to seven output values, depending on which flags you specify in outselect. The presence and ordering of the outselect inputs determine the presence and ordering of the corresponding outputs.

regexp(string,expr,options) calls regexp with one or more of the nondefault options listed in the Command Options table, below. These options must follow the string and expr inputs in the argument list.
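
The four calling forms above can be compared side by side. This is a minimal sketch; the sample string, the expression, and the variable names are illustrative assumptions and do not come from this page.

```matlab
str  = 'abc123def45';   % illustrative input string
expr = '\d+';           % one or more consecutive digits

% One output: the starting index of each match.
startIdx = regexp(str, expr);

% Several outputs in the fixed order: start indices, then end indices.
[s, e] = regexp(str, expr);

% outselect keywords: the outputs follow the order of the keywords.
[m, s2] = regexp(str, expr, 'match', 'start');

% An option ('once') placed after the string and expr inputs.
first = regexp(str, expr, 'match', 'once');
```

Here the digits occupy positions 4 to 6 and 10 to 11, so s is [4 10] and e is [6 11].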

Input Arguments

string

A string or cell array of strings containing the text you want to parse. Each string can be of any length and can contain any characters.

expr

A string or cell array of strings containing a MATLAB regular expression. This input consists of text and operators with which you specify character patterns to look for in string.

This table shows the main categories of metacharacters and operators that you can use in expr.

Category | Metacharacters and Operators
Character Type Operators | One of a certain group of characters (for example, a character in a predefined set or range, a whitespace character, an alphabetic, numeric, or underscore character, or a character that is not in one of these groups).
Character Representation | Metacharacters that represent a special character (for example, backslash, new line, tab, hexadecimal values, any untranslated literal character, etc.).
Grouping Operators | A grouping of letters or metacharacters to apply a regular expression operator to.
Nonmatching Operators | Text included in an expression for the purpose of adding a comment statement, but not to be used as a pattern to find a match for.
Positional Operators | Location in the string where the characters or pattern must be positioned for there to be a match (for example, start or end of the string, start or end of a word, an entire word).
Lookaround Operators | Characters or patterns that immediately precede or follow the intended match, but are not considered to be part of the match itself.
Quantifiers | Various ways of expressing the number of times a character or pattern is to occur for there to be a match (for example, exact number, minimum, maximum, zero or one, zero or more, one or more, etc.).
Tokens | Characters or patterns selected from the string being parsed that you can use to match other characters in the string.
Named Capture | Operators used in assigning names to matched tokens, thus making your code more maintainable and the output easier to interpret.
Conditional Expressions | Operators that express conditions under which a certain match is considered to be acceptable.
Dynamic Regular Expressions | Operators that include a subexpression or command that MATLAB parses or executes. MATLAB uses the result of that operation in parsing the overall expression.
String Replacement | Operators used with the regexprep function to specify the content of the replacement text.

Any text in the expression must be an exact match for at least part of the text in the parse string. Operators, on the other hand, are symbolic. Each operator symbol stands for a type of character (for example, an uppercase letter ([A-Z]), a space character (\s), four characters of any type (.{4})).
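
The difference between literal text and operators can be sketched as follows. The sample string and variable names are illustrative assumptions; the three operators are the ones named in the paragraph above.

```matlab
str = 'Hello World';   % illustrative input string

% Literal text must match exactly (and case-sensitively).
literalIdx = regexp(str, 'World');

% [A-Z] stands for any single uppercase letter.
upperIdx = regexp(str, '[A-Z]');

% \s stands for any single whitespace character.
spaceIdx = regexp(str, '\s');

% .{4} stands for any four characters in a row.
fourChars = regexp(str, '.{4}', 'match');
```

The last call returns 'Hell' and 'o Wo'; the trailing 'rld' is only three characters long, so it does not match.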

outselect

An optional comma-separated list of one to seven keywords. The presence of any of these keywords in the input argument list tells regexp to return the corresponding output.

outselect Keyword | Selected Return Value
'start' | matchstart
'end' | matchend
'tokenExtents' | tokenindices
'match' | matchstring
'tokens' | tokenstring
'names' | tokenname
'split' | splitstring

You must supply one output variable for each outselect keyword you include as an input argument. The order of the keyword in the input argument list determines the order of the corresponding output in the output argument list.
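
As a brief sketch of how keyword order drives output order, the two calls below request the same information in opposite orders. The input string and variable names are illustrative assumptions.

```matlab
str = 'Room 42';   % illustrative input string

% 'match' first, then 'start': text comes back first, indices second.
[txt, idx] = regexp(str, '\d', 'match', 'start');

% Swapping the keywords swaps the outputs.
[idx2, txt2] = regexp(str, '\d', 'start', 'match');
```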

For a description of all return values, see Output Arguments.

options

regexp accepts one or more of the following options.

Command Options

Option | Description
'once' | Return only the first match found.
'warnings' | Display warnings regarding potential issues parsing the expression. This option only enables warnings for the command being executed.
mode | Modify the parameters of a search. See the section on mode that follows this table.
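
The effect of 'once' on the shape of the result can be sketched like this. The input string and variable names are illustrative assumptions.

```matlab
str = 'abc123def45';   % illustrative input string

% Without 'once': every match, returned as a cell array of strings.
allMatches = regexp(str, '\d+', 'match');

% With 'once': only the first match, returned as a plain string.
firstMatch = regexp(str, '\d+', 'match', 'once');
```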

The mode Option.  You can fine-tune your regular expression parsing using the optional mode inputs: Case Sensitivity, Empty Match, Dot Matching, Anchor Type, and Spacing. There are two ways to use the regexp modes:

  • Use a mode keyword (for example, 'lineanchors') to apply the mode to the entire input string.

  • Use a mode flag (for example, (?m)) to apply the mode to selected parts of the input string. (Not available for Empty Match mode.)

The mode option is available for the regexp, regexpi, and regexprep functions. For more information about regexp modes, see Modifying Parameters of the Search (Modes) in the MATLAB "Programming Fundamentals" documentation.
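
The two ways of selecting a mode can be sketched with Case Sensitivity mode. In this example the flag is placed at the very start of the expression, so it covers the whole expression and should give the same result as the keyword. The input string and variable names are illustrative assumptions.

```matlab
str = 'Case CASE case';   % illustrative input string

% Default behavior for regexp: case-sensitive.
defaultMatches = regexp(str, 'case', 'match');

% Mode keyword: applies to the entire expression.
byKeyword = regexp(str, 'case', 'match', 'ignorecase');

% Mode flag at the start of the expression: same effect here.
byFlag = regexp(str, '(?i)case', 'match');
```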

Case Sensitivity Mode

Use the Case Sensitivity mode to control whether or not MATLAB considers letter case when matching an expression to a string.

Mode Keyword | Flag | Description
'matchcase' | (?-i) | Letter case must match when matching patterns to a string. (The default for regexp.)
'ignorecase' | (?i) | Do not consider letter case when matching patterns to a string. (The default for regexpi.)

Empty Match Mode

Use Empty Match mode to allow successful matches of length zero. Using this mode, you can match a location in a string in which nothing but the assertion is true.

Mode Keyword | Description
'noemptymatch' | Ignore zero length matches. (This is the default.)
'emptymatch' | Allow matches of length zero.

Dot Matching Mode

Use the Dot Matching mode to control whether or not MATLAB includes the newline (\n) character when matching the dot (.) metacharacter in a regular expression.

Mode Keyword | Flag | Description
'dotall' | (?s) | Match dot ('.') in the pattern string with any character. (This is the default.)
'dotexceptnewline' | (?-s) | Match dot in the pattern with any character that is not a newline.

Anchor Type Mode

Use the Anchor Type mode to control whether MATLAB considers the ^ and $ metacharacters to represent the beginning and end of a string or the beginning and end of a line.

Mode Keyword | Flag | Description
'stringanchors' | (?-m) | Match the ^ and $ metacharacters at the beginning and end of a string. (This is the default.)
'lineanchors' | (?m) | Match the ^ and $ metacharacters at the beginning and end of a line.

Spacing Mode

Use the Spacing mode to control how MATLAB interprets space characters and comments within the parsing string. Note that spacing mode applies to the parsing string (the second input argument, which contains the metacharacters, for example, \w) and not to the string being parsed.

Mode Keyword | Flag | Description
'literalspacing' | (?-x) | Parse space characters and comments (the # character and any text to the right of it) in the same way as any other characters in the string. (This is the default.)
'freespacing' | (?x) | Ignore spaces and comments when parsing the string. (You must use '\ ' and '\#' to match space and # characters.)
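
A small sketch of the two spacing modes on a one-line expression follows. The input strings, the expression, and the variable names are illustrative assumptions.

```matlab
str  = 'ab12';                                    % illustrative input string
expr = '[a-z]+ \d+   # letters, then digits';     % spaces and a comment

% freespacing: spaces and the trailing comment in expr are ignored,
% so the expression behaves like '[a-z]+\d+'.
freeMatch = regexp(str, expr, 'match', 'freespacing');

% literalspacing (default): the spaces and the '#...' text are treated as
% literal characters, which do not occur in str, so nothing matches.
literalMatch = regexp(str, expr, 'match');

% In freespacing mode, an escaped space ('\ ') matches a real space.
spaceMatch = regexp('a b', 'a \  b', 'match', 'freespacing');
```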

Output Arguments

Each of the first seven outputs listed below is returned as a 1-by-m array, where m is the number of matches found by regexp.

When parsing multiple strings (i.e., a 1-by-n or n-by-1 cell array of strings), regexp returns a 1-by-n cell array for each output specified. When parsing an m-by-n cell array of strings, regexp returns an m-by-n cell array for each output specified.
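
As a sketch of the cell array case, the call below parses a 1-by-3 cell array of strings and receives one result cell per input string. The input strings and variable names are illustrative assumptions.

```matlab
strs = {'a1', 'b22', 'c'};   % illustrative 1-by-3 cell array of strings

% Each element of out holds the matches for the corresponding input string.
out = regexp(strs, '\d+', 'match');

firstResult  = out{1};   % matches found in 'a1'
secondResult = out{2};   % matches found in 'b22'
thirdResult  = out{3};   % 'c' contains no digits, so this cell is empty
```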

matchstart

The starting index of each substring of string that matches expression, expr. The output is an array of class double.

matchend

The ending index of each substring of string that matches expression, expr. The output is an array of class double.

To obtain this return value, you must call regexp with at least two output variables, or specify the 'end' keyword as one of the outselect input arguments.

tokenindices

The starting and ending indices of each substring of string that matches a token in expr. The output is a cell array, each cell of which contains an array of class double. The size of this inner array is m-by-2, where m is the number of tokens captured by the match. Any cells that do not represent matched tokens are returned empty. (The output is a double when you call regexp with the 'once' option.)

To obtain this return value, you must call regexp with at least three output variables, or specify the 'tokenExtents' keyword as one of the outselect input arguments.
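
The layout of tokenindices can be sketched with an expression that has two tokens. The input string and variable names are illustrative assumptions.

```matlab
str  = 'ab12 cd345';        % illustrative input string
expr = '([a-z]+)(\d+)';     % token 1: letters, token 2: digits

% One cell per match; each cell is a 2-by-2 array here (two tokens),
% with one [start end] row per token.
ext = regexp(str, expr, 'tokenExtents');

% With 'once', the result is the numeric array for the first match only.
extOnce = regexp(str, expr, 'tokenExtents', 'once');
```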

matchstring

The text of each substring of string that is a match. The output is a cell array of strings (or a single string when you call regexp with the 'once' option).

To obtain this return value, you must call regexp with at least four output variables, or specify the 'match' keyword as one of the outselect input arguments.

tokenstring

The text of each token captured by regexp. The output is a cell array of 1-by-n cell arrays, where n is the number of token expressions specified in the expr input. (The output is a cell array of strings when you call regexp with the 'once' option).

To obtain this return value, you must call regexp with at least five output variables, or specify the 'tokens' keyword as one of the outselect input arguments.
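
The nesting of the tokenstring output can be sketched as follows. The input string and variable names are illustrative assumptions.

```matlab
str  = 'key=value; mode=fast';   % illustrative input string
expr = '(\w+)=(\w+)';            % two token expressions

% One inner 1-by-2 cell array per match, one string per token.
tok = regexp(str, expr, 'tokens');

% With 'once', only the inner cell array for the first match is returned.
tokOnce = regexp(str, expr, 'tokens', 'once');
```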

tokenname

The name and text of each named token captured by regexp. The output is an array of structures, each containing n fields, where n is the number of named token expressions specified in the expr input. Field names of the returned structure are set to the token names, and field values are the text of those tokens. Named tokens are generated by expressions of the form (?<tokenname>expr). If there are no named tokens in expr, regexp returns a structure array with no fields.

To obtain this return value, you must call regexp with at least six output variables, or specify the 'names' keyword as one of the outselect input arguments.
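
A minimal sketch of named tokens follows. The date string, the token names, and the variable names are illustrative assumptions.

```matlab
str  = '2012-03-15';                                   % illustrative input
expr = '(?<year>\d+)-(?<month>\d+)-(?<day>\d+)';       % three named tokens

% The result is a structure whose field names are the token names.
names = regexp(str, expr, 'names');

yearText  = names.year;
monthText = names.month;
dayText   = names.day;
```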

splitstring

Those parts of the input string that are delimited by substrings matched by the expression. The output is a cell array of strings. Any cells that represent strings of zero length are returned empty. When using the 'split' keyword, regexp returns one more string than the number of matches in the string.

To obtain this return value, you must call regexp with all seven output variables, or specify the 'split' keyword as one of the outselect input arguments.
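
The relationship between the number of matches and the number of split pieces can be sketched like this. The input string and variable names are illustrative assumptions.

```matlab
str = 'a,b,,c';   % illustrative input with two adjacent delimiters

% Positions of the delimiter matches.
commaIdx = regexp(str, ',');

% Pieces between the delimiters; adjacent commas yield an empty piece.
parts = regexp(str, ',', 'split');

% There is always one more piece than there are matches.
nMatches = numel(commaIdx);
nParts   = numel(parts);
```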

selected_outputs

From one to seven of the output values listed above. Select which outputs to include using the outselect input arguments. Specify which outputs you want returned and in what order you want them returned using the keywords shown in the table.

For example, to have regexp return the starting index and matching text, specify the keywords 'start' and 'match':

[index, matstr] = regexp('Jan 24','\d',...
'start','match')

Selected Outputs | Keyword Required
matchstart | 'start'
matchend | 'end'
tokenindices | 'tokenExtents'
matchstring | 'match'
tokenstring | 'tokens'
tokenname | 'names'
splitstring | 'split'

Examples

Return a row vector of indices that match words that start with c, end with t, and contain one or more vowels between them. Make the matches sensitive to letter case (by using the case-sensitive regexp):

str = 'bat cat can car COAT court cut ct CAT-scan';
regexp(str, 'c[aeiou]+t')
ans =
     5    28
 

Return a cell array of row vectors of indices that match capital letters and white spaces in the cell array of strings str:

str = {'Madrid, Spain' 'Romeo and Juliet' 'MATLAB is great'};
s1 = regexp(str, '[A-Z]');
s2 = regexp(str, '\s');

Capital letters, '[A-Z]', were found at these str indices:

s1{:}
ans =
     1     9
ans =
     1    11
ans =
     1     2     3     4     5     6

Space characters, '\s', were found at these str indices:

s2{:}
ans =
     8
ans =
     6    10
ans =
     7    10
 

Return the text and the starting and ending indices of words containing the letter x:

str = 'REGEXP helps you relax';
[m s e] = regexp(str, '\w*x\w*', 'match', 'start', 'end')
m = 
    'relax'
s =
     18
e =
     22
 

Find the substrings delimited by the ^ character:

s1 = ['Use REGEXP to split ^this string into ' ...
      'several ^individual pieces'];

s2 = regexp(s1, '\^', 'split');

s2(:)
ans = 
    'Use REGEXP to split '
    'this string into several '
    'individual pieces'
 

Use the 'split' keyword to return those parts of the input string that are not returned when using 'match'. Note that when you match the beginning or ending characters in a string (as is done in this example), the first (or last) return value is always an empty string:

str = 'She sells sea shells by the seashore.';

[matchstr splitstr] = regexp(str, '[Ss]h.', 'match', ...
                             'split')
matchstr = 
    'She'    'she'    'sho'
splitstr = 
     ''    ' sells sea '    'lls by the sea'    're.'

For any string that has been split, you can reassemble the pieces into the initial string using the command

j = [splitstr; [matchstr {''}]]; [j{:}]

ans =
   She sells sea shells by the seashore.
 

Search a string for opening and closing HTML tags. Use the expression <(\w+) to find the opening tag (for example, '<tagname') and to create a token for it. Use the expression </\1> to find another occurrence of the same token, but formatted as a closing tag (for example, '</tagname>'):

str = ['if <code>A</code> == x<sup>2</sup>, ' ...
       '<em>disp(x)</em>']
str =
if <code>A</code> == x<sup>2</sup>, <em>disp(x)</em>

expr = '<(\w+).*?>.*?</\1>';

[tok mat] = regexp(str, expr, 'tokens', 'match');

tok{:}
ans = 
    'code'
ans = 
    'sup'
ans = 
    'em'

mat{:}
ans =
    <code>A</code>
ans =
    <sup>2</sup>
ans =
    <em>disp(x)</em>

For information on using tokens, see Tokens in the MATLAB Programming Fundamentals documentation.

 
  1. Enter a string containing two names, the first and last names being in a different order:

    str = sprintf('John Davis\nRogers, James')
    str =
        John Davis
        Rogers, James
  2. Create an expression that generates first and last name tokens, assigning the names first and last to the tokens. Call regexp to get the text and names of each token found:

    expr = ...
       '(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)';
    
    [tokens names] = regexp(str, expr, 'tokens', 'names');
  3. Examine the tokens cell array that was returned. The first and last name tokens appear in the order in which they were generated: first name–last name, then last name–first name:

    tokens{:}
    ans = 
        'John'    'Davis'
    ans = 
        'Rogers'    'James'
  4. Now examine the names structure that was returned. First and last names appear in a more usable order:

    names(:,1)
    ans = 
        first: 'John'
         last: 'Davis'
    
    names(:,2)
    ans = 
        first: 'James'
         last: 'Rogers'
 

Use Case Sensitivity mode with regexp.

  1. Given a string that has both uppercase and lowercase letters, use the regexp default mode (case-sensitive) to locate only the lowercase instance of the word case:

    str = 'A string with UPPERCASE and lowercase text.';
    
    regexp(str, 'case', 'match')
    ans = 
        'case'
  2. Now disable case-sensitive matching to find both instances of case:

    regexp(str, 'case', 'ignorecase', 'match')
    ans = 
        'CASE'    'case'
  3. Match 5 letters that are followed by 'CASE'. Use the (?-i) flag to turn off case insensitivity for the first match and (?i) to turn insensitivity on for the second:

    M = regexp(str, {'(?-i)\w{5}(?=CASE)', ...
                     '(?i)\w{5}(?=CASE)'}, 'match');
    
    M{:}
    ans = 
        'UPPER'
    ans = 
        'UPPER'    'lower'
 

Use Empty Match mode with regexprep to add the characters '=> ' to the beginning of each line in the following text:

text = {'Use the REGEXPREP function in '; ...
  'Empty Match mode to insert a right '; ...
  'arrow at the start of each line.'};

newtext = regexprep(text, '^', '=> ', 'emptymatch')

newtext =
'=> Use the REGEXPREP function in '
'=> Empty Match mode to insert a right '
'=> arrow at the start of each line.'

The ^ operator matches the beginning of each string. Because there are no other characters to match, and the emptymatch mode is enabled by the fourth input argument of the command, there is an empty match at the beginning of each string.

 
  1. Search the string 'MATLAB' for zero or more of the characters M, A, and T. Do not include matches of zero-length. (The mode defaults to noemptymatch):

    regexp('MATLAB', '[MAT]*', 'match')	
    ans = 
        'MAT'    'A'
  2. Repeat this search, but this time specify emptymatch mode to include matches of zero-length.

      Note   There is also a match for the position that follows the final character.

    regexp('MATLAB', '[MAT]*', 'emptymatch', 'match')
    ans = 
        'MAT'    ''    'A'    ''    ''
    
  3. Now, use the + (one or more) operator instead of the * (zero or more) operator. Continue to use emptymatch mode:

    regexp('MATLAB', '[MAT]+', 'emptymatch', 'match')
    ans = 
        'MAT'    'A'

    No match is found at the L and B characters this time because regexp needs to find at least one occurrence of M, A, or T for a match to be found.

 

Use Dot Matching mode with regexp to parse the following string, which contains a newline (\n) character:

str = sprintf('abc\ndef')
str =
   abc
   def

When you use the default mode, dotall, MATLAB includes the newline in the characters matched:

regexp(str, '.', 'match')
ans = 
   'a'   'b'   'c'   [1x1 char]   'd'   'e'   'f'

When you use the dotexceptnewline mode, MATLAB skips the newline character:

regexp(str, '.', 'match', 'dotexceptnewline')
ans = 
   'a'   'b'   'c'   'd'   'e'   'f'
 

Use Anchor Type mode with regexp to control whether you parse a multiline string line by line, or once for the entire string. Use the following two-line string:

str = sprintf('%s\n%s', 'Here is the first line', ...
              'followed by the second line')
str =
   Here is the first line
   followed by the second line

In stringanchors mode, MATLAB interprets the $ metacharacter as an end-of-string specifier, and thus finds the last two words of the entire string:

regexp(str, '\w+\W\w+$', 'match', 'stringanchors')
ans = 
    'second line'

While in lineanchors mode, MATLAB interprets $ as an end-of-line specifier, and finds the last two words of each line:

regexp(str, '\w+\W\w+$', 'match', 'lineanchors')
ans = 
    'first line'    'second line'
 

Use Spacing mode with regexp to read a regular expression from a file, retrieving all operators and metacharacters from the file, but skipping any comment text:

  1. Create a file called regexp_str.txt containing the following text.

    (?x)    # turn on freespacing.
    
    # This pattern matches a string with a repeated letter.
    
    \w*     # First, match any number of preceding word characters.
        
    (       # Mark a token.
    .       # Match a character of any type.
    )       # Finish capturing said token.
    \1      # Backreference to match what token #1 matched.
    
    \w*     # Finally, match the remainder of the word.
  2. Use the pattern expression read from the file to find those words that have consecutive matching letters. Because the first line enables freespacing mode, MATLAB ignores all spaces and comments that appear in the file:

    str = ['Looking for words with letters that ' ...
           'appear twice in succession.'];
    patt = fileread('regexp_str.txt');
    
    regexp(str, patt, 'match')
    ans = 
        'Looking'    'letters'    'appear'    'succession'
 

Debug problems in parsing a string with regexp, regexpi, or regexprep, using the 'warnings' option to view all warning messages:

regexp('$.', '[a-]','warnings')
Warning: Unbound range.
 [a-]
   | 

See Also

regexpi | regexprep | regexptranslate | strcmp | strcmpi | strfind | strncmp | strncmpi

© 1984-2012 The MathWorks, Inc.