regexp(string,expr)
[matchstart,matchend,tokenindices,matchstring,tokenstring,tokenname,splitstring]
= regexp(string,expr)
[selected_outputs] = regexp(string,expr,outselect)
regexp(string,expr,options)
regexp(string,expr) parses the input string locating those parts of string that match the character pattern(s) specified by regular expression, expr. This syntax returns the starting index of each match. If no matches are found, regexp returns an empty array.
[matchstart,matchend,tokenindices,matchstring,tokenstring,tokenname,splitstring] = regexp(string,expr) returns from one to seven output values, depending on the number of output variables you specify.
[selected_outputs] = regexp(string,expr,outselect) returns from one to seven output values, depending on which flags you specify in outselect. The presence and ordering of the outselect inputs determine the presence and ordering of the corresponding outputs.
regexp(string,expr,options) calls regexp with one or more of the nondefault options listed in the Command Options table, below. These options must follow the string and expr inputs in the argument list.
For more information on this and other regular expression functions, see Regular Expressions in the MATLAB Programming Fundamentals documentation.
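The calling forms described above can be sketched as follows. This is a minimal illustration, not an exhaustive one; the sample text and the variable names (startIdx, noMatch, s, e, m, s2) are made up for this sketch.

```matlab
% Hypothetical sample text for illustration.
str = 'Jan 24, Feb 3';

% One output: the starting index of each match.
startIdx = regexp(str, '\d+');        % [5 13]

% No match: regexp returns an empty array.
noMatch = regexp(str, 'xyz');         % isempty(noMatch) is true

% Two outputs, default order: start indices, then end indices.
[s, e] = regexp(str, '\d+');          % s is [5 13], e is [6 13]

% outselect keywords choose which outputs come back and in what order.
[m, s2] = regexp(str, '\d+', 'match', 'start');   % m is {'24','3'}
```

Because the keywords fix the order, swapping 'match' and 'start' in the last call would swap the two outputs as well.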
The string and expr inputs are required. Enter these as the first and second arguments, respectively. Any other input arguments are optional. Enter optional inputs, in any order, following the two required inputs.
string
A string or cell array of strings containing the text you want to parse. Each string can be of any length and can contain any characters.
expr
A string or cell array of strings containing a MATLAB regular expression. This input consists of text and operators with which you specify character patterns to look for in string. This table shows the main categories of metacharacters and operators that you can use in expr.
Any text in the expression must be an exact match for at least part of the text in the parse string. Operators, on the other hand, are symbolic. Each operator symbol stands for a type of character: for example, an uppercase letter ([A-Z]), a whitespace character (\s), or four characters of any type (.{4}).
outselect
An optional comma-separated list of one to seven keywords. The presence of any of these keywords in the input argument list tells regexp to return the corresponding output.
You must supply one output variable for each outselect keyword you include as an input argument. The order of the keyword in the input argument list determines the order of the corresponding output in the output argument list. For a description of all return values, see Output Arguments.
options
regexp accepts one or more of the following options.
Command Options
The mode Option. You can fine-tune your regular expression parsing using the optional mode inputs: Case Sensitivity, Empty Match, Dot Matching, Anchor Type, and Spacing. There are two ways to use the regexp modes:
The mode option is available for the regexp, regexpi, and regexprep functions. For more information about regexp modes, see Modifying Parameters of the Search (Modes) in the MATLAB "Programming Fundamentals" documentation.
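The input arguments described above can be sketched in a short example. It assumes that the two ways of using a mode are the ones that appear in the examples on this page: passing an option string such as 'ignorecase', or embedding a flag such as (?i) in the expression. The sample text and variable names are hypothetical.

```matlab
% Hypothetical sample text for illustration.
str = 'Dial Code 4096 now';

% Operators stand for kinds of characters rather than literal text.
upperIdx = regexp(str, '[A-Z]');                  % uppercase letters: [1 6]
spaceIdx = regexp(str, '\s');                     % whitespace: [5 10 15]
fourAny  = regexp(str, '.{4}', 'match', 'once');  % first four characters: 'Dial'

% Mode as an option string (optional inputs may follow in any order).
viaOption = regexp(str, 'code', 'match', 'ignorecase');   % {'Code'}

% Mode as a flag embedded in the expression.
viaFlag = regexp(str, '(?i)code', 'match');               % {'Code'}
```

The option string applies to the whole expression, while an embedded flag can be switched on and off partway through a pattern, which is why both forms are useful.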
Each of the first seven outputs listed below is returned as a 1-by-m array, where m is the number of matches found by regexp.
When parsing multiple strings (i.e., a 1-by-n or n-by-1 cell array of strings), regexp returns a 1-by-n cell array for each output specified. When parsing an m-by-n cell array of strings, regexp returns an m-by-n cell array for each output specified.
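As a rough sketch of these shape rules, the example below parses a 1-by-2 and a 2-by-2 cell array of strings. The inputs and variable names are hypothetical; only the 1-by-n and m-by-n input cases are checked here.

```matlab
% 1-by-2 cell array of strings: one output cell per input string.
strs = {'Madrid, Spain', 'Romeo and Juliet'};
idx   = regexp(strs, '[A-Z]');         % idx{1} is [1 9], idx{2} is [1 11]
words = regexp(strs, '\w+', 'match');  % words{2} is {'Romeo','and','Juliet'}

% Number of matches found in each string.
counts = cellfun(@numel, idx);         % [2 2]

% 2-by-2 cell array of strings: the output is also 2-by-2.
grid = {'ab', 'cd'; 'ef', 'gh'};
hits = regexp(grid, '[a-d]');          % hits{1,1} is [1 2]; hits{2,1} is empty
```

Each cell of the result holds what regexp would have returned for that string alone, so a cell with no matches is simply empty.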
matchstart
The starting index of each substring of string that matches expression, expr. The output is an array of class double.
matchend
The ending index of each substring of string that matches expression, expr. The output is an array of class double. To obtain this return value, you must call regexp with at least two output variables, or specify the 'end' keyword as one of the outselect input arguments.
tokenindices
The starting and ending indices of each substring of string that matches a token in expr. The output is a cell array, each cell of which contains an array of class double. The size of this inner array is m-by-2, where m is the number of tokens captured by the match. Any cells that do not represent matched tokens are returned empty. (The output is a double when you call regexp with the 'once' option.) To obtain this return value, you must call regexp with at least three output variables, or specify the 'tokenExtents' keyword as one of the outselect input arguments.
matchstring
The text of each substring of string that is a match. The output is a cell array of strings (or a single string when you call regexp with the 'once' option). To obtain this return value, you must call regexp with at least four output variables, or specify the 'match' keyword as one of the outselect input arguments.
tokenstring
The text of each token captured by regexp. The output is a cell array of 1-by-n cell arrays, where n is the number of token expressions specified in the expr input. (The output is a cell array of strings when you call regexp with the 'once' option.) To obtain this return value, you must call regexp with at least five output variables, or specify the 'tokens' keyword as one of the outselect input arguments.
tokenname
The name and text of each named token captured by regexp. The output is an array of structures, each containing n fields, where n is the number of token expressions specified in the expr input. Field names of the returned structure are set to the token names, and field values are the text of those tokens. Named tokens are generated by the expression (?<tokenname>). If there are no named tokens in expr, regexp returns a structure array with no fields. To obtain this return value, you must call regexp with at least six output variables, or specify the 'names' keyword as one of the outselect input arguments.
splitstring
Those parts of the input string that are delimited by substrings matched by the expression. The output is a cell array of strings. Any cells that represent strings of zero length are returned empty. When using the 'split' keyword, regexp returns one more string than the number of matches in the string. To obtain this return value, you must call regexp with all seven output variables, or specify the 'split' keyword as one of the outselect input arguments.
selected_outputs
From one to seven of the output values listed above. Select which outputs to include using the outselect input arguments. Specify which outputs you want returned and in what order you want them returned using the keywords shown in the table. For example, to have regexp return the starting index and matching text, specify the keywords 'start' and 'match':
[index, matstr] = regexp('Jan 24','\d',...
'start','match')
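The token-related outputs are easier to follow on a small input. The sketch below uses a made-up string and a made-up pattern with two named tokens (name and value); all variable names are hypothetical.

```matlab
% Hypothetical input and pattern for illustration.
str  = 'x=10, y=25';
expr = '(?<name>\w)=(?<value>\d+)';

[tok, ext, names] = regexp(str, expr, 'tokens', 'tokenExtents', 'names');

secondTokens  = tok{2};          % {'y','25'}: one cell per token in the match
secondExtents = ext{2};          % [7 7; 9 10]: start and end index of each token
secondValue   = names(2).value;  % '25': field names come from the token names

% With 'once', only the first match is returned, as a string rather than a cell.
first = regexp(str, expr, 'match', 'once');   % 'x=10'

% 'split' returns one more piece than the number of delimiter matches.
parts = regexp(str, ',\s*', 'split');         % {'x=10','y=25'}
```

Numeric values come back as text, so a caller that needs numbers would still have to convert them, for example with str2double.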
Return a row vector of indices that match words that start with c, end with t, and contain one or more vowels between them. Make the matches sensitive to letter case (by using the case-sensitive regexp):
str = 'bat cat can car COAT court cut ct CAT-scan';
regexp(str, 'c[aeiou]+t')
ans =
5 28
Return a cell array of row vectors of indices that match capital letters and white spaces in the cell array of strings str:
str = {'Madrid, Spain' 'Romeo and Juliet' 'MATLAB is great'};
s1 = regexp(str, '[A-Z]');
s2 = regexp(str, '\s');
Capital letters, '[A-Z]', were found at these str indices:
s1{:}
ans =
1 9
ans =
1 11
ans =
1 2 3 4 5 6
Space characters, '\s', were found at these str indices:
s2{:}
ans =
8
ans =
6 10
ans =
7 10
Return the text and the starting and ending indices of words containing the letter x:
str = 'REGEXP helps you relax';
[m s e] = regexp(str, '\w*x\w*', 'match', 'start', 'end')
m =
'relax'
s =
18
e =
22
Find the substrings delimited by the ^ character:
s1 = ['Use REGEXP to split ^this string into ' ...
'several ^individual pieces'];
s2 = regexp(s1, '\^', 'split');
s2(:)
ans =
'Use REGEXP to split '
'this string into several '
'individual pieces'
Use the 'split' keyword to return those parts of the input string that are not returned when using 'match'. Note that when you match the beginning or ending characters in a string (as is done in this example), the first (or last) return value is always an empty string:
str = 'She sells sea shells by the seashore.';
[matchstr splitstr] = regexp(str, '[Ss]h.', 'match', ...
'split')
matchstr =
'She' 'she' 'sho'
splitstr =
'' ' sells sea ' 'lls by the sea' 're.'
For any string that has been split, you can reassemble the pieces into the initial string using the command:
j = [splitstr; [matchstr {''}]]; [j{:}]
ans =
She sells sea shells by the seashore.
Search a string for opening and closing HTML tags. Use the expression <(\w+) to find the opening tag (for example, '<tagname') and to create a token for it. Use the expression </\1> to find another occurrence of the same token, but formatted as a closing tag (for example, '</tagname>'):
str = ['if <code>A</code> == x<sup>2</sup>, ' ...
'<em>disp(x)</em>']
str =
if <code>A</code> == x<sup>2</sup>, <em>disp(x)</em>
expr = '<(\w+).*?>.*?</\1>';
[tok mat] = regexp(str, expr, 'tokens', 'match');
tok{:}
ans =
'code'
ans =
'sup'
ans =
'em'
mat{:}
ans =
<code>A</code>
ans =
<sup>2</sup>
ans =
<em>disp(x)</em>
For information on using tokens, see Tokens in the MATLAB Programming Fundamentals documentation.
Enter a string containing two names, the first and last names being in a different order:
str = sprintf('John Davis\nRogers, James')
str =
John Davis
Rogers, James
Create an expression that generates first and last name tokens, assigning the names first and last to the tokens. Call regexp to get the text and names of each token found:
expr = ...
'(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)';
[tokens names] = regexp(str, expr, 'tokens', 'names');
Examine the tokens cell array that was returned. The first and last name tokens appear in the order in which they were generated: first name–last name, then last name–first name:
tokens{:}
ans =
'John' 'Davis'
ans =
'Rogers' 'James'
Now examine the names structure that was returned. First and last names appear in a more usable order:
names(:,1)
ans =
first: 'John'
last: 'Davis'
names(:,2)
ans =
first: 'James'
last: 'Rogers'
Use Case Sensitivity mode with regexp.
Given a string that has both uppercase and lowercase letters, use the regexp default mode (case-sensitive) to locate only the lowercase instance of the word case:
str = 'A string with UPPERCASE and lowercase text.';
regexp(str, 'case', 'match')
ans =
'case'
Now disable case-sensitive matching to find both instances of case:
regexp(str, 'case', 'ignorecase', 'match')
ans =
'CASE' 'case'
Match 5 letters that are followed by 'CASE'. Use the (?-i) flag to turn off case insensitivity for the first match and (?i) to turn insensitivity on for the second:
M = regexp(str, {'(?-i)\w{5}(?=CASE)', ...
'(?i)\w{5}(?=CASE)'}, 'match');
M{:}
ans =
'UPPER'
ans =
'UPPER' 'lower'
Use Empty Match mode with regexprep to add the characters '=> ' to the beginning of each line in the following text:
text = {'Use the REGEXPREP function in '; ...
'Empty Match mode to insert a right '; ...
'arrow at the start of each line.'};
newtext = regexprep(text, '^', '=> ', 'emptymatch')
newtext =
'=> Use the REGEXPREP function in '
'=> Empty Match mode to insert a right '
'=> arrow at the start of each line.'
The ^ operator matches the beginning of each string. Because there are no other characters to match, and the emptymatch mode is enabled by the fourth input argument of the command, there is an empty match at the beginning of each string.
Search the string 'MATLAB' for zero or more of the characters M, A, and T. Do not include matches of zero-length. (The mode defaults to noemptymatch):
regexp('MATLAB', '[MAT]*', 'match')
ans =
'MAT' 'A'
Repeat this search, but this time specify emptymatch mode to include matches of zero-length.
regexp('MATLAB', '[MAT]*', 'emptymatch', 'match')
ans =
'MAT' '' 'A' '' ''
Now, use the + (one or more) operator instead of the * (zero or more) operator. Continue to use emptymatch mode:
regexp('MATLAB', '[MAT]+', 'emptymatch', 'match')
ans =
'MAT' 'A'
No match is found at the L and B characters this time because the + operator requires at least one occurrence of M, A, or T for a match to be found.
Use Dot Matching mode with regexp to parse the following string, which contains a newline (\n) character:
str = sprintf('abc\ndef')
str =
abc
def
When you use the default mode, dotall, MATLAB includes the newline in the characters matched:
regexp(str, '.', 'match')
ans =
'a' 'b' 'c' [1x1 char] 'd' 'e' 'f'
When you use the dotexceptnewline mode, MATLAB skips the newline character:
regexp(str, '.', 'match', 'dotexceptnewline')
ans =
'a' 'b' 'c' 'd' 'e' 'f'
Use Anchor Type mode with regexp to control whether you parse a multiline string line by line, or once for the entire string. Use the following two-line string:
str = sprintf('%s\n%s', 'Here is the first line', ...
'followed by the second line')
str =
Here is the first line
followed by the second line
In stringanchors mode, MATLAB interprets the $ metacharacter as an end-of-string specifier, and thus finds the last two words of the entire string:
regexp(str, '\w+\W\w+$', 'match', 'stringanchors')
ans =
'second line'
In lineanchors mode, MATLAB interprets $ as an end-of-line specifier, and finds the last two words of each line:
regexp(str, '\w+\W\w+$', 'match', 'lineanchors')
ans =
'first line' 'second line'
Use Spacing mode with regexp to read a regular expression from a file, retrieving all operators and metacharacters from the file, but skipping any comment text:
Create a file called regexp_str.txt containing the following text.
(?x) # turn on freespacing.
# This pattern matches a string with a repeated letter.
\w* # First, match any number of preceding word characters.
( # Mark a token.
. # Match a character of any type.
) # Finish capturing said token.
\1 # Backreference to match what token #1 matched.
\w* # Finally, match the remainder of the word.
Use the pattern expression read from the file to find those words that have consecutive matching letters. Because the first line enables freespacing mode, MATLAB ignores all spaces and comments that appear in the file:
str = ['Looking for words with letters that ' ...
'appear twice in succession.'];
patt = fileread('regexp_str.txt');
regexp(str, patt, 'match')
ans =
'Looking' 'letters' 'appear' 'succession'
Debug problems in parsing a string with regexp, regexpi, or regexprep, using the 'warnings' option to view all warning messages:
regexp('$.', '[a-]','warnings')
Warning: Unbound range.
[a-]
See also: regexpi, regexprep, regexptranslate, strcmp, strcmpi, strfind, strncmp, strncmpi

© 1984-2012 The MathWorks, Inc.