Regexp to extract all characters in a varied string up to match.

Question

Marshall on 12 Nov 2014

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/162373-regexp-to-extract-all-characters-in-a-varied-string-up-to-match

Commented: Geoff Hayes on 13 Nov 2014

Hello userbase,

I'm new to regexes. I'm working with some transistor test data and trying to extract information from .csv file names for sorting prior to further probing.

They often have a format such as this:

target = Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv
target = Some Other Test [123456_LS (further info including dates and temperatures)].csv

I want to extract the entire string up to and including the HS/LS variant, with the optional number that follows it, as this represents the device and test. The further info relates to parameters.

The Some Test Performed section can be a single word or multiple words and may contain special characters (&-_).

I'm looking for HS, LS, HS1, HS2, HS3, LS1, LS2, LS3.

I've tried lookbehind assertions, but it feels kludgy and I've guessed a bit:

pattern = '(?<=((HS)|(HS)\d|(LS)|(LS)\d))\s'

How can I improve this?

What does the ? normally do? (I see that here it is a special case for the lookaround.)

My desired regexp(target, pattern, 'match') output would be:

match = Some Test Performed [12345678987_HS1
match = Some Other Test [123456_LS

Or at least the index of the final character so I could use target(1:match) to extract my string. Is there some useful 'from start of target until match' metacharacter?

Best regards and thanks for reading, Marshall

Answer 1

Geoff Hayes on 12 Nov 2014

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/162373-regexp-to-extract-all-characters-in-a-varied-string-up-to-match#answer_158655

Marshall - if all of your target strings (the csv filenames) contain an opening parenthesis ( and you want all the characters before it, then you could use a strfind call to get the index of that parenthesis and copy all characters up to that index. Something like

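 target = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
 idx    = strfind(target, '(');   % indices of every '(' in the file name
 match  = '';                     % stays empty if there is no '('
 if ~isempty(idx)
     % keep everything before the first '(' and drop the trailing space
     match = strtrim(target(1:idx(1)-1));
 end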

which would return

 match =
    Some Test Performed [12345678987_HS1

However, if the opening parenthesis rule does not hold for all cases, then you could try simplifying your pattern to

pattern = '.+[HL]S[\d\s]';

where

.+ matches one or more characters of any kind, including whitespace (the dot matches any single character and the plus sign means one or more);

[HL]S means a single character match on either H or L followed by an S; and

[\d\s] means match on either a single numeric character or any whitespace character.

So with your two target strings above, using this pattern we would see

 target1 = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
 target2 = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
 pattern = '.+[HL]S[\d\s]';
 match1 = regexp(target1,pattern,'match');
 match2 = regexp(target2,pattern,'match');

with

 match1 = 
    'Some Test Performed [12345678987_HS1'
 match2 = 
    'Some Other Test [123456_LS '
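The quotes in the output above appear because regexp with the 'match' option returns a cell array. As a minimal sketch, assuming you want the text itself rather than a cell, the 'once' option returns the first match as a char vector. The variable names asCell and asChar are only illustrative.

```matlab
% File name copied from the question.
target2 = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
pattern = '.+[HL]S[\d\s]';

% Without 'once', the result is a 1-by-1 cell array holding the match.
asCell = regexp(target2, pattern, 'match');

% With 'once', the result is the matched text itself (a char vector).
asChar = regexp(target2, pattern, 'match', 'once');

% The last matched character is the space consumed by [\d\s].
endsWithSpace = isspace(asChar(end));
```

Here asCell{1} and asChar hold the same text, and endsWithSpace is true because the space after LS is part of the match.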

A problem with the above pattern may occur when additional HS or LS substrings follow the first pattern match. Because .+ is greedy, the match runs to the last place in the string where [HL]S[\d\s] fits. For example, if your target is

 target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';
 match3 = regexp(target3,pattern,'match')

then the match is found to be

 match3 = 
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS '
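To see why the match overshoots, it may help to compare the greedy quantifier with its lazy form. The sketch below is only an illustration; the names greedyMatch and lazyMatch are hypothetical. It suggests that switching to the lazy .+? does not fix the problem for this file name, since the match then stops at the first HS instead of the last.

```matlab
target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';

% Greedy: .+ takes as much as it can, so the match ends at the LAST
% position where [HL]S[\d\s] still fits.
greedyMatch = regexp(target3, '.+[HL]S[\d\s]', 'match', 'once');

% Lazy: .+? takes as little as it can, so the match ends at the FIRST
% position where [HL]S[\d\s] fits.
lazyMatch = regexp(target3, '.+?[HL]S[\d\s]', 'match', 'once');
```

Neither result is the device string, which is why the pattern itself needs to be more specific about what precedes HS or LS.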

So you may want to narrow the pattern so that a numeric string followed by an underscore must precede the HS or LS part

 newPattern = '.+\d+_[HL]S[\d\s]'; 
 match3     = regexp(target3,newPattern,'match')

which returns the desired

 match3 = 
    'Some HS Test Performed [12345678987_HS1'

This new pattern will work for the other two targets as well.

Note that the second match has a trailing whitespace character. You may want to wrap your regexp call in strtrim to remove it.
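As a possible alternative to trimming, the final [\d\s] could be swapped for \d?, where ? has its ordinary meaning of "zero or one of the preceding element". This is a sketch rather than a tested recommendation: trimFreePattern is a hypothetical name, and the pattern assumes the device ID always has the form digits, underscore, HS or LS, optional digit.

```matlab
% The three example file names from this thread.
targets = { ...
    'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv', ...
    'Some Other Test [123456_LS (further info including dates and temperatures)].csv', ...
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv'};

% \d? matches the optional digit after HS/LS and never consumes a space.
trimFreePattern = '.+\d+_[HL]S\d?';

% regexp accepts a cell array of char vectors; with 'once', each output
% cell holds the matched text for the corresponding file name.
devices = regexp(targets, trimFreePattern, 'match', 'once');
```

Note that this variant does not check what follows HS or LS, so a name containing something like 123_LSB would also match up to the LS.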

2 Comments

Marshall on 13 Nov 2014

Hi, that's a great and thorough answer. Thanks for taking the time to explain the metacharacters too and to guess that the bracket after HS/LS isn't the standard case (it isn't).

And if I exclude the 'match' option, is the reason regexp returns [1] that the match begins at the start of the string?

strtrim is a good suggestion too. Thanks again :)

Geoff Hayes on 13 Nov 2014

Glad to be able to help, Marshall. And yes, when you remove the 'match' option, regexp returns the start index of each match, and here the match begins at the first character of the string, so you get [1].
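For the index-based approach mentioned in the question, regexp can also return the end index of the match as a second output. A minimal sketch, using the narrowed pattern from the answer; startIdx, endIdx, and device are illustrative names:

```matlab
target1 = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
newPattern = '.+\d+_[HL]S[\d\s]';

% With no option, the first two outputs are the start and end indices
% of each match in the string.
[startIdx, endIdx] = regexp(target1, newPattern);

% Index into the original string to recover the matched text.
device = strtrim(target1(startIdx:endIdx));
```

With a single match, as here, startIdx and endIdx are scalars. If the pattern could match more than once, they would be vectors and the indexing line would need to pick one element.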
