Regexp to extract all characters in a varied string up to match.

Hello userbase,
I'm new to regexes. I'm working with some transistor test data and trying to extract information from .csv file names for sorting prior to further probing.
They often have a format such as this:
target = Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv
target = Some Other Test [123456_LS (further info including dates and temperatures)].csv
I want to extract the entire string up to and including the HS/LS variant, with the optional digit that follows it, as this identifies the device and test. The further info relates to parameters.
The Some Test Performed section can be a single word or multiple words, and can contain special characters (&-_).
I'm looking for HS, LS, HS1, HS2, HS3, LS1, LS2, LS3.
I've tried lookbehind assertions, but it feels kludgy and I've guessed a bit:
pattern = '(?<=((HS)|(HS)\d|(LS)|(LS)\d))\s'
How can I improve this?
What does the ? normally do? (I see that here is a special case for the lookaround.)
My desired regexp(target, pattern, 'match') output would be:
match = Some Test Performed [12345678987_HS1
match = Some Other Test [123456_LS
Or at least the index of the final character, so I could use target(1:match) to extract my string. Is there some useful 'from start of target until match' metacharacter?
Best regards and thanks for reading, Marshall

Accepted Answer

Geoff Hayes on 12 Nov 2014
Marshall - if all of your target strings (the csv filenames) have an open bracket ( in them, and you want all the characters before that, then you could use a strfind call to get the index of the open bracket, and then copy all characters up to that index. Something like
target = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
idx = strfind(target, '(');
if ~isempty(idx)
match = strtrim(target(1:idx(1)-1));   % idx(1) is the first '(' if there are several
end
which would return
match =
Some Test Performed [12345678987_HS1
However, if the open bracket rule is not valid for all cases, then you could try simplifying your pattern to
pattern = '.+[HL]S[\d\s]';
where
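.+ means match one or more of any character, including whitespace (the dot matches any single character and the plus sign means one or more); the match is greedy, so it extends as far as the rest of the pattern allows;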
[HL]S means a single character match on either H or L followed by an S; and
[\d\s] means match on either a single numeric character or any whitespace character.
So with your two target strings above, using this pattern we would see
target1 = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
target2 = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
pattern = '.+[HL]S[\d\s]';
match1 = regexp(target1,pattern,'match');
match2 = regexp(target2,pattern,'match');
with
match1 =
'Some Test Performed [12345678987_HS1'
match2 =
'Some Other Test [123456_LS '
A problem with the above pattern may occur when there are additional HS or LS characters that follow the first pattern match. For example, if your target is
target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';
match3 = regexp(target3,pattern,'match')
then the matched string is found to be
match3 =
'Some HS Test Performed [12345678987_HS1 (further info including dates and HS '
So you may want to narrow the pattern so that a numeric string followed by an underscore must precede the HS or LS
newPattern = '.+\d+_[HL]S[\d\s]';
match3 = regexp(target3,newPattern,'match')
which returns the desired
match3 =
'Some HS Test Performed [12345678987_HS1'
This new pattern will work for the other two targets as well.
Note that for the second match, we have a trailing whitespace character. You may want to wrap your regexp with a strtrim to remove it.
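As for the ? in the original question: outside the lookaround syntax, ? normally makes the preceding element optional (zero or one occurrence), and placed after another quantifier (as in .*?) it makes that quantifier lazy. A minimal sketch using both, assuming the device ID always ends in a digit followed by _HS or _LS and at most one trailing digit (the variable names targets and matches are only illustrative):

```matlab
% Sketch only: assumes every name contains <digit>_HS or <digit>_LS,
% optionally followed by a single digit.
targets = { ...
    'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv', ...
    'Some Other Test [123456_LS (further info including dates and temperatures)].csv', ...
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv'};

% ^      anchor at the start of the name
% .*?    lazy: take as few characters as possible
% \d_    a digit followed by an underscore
% [HL]S  HS or LS
% \d?    an optional single digit (? = zero or one of the preceding element)
pattern = '^.*?\d_[HL]S\d?';

% With a cell array input and 'once', regexp returns one char vector per name.
matches = regexp(targets, pattern, 'match', 'once');
```

Because the digit is optional rather than being swapped for whitespace, the LS case should come back without a trailing space, so strtrim is not needed here. A name that does not contain the pattern yields an empty match.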
  2 Comments
Marshall on 13 Nov 2014
Hi, that's a great and thorough answer. Thanks for taking the time to explain the metacharacters too and to guess that the bracket after HS/LS isn't the standard case (it isn't)
And if I leave out the 'match' option, does regexp return [1] because the match begins at the start of the string?
strtrim is a good suggestion too. Thanks again :)
Geoff Hayes on 13 Nov 2014
Glad to be able to help, Marshall. And yes, the [1] is returned when you remove the 'match' option because [1] is the start index of the pattern.
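For the "index of the final character" idea from the question, regexp can also report where a match ends. The sketch below shows the first two default outputs (start and end indices) for the pattern from the answer, and then uses the 'end' option with a shorter, illustrative pattern; the names lastIdx and device are assumptions for the example only:

```matlab
target = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';

% Without 'match', the first two outputs are the start and end indices
% of each match.
[startIdx, endIdx] = regexp(target, '.+\d+_[HL]S[\d\s]');

% Alternative sketch: ask only for the end index of the first
% <digit>_HS / <digit>_LS (plus an optional digit), then slice the name.
lastIdx = regexp(target, '\d_[HL]S\d?', 'end', 'once');
if isempty(lastIdx)
    device = '';                  % pattern not found in this name
else
    device = target(1:lastIdx);   % everything up to and including HS/LS
end
```

Slicing with target(1:lastIdx) uses parentheses because target is a character array; curly braces would only apply if the names were stored in a cell array.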
