Can REGEXP map values from different parts of a text file?

Asked by Brad on 5 Jun 2013

I have a text file with the following contents:

MSNout_BER (0:31) Observation #100 Rx'd at:  (58568.000) Msg. Time: (58568.000)
    Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rel Mode: Active
MSNout_SSS (0:32) Observation #101 Rx'd at:  (58569.000) Msg. Time: (58569.000)
    Forward to IRU: true   Rcv Date: 2010121   Synch: a0a0   Bel Mode: High
Type: 12    Malck ID: 12345 Time Tag: 58548.12345678
Hand ID: 0  SV ID:   51 Spam ID: 0  BOZ/FAS: 0  Realt Flag: 0
MSNout_BER (0:33) Observation #102 Rx'd at:  (58570.000) Msg. Time: (58570.000)
    Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rel Mode: Active
MSNout_SSS (0:34) Observation #103 Rx'd at:  (58571.000) Msg. Time: (58571.000)
    Forward to IRU: true   Rcv Date: 2010121   Synch: a0a0   Bel Mode: High
Type: 1 Malck ID: 12345 Time Tag: 58549.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58550.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58551.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58552.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58553.12345678
Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58554.12345678
Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58555.12345678
Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58556.12345678
Hand ID: 1  SV ID:   3  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0

I’m using the following commands to retrieve the values for Time Tag: and SV ID: (only SV ID values 1 and 2; all others are ignored):

[fn,pn] = uigetfile('*.txt', 'Select Text File');
OAMfilename = fullfile(pn, fn);
buffer  = fileread(OAMfilename);
pattern = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W';
tokens = regexp(buffer, pattern, 'tokens');
data = reshape(str2double([tokens{:}]), 2, []).';

Results:

58548.1234567800	2
58550.1234567800	2
58551.1234567800	2
58552.1234567800	2
58553.1234567800	1
58554.1234567800	1
58555.1234567800	1

Initially, I thought the results were as expected. Then I noticed that the time tag for the first occurrence of SV ID equal to 2 was wrong: 58549.12345678 is the correct time tag.

Is it possible to force MATLAB to recognize each Time Tag value that occurs just prior to each SV ID value? Could a Lookaround operator be used in this case?

1 Answer

Answer by per isakson on 7 Jun 2013
Edited by per isakson on 10 Jun 2013
Accepted answer

This seems to work.

    buf = fileread( 'cssm.txt' );
    rex = '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)';
    cac = regexp( buf, rex, 'tokens' );
    cac{:}

returns

    ans = 
        '58548.12345678'    '51'
    ans = 
        '58549.12345678'    '2'
    ans = 
        '58550.12345678'    '2'
    ans = 
        '58551.12345678'    '2'
    ans = 
        '58552.12345678'    '2'
    ans = 
        '58553.12345678'    '1'
    ans = 
        '58554.12345678'    '1'
    ans = 
        '58555.12345678'    '1'
    ans = 
        '58556.12345678'    '3'

where cssm.txt contains your data

Comments on the regular expression:

  • The parentheses capture tokens.
  • Each token is the group of digits (with a decimal point in the first case) that follows an identifier and a space.
  • The "identifier and space" parts are written as expressions in look-behind operators,
  • which gives two groups of the form (?<=name)(value).
  • Between these two groups is .+?, a lazy quantifier: it matches one or more characters, but only as many as necessary for the rest of the expression to match.
  • The regular expression must match one contiguous sub-string, so something has to match the characters between the two groups. Here that is done by .+?.

Most of the italicized terms are copied from the online help.
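The difference between the lazy .+? and a greedy .+ can be seen on a small example. This is only a sketch: the sample string is made up in the style of the data above, and the second look-behind is replaced by a plain SV ID:[ ]+ to keep the pattern short.

```matlab
% Hypothetical one-line sample with two name-value records (not the real file)
str = ['Time Tag: 58549.12345678 Hand ID: 1  SV ID:   2  ', ...
       'Time Tag: 58550.12345678 Hand ID: 1  SV ID:   1'];

% Lazy .+? stops at the first SV ID that follows each Time Tag
lazy   = regexp(str, '(?<=Time Tag: )([\d\.]+).+?SV ID:[ ]+(\d+)', 'tokens');

% Greedy .+ runs on to the last SV ID in the string
greedy = regexp(str, '(?<=Time Tag: )([\d\.]+).+SV ID:[ ]+(\d+)', 'tokens');

fprintf('lazy:   %d matches\n', numel(lazy));
fprintf('greedy: %d match, pairing %s with SV ID %s\n', numel(greedy), greedy{1}{:});
```

With the lazy quantifier each time tag is paired with its own SV ID; with the greedy one the first time tag is paired with the last SV ID and the second record is swallowed by the match.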

BTW: Your pattern works - after a little fixing:

    rex = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([125]{1,2})\W';

but what is the purpose of the leading *? and the trailing \W ?

A bit more robust:

    rex = '(?<=Time Tag:)[ ]+([\d\.]+)[^\n]+?(?<=SV ID:)[ ]+(\d+)';

  • Replacing \s+ between name and value by [ ]+ excludes new-line, tab, etc.
  • Replacing .*? between the two name-value-pairs by [^\n]+? ensures that the two pairs are from the same line
  • IMO: "[ ]+" is more readable than " +"
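A quick way to check the same-line restriction is to run the pattern on two made-up samples, one with both name-value pairs on a single line and one with a line break between them. The sample strings and variable names below are assumptions for illustration, not the real file. Note that the listing in the question shows Time Tag and SV ID on separate lines, so this pattern presumably only applies when the two pairs really do share a line.

```matlab
% Hypothetical samples, not the real file
sameLine  = sprintf('Time Tag: 58549.12345678 Hand ID: 1  SV ID:   2\n');
splitLine = sprintf('Time Tag: 58549.12345678\nHand ID: 1  SV ID:   2\n');

rexAny  = '(?<=Time Tag:)[ ]+([\d\.]+).+?(?<=SV ID:)[ ]+(\d+)';      % gap may cross lines
rexLine = '(?<=Time Tag:)[ ]+([\d\.]+)[^\n]+?(?<=SV ID:)[ ]+(\d+)';  % gap stays on one line

a = regexp(sameLine,  rexLine, 'tokens');   % pair found on a single line
b = regexp(splitLine, rexLine, 'tokens');   % empty: a newline separates the pairs
c = regexp(splitLine, rexAny,  'tokens');   % found: . matches newline by default

fprintf('same line: %d, split + [^\\n]: %d, split + dot: %d\n', ...
        numel(a), numel(b), numel(c));
```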

9 Comments

per isakson on 13 Jun 2013

Quick comments:

  • I use R2012a
  • When copy&pasting the text to a text file, I have removed some line breaks
  • I don't understand why '([\d\.]+)\s+Hand.+?SV ID:\s+(\d+)' does not match the line with SV ID: followed by 51.
  • I prefer to use '(?<=Name)[ ]+(Value)' to read name-value pairs. I think it makes "better" code; it communicates intent better.
  • I have not read "Mastering Regular Expressions".
  • I think the expressions should be as selective as possible. Regular expressions often cause problems in my code; old code in combination with new text files produces unexpected results.

Does it help to replace the \W by \D?

Cedric Wannaz on 13 Jun 2013

Actually

 '([\d\.]+)\s+Hand.+?SV ID:\s+(\d+)'

does match SV ID 51.

What was wrong with your initial pattern is that the first match is the whole:

 Tag: 58548.12345678
 Hand ID: 0  SV ID:   51 Spam ID: 0  BOZ/FAS: 0  Realt Flag: 0
 MSNout_BER (0:33) Observation #102 Rx'd at:  (58570.000) Msg. Time: (58570.000)
 Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rel Mode: Active
 MSNout_SSS (0:34) Observation #103 Rx'd at:  (58571.000) Msg. Time: (58571.000)
 Forward to IRU: true   Rcv Date: 2010121   Synch: a0a0   Bel Mode: High
 Type: 1 Malck ID: 12345 Time Tag: 58549.12345678
 Hand ID: 1  SV ID:   2

(which gives time=58548.12345678 and SVID=2)
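This over-long first match can be reproduced on a shortened, made-up sample. The sketch below assumes the original pattern without its leading *? and requests the 'match' output as well, to show how far the lazy .*? has to stretch before [12] can succeed.

```matlab
% Shortened, hypothetical sample: first record has SV ID 51, second has SV ID 2
txt = sprintf(['Time Tag: 58548.12345678\nHand ID: 0  SV ID:   51 Spam ID: 0\n', ...
               'Time Tag: 58549.12345678\nHand ID: 1  SV ID:   2  Spam ID: 0\n']);

% Original pattern minus the leading *?
pattern = 'Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W';
[tok, mat] = regexp(txt, pattern, 'tokens', 'match');

disp(tok{1})   % time tag of the first record paired with the SV ID of the second
disp(mat{1})   % the match runs from the first Tag: through the second SV ID
```

Because 51 does not satisfy [12], the engine keeps extending .*? past the first record until it reaches an SV ID that does.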

If you want to select only those with SV IDs 1 and 2, you can use

 '([\d\.]+)\s+Hand[^B]+?SV ID:\s+([12])'

which works because there is no 'B' between the time tag and the SV ID (it appears only after the SV ID, in 'BOZ'). You could also use an expression that prevents another 'Time Tag' from appearing between the initial time tag and the SV ID, or limit the number of characters between the time tag and the SV ID (i.e. replace .+? with .{1,45}), but I think that [^B] is simpler. Of course, you could just stick to the expression that matches all entries and then filter out those with SV IDs not in {1,2} after conversion to numeric.
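Both routes can be compared side by side. The following is a minimal sketch on a shortened, made-up sample; the variable names are illustrative, and the conversion to numeric reuses the reshape/str2double line from the question.

```matlab
% Shortened, hypothetical sample in the style of the question's file
txt = sprintf(['Time Tag: 58548.12345678\nHand ID: 0  SV ID:   51 Spam ID: 0  BOZ/FAS: 0\n', ...
               'Time Tag: 58549.12345678\nHand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1\n']);

% Option 1: restrict the gap with [^B] so a match cannot run past 'BOZ'
tokB  = regexp(txt, '([\d\.]+)\s+Hand[^B]+?SV ID:\s+([12])', 'tokens');
dataB = reshape(str2double([tokB{:}]), 2, []).';

% Option 2: match every record, convert to numeric, then filter on SV ID
tokAll  = regexp(txt, '([\d\.]+)\s+Hand.+?SV ID:\s+(\d+)', 'tokens');
dataAll = reshape(str2double([tokAll{:}]), 2, []).';
dataSel = dataAll(ismember(dataAll(:,2), [1 2]), :);

disp(isequal(dataB, dataSel))   % both options keep the same single row here
```

Option 2 keeps the regular expression simple and moves the selection into ordinary numeric indexing, which may be easier to change later if other SV IDs become relevant.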

Brad on 17 Jun 2013

Per, Cedric, after re-installing MATLAB I'm getting the proper results. I tried both approaches provided by the 2 of you and they run like a champ. Thanks for the help on this.
