Using regexp to Search Large Text File for Wanted Data

Hello all! I am attempting to utilize regexp to extract wanted items from text files. Attached below is a sample text file. I wish to extract the start date, EL HGT and Northing (Y)/Easting (X) for UTM (Zone 15). I will eventually have a series of 365 of these text files compressed into one file. Is regexp the best method and how would it be coded? Thanks.

 Accepted Answer

is regexp the best method?
For EL HGT, probably. For your Northing/Easting for UTM (Zone 15), no, because extracting that value involves cross-referencing rows and columns. For that you would have to parse the whole file with a parser of your own, as none of MATLAB's built-in parsers (textscan, etc.) can parse a file that complex as is.
However, if UTM (Zone 15) is always the first column of numbers for Northing/Easting, then yes, you could use a regexp:
filecontent = fileread('sample.txt');
el_hgt_full = regexp(filecontent, '(?<=EL HGT:\s*)[^\n\r]*', 'match', 'once');
el_hgt = str2double(regexp(el_hgt_full, '[+-]?\d+(\.\d+)?', 'match'));
northing = str2double(regexp(filecontent, '(?<=Northing \(Y\) \[meters\]\s*)[+-]?\d+(\.\d+)?', 'match', 'once'));
easting = str2double(regexp(filecontent, '(?<=Easting \(X\) \[meters\]\s*)[+-]?\d+(\.\d+)?', 'match', 'once'));

6 Comments

Absolutely stellar work Guillaume. Could I also extract the start date utilizing regexp()?
Here is an alternative based on the same approach but with slightly simpler patterns IMHO. I applied it to a file (attached) where I duplicated your file content:
filecontent = fileread( 'SampleExt.txt' ) ;
el_hgt = regexp( filecontent, 'EL HGT:\s+([\d\.]+)\D+([\d\.]+)\D+([\d\.]+)\D+([\d\.]+)', 'tokens' ) ;
el_hgt = str2double( vertcat( el_hgt{:} )) ;
northing = str2double( regexp( filecontent, '(?<=Northing\D+)[\d\.\-]+', 'match' )).' ;
easting = str2double( regexp( filecontent, '(?<=Easting\D+)[\d\.\-]+', 'match' )).' ;
startDate = datevec( regexp( filecontent, '(?<=START:\s+)[^\r\n]+', 'match' ), 'YYYY/mm/dd HH:MM:SS' ) ;
Note that if some files (that compose the large one) don't have all the parameters defined, you may want to split the content in blocks first, e.g. on the keyword "NGS OPUS", and then iterate through blocks and parse.
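To illustrate the block-splitting idea, here is a minimal sketch. It assumes that every report starts with the keyword "NGS OPUS" and uses a short made-up string in place of the real file content (with a real file you would use fileread instead). The variable names blocks, rmsVal and startStr are only for illustration.

```matlab
% Made-up stand-in for the combined file; the real layout may differ.
filecontent = sprintf([ ...
    'NGS OPUS SOLUTION REPORT\n' ...
    'START: 2017/10/01 00:00:00\n' ...
    'OVERALL RMS: 0.017(m)\n' ...
    'NGS OPUS SOLUTION REPORT\n' ...
    'START: 2017/10/02 00:00:00\n']);   % second block has no OVERALL RMS line

% Split on the keyword (assumption: it opens every report).
blocks = regexp(filecontent, 'NGS OPUS', 'split');
blocks = blocks(2:end);                 % drop whatever precedes the first keyword

nBlocks  = numel(blocks);
rmsVal   = nan(nBlocks, 1);
startStr = cell(nBlocks, 1);

for k = 1:nBlocks
    % With 'once', a missing field gives an empty match, which str2double turns into NaN.
    startStr{k} = regexp(blocks{k}, '(?<=START:\s+)[^\r\n]+', 'match', 'once');
    rmsVal(k)   = str2double(regexp(blocks{k}, '(?<=OVERALL RMS:\s+)[\d\.]+', 'match', 'once'));
end
```

Parsing block by block keeps the outputs aligned: a block with a missing parameter gets a NaN (or an empty string) in its own row instead of shifting the values of the following blocks.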
Nice work! It also turns out I need to extract 'OVERALL RMS' as well. I tried to do so with this code:
rms = regexp(filecontent,'OVERALL RMS[\s:]+(\d+','tokens')
and got no return. What exactly am I doing wrong? It seems I cannot write regular expressions x|
Not far :)
In short:
  • abcdABC : literal, matched as is.
  • \s : match one white space (including line break, carriage return, tab, etc).
  • \d : match one numeric digit.
  • . : match one character (wildcard)
  • \. : match the period character.
  • + : match one or more times whatever precedes : \d+ matches one or more numeric digits.
  • * : same but 0 or more.
  • [ ] : match one character in the set framed by [] : [abd] matches a, b, or d (and not the string abd).
  • [^ ] : match one character not in the set : [^ef] matches one character that is neither e nor f.
  • (?<= ) : look-behind : match what follows the look-behind only where the look-behind pattern matches just before it (the look-behind text is not returned) : (?<=Hello )World matches World when it is preceded by |Hello |.
  • (?= ) : look-forward : Hello(?=World) matches Hello when it is followed by |World|.
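A few of these elements can be tried on a short string. This is only a sketch: the test string and the variable names are made up for illustration and have nothing to do with the OPUS file.

```matlab
% Made-up test string, only for trying out the pattern elements listed above.
str = 'Hello World, x=12.5 and y=7';

digitRuns = regexp(str, '\d+', 'match');                        % runs of digits only
numbers   = regexp(str, '[\d\.]+', 'match');                    % runs of digits and periods
firstWord = regexp(str, '[^\s,]+', 'match', 'once');            % first run without white space or comma
behind    = regexp(str, '(?<=Hello )World', 'match', 'once');   % World, only if preceded by 'Hello '
ahead     = regexp(str, 'Hello(?= World)', 'match', 'once');    % Hello, only if followed by ' World'
noMatch   = regexp(str, 'Hello(?=World)', 'match', 'once');     % empty: a space sits between the words
```

Comparing digitRuns and numbers shows why the set [\d\.] is needed for decimal values: \d+ alone stops at the decimal point and returns 12 and 5 as separate matches.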
That's for matching. Using just this, you can define a pattern for matching your RMS values:
>> regexp( filecontent, '(?<=OVERALL RMS:\s+)[\d\.]+', 'match' )
ans =
1×2 cell array
{'0.017'} {'0.017'}
However, sometimes you want to match something large and extract just specific parts, but the whole thing is too complex to define using look forward/behind. This is where tokens are useful. You define a token by framing what you want to extract from a match between parentheses (when you want to match parentheses you have to backslash them). For example:
Data X=(\d+),Y=(\d+)
matches the whole expression, but if you call REGEXP with the option 'tokens' (usually instead of 'match'), you get only the tokens (the parts corresponding to each \d+). As you can see, you get multiple tokens (here two) per match, and there can be multiple matches (if the content has 10 of these entries, you will get 10 matches). The output of REGEXP is therefore a cell array (one cell per match) of cell arrays (of tokens), unless you tell it to stop at the first match by adding the option 'once', in which case it outputs a single cell array of tokens (flat structure).
You can get your OVERALL RMS values using either matches or tokens. If you use tokens, you get a cell array (number of cells = number of blocks) of cell arrays of tokens, with one token per match (as you extract a single value). This cannot be converted directly to numeric with STR2DOUBLE, which does not work on cell arrays of cell arrays, so it has to be "flattened" first:
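The nesting is easier to see on a small example. The sketch below applies the pattern to a made-up string with two entries; the string and the variable names are assumptions for illustration only.

```matlab
% Made-up input with two entries of the form described above.
str = 'Data X=1,Y=2; Data X=30,Y=40';

% All matches: one cell per match, each holding a cell array of two tokens.
tokAll = regexp(str, 'Data X=(\d+),Y=(\d+)', 'tokens');

% First match only: a single flat cell array of tokens.
tokOnce = regexp(str, 'Data X=(\d+),Y=(\d+)', 'tokens', 'once');

% Stack the per-match token cells into rows, then convert to numbers.
xy = str2double(vertcat(tokAll{:}));    % one row per match, columns X and Y
```

Stacking with vertcat keeps one row per match, which is convenient when each match carries several tokens, as with the four EL HGT values earlier in the thread.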
>> rms = regexp( filecontent, 'OVERALL RMS:\s+([\d\.]+)','tokens' )
rms =
1×2 cell array
{1×1 cell} {1×1 cell}
>> rms = [rms{:}] % Concatenate the contents of rms, expanded as a comma-separated list (CSL).
rms =
1×2 cell array
{'0.017'} {'0.017'}
>> rms = str2double( rms )
rms =
0.0170 0.0170
The first approach, based on a simple match (the number, including the decimal point, preceded by the literal OVERALL RMS:), avoids having to flatten the output. The 'match' output of REGEXP is just a (flat) cell array of matches, and this can be passed directly to STR2DOUBLE:
>> rms = str2double( regexp( filecontent, '(?<=OVERALL RMS:\s+)[\d\.]+', 'match' ))
rms =
0.0170 0.0170
Thank you for the comprehensive answer. This will be extremely helpful to look back to in the future.
My pleasure!
(Last edit @ 21:43 UTC)


Asked: 5 Oct 2017
Edited: 7 Oct 2017
