Using regexp to Search Large Text File for Wanted Data

Question

Zachary Parra on 5 Oct 2017

Edited: Cedric on 7 Oct 2017

Sample.txt

Hello all! I am attempting to utilize regexp to extract wanted items from text files. Attached below is a sample text file. I wish to extract the start date, EL HGT and Northing (Y)/Easting (X) for UTM (Zone 15). I will eventually have a series of 365 of these text files compressed into one file. Is regexp the best method and how would it be coded? Thanks.

Answer 1

Guillaume on 5 Oct 2017

is regexp the best method?

For EL HGT, probably. For your Northing/Easting for UTM (Zone 15), no, because extracting that value involves cross-referencing rows and columns. For that you would have to parse the whole file, and you would have to write your own parser, as none of MATLAB's built-in parsers (textscan, etc.) can parse a file that complex as is.

However, if UTM (Zone 15) is always the first column of numbers for Northing/Easting, then yes, you could use a regexp:

% Read the whole file into one character vector.
filecontent = fileread('sample.txt');
% Grab the rest of the EL HGT line, then pull every number out of it.
el_hgt_full = regexp(filecontent, '(?<=EL HGT:\s*)[^\n\r]*', 'match', 'once');
el_hgt = str2double(regexp(el_hgt_full, '[+-]?\d+(\.\d+)?', 'match'));
% First number after each label, i.e. the first column of values.
northing = str2double(regexp(filecontent, '(?<=Northing \(Y\) \[meters\]\s*)[+-]?\d+(\.\d+)?', 'match', 'once'));
easting = str2double(regexp(filecontent, '(?<=Easting \(X\)  \[meters\]\s*)[+-]?\d+(\.\d+)?', 'match', 'once'));
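
For the case of many reports concatenated into one file, a possible variation is to drop the 'once' option so that regexp returns every match instead of only the first. The sketch below is only an illustration: the text in filecontent is a made-up stand-in with an assumed layout and invented values, not the content of Sample.txt, and it takes only the first number after each label.

```matlab
% Hypothetical stand-in for a concatenated file: two blocks with a
% made-up layout and invented values (NOT the real Sample.txt content).
filecontent = sprintf(['EL HGT:   12.345(m)\n', ...
                       'Northing (Y) [meters]   100.5\n', ...
                       'EL HGT:   67.890(m)\n', ...
                       'Northing (Y) [meters]   200.25\n']);

% Same number pattern as above: optional sign, digits, optional fraction.
num = '[+-]?\d+(\.\d+)?';

% Without 'once', regexp returns a cell array with one match per block,
% which str2double converts to a numeric row vector.
el_hgt   = str2double(regexp(filecontent, ['(?<=EL HGT:\s*)' num], 'match'));
northing = str2double(regexp(filecontent, ['(?<=Northing \(Y\) \[meters\]\s*)' num], 'match'));
```

With real data, the labels and spacing in the look-behind would need to be checked against the actual file.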

6 Comments
Cedric on 7 Oct 2017

Edited: Cedric on 7 Oct 2017

Not far :)

In short:

abcdABC : literals, matched as is.
\s : matches one whitespace character (space, tab, line feed, carriage return, etc.).
\d : matches one numeric digit.
. : matches any one character (wildcard).
\. : matches the period character.
+ : matches whatever precedes it one or more times: \d+ matches one or more digits.
* : same, but zero or more times.
[ ] : matches one character in the set framed by []: [abd] matches a, b, or d (and not the string abd).
[^ ] : matches one character not in the set: [^ef] matches one character that is neither e nor f.
(?<= ) : look-behind: matches what follows the look-behind only if the look-behind expression matches just before it (the look-behind part is not returned): (?<=Hello )World matches World when it is preceded by |Hello |.
(?= ) : look-ahead: Hello(?=World) matches Hello when it is directly followed by |World|.
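
A few of these elements can be tried on short strings. The inputs below are made up purely for illustration and are not taken from Sample.txt:

```matlab
% Made-up strings for illustration only.
str = 'abd x=12.5, y=-3';

% [abd] matches ONE character from the set, so 'abd' gives three matches.
setMatches = regexp('abd', '[abd]', 'match');        % {'a','b','d'}

% \d+ matches runs of one or more digits; the period splits 12.5 in two.
digitRuns = regexp(str, '\d+', 'match');             % {'12','5','3'}

% Optional sign, digits, optional fractional part.
numbers = regexp(str, '[+-]?\d+(\.\d+)?', 'match');  % {'12.5','-3'}

% Look-behind: return World only when it is preceded by 'Hello '.
behind = regexp('Hello World', '(?<=Hello )World', 'match', 'once');  % 'World'

% Look-ahead: return Hello only when it is directly followed by World.
ahead = regexp('HelloWorld', 'Hello(?=World)', 'match', 'once');      % 'Hello'
```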

That's for matching. Using just this, you can define a pattern for matching your RMS values:

 >> regexp( filecontent, '(?<=OVERALL RMS:\s+)[\d\.]+', 'match' )
 ans =
  1×2 cell array
    {'0.017'}    {'0.017'}

However, sometimes you want to match something large and extract just specific parts, but the whole thing is too complex to define using look forward/behind. This is where tokens are useful. You define a token by framing what you want to extract from a match between parentheses (when you want to match parentheses you have to backslash them). For example:

Data X=(\d+),Y=(\d+)

matches the whole expression, but if you call REGEXP with the option 'tokens' (usually instead of 'match'), you get the tokens only (corresponding to the \d+). Now as you can see, you get multiple tokens (here two) per match, and there can be multiple matches (if the content has 10 of these entries, you will get 10 matches), so the output of REGEXP is a cell array (one cell per match) of cell arrays (of tokens), unless you tell it that there will be a single match by adding the option 'once', in which case it outputs a single cell array of tokens (flat structure).
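
As a small sketch of that nesting, here is the pattern applied to a made-up two-entry string (the input text and the variable names are assumptions for illustration):

```matlab
% Made-up input with two entries that follow the pattern in the text.
txt = sprintf('Data X=1,Y=2\nData X=30,Y=40');
pat = 'Data X=(\d+),Y=(\d+)';

% Without 'once': one cell per match, each holding a cell array of tokens.
tok = regexp(txt, pat, 'tokens');
secondMatch = tok{2};          % {'30','40'}
secondY     = tok{2}{2};       % '40'

% With 'once': a single flat cell array of tokens for the first match only.
first = regexp(txt, pat, 'tokens', 'once');   % {'1','2'}
```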

You can get your OVERALL RMS values using either matches or tokens. If you use tokens, you get a cell array (one cell per block) of cell arrays of tokens, with one token per match (as you extract a single value). This cannot be converted directly to numeric with STR2DOUBLE, which does not work on cell arrays of cell arrays, so it has to be "flattened" first:

 >> rms = regexp( filecontent, 'OVERALL RMS:\s+([\d\.]+)','tokens' )
 rms =
  1×2 cell array
    {1×1 cell}    {1×1 cell}
 >> rms = [rms{:}]           % Concatenate the contents of rms, expanded as a comma-separated list (CSL).
 rms =
   1×2 cell array
    {'0.017'}    {'0.017'}
 >> rms = str2double( rms )
 rms =
    0.0170    0.0170

The first approach, based on a simple match (the number, including the decimal point, preceded by the literal OVERALL RMS:), avoids having to flatten the output. The 'match' output of REGEXP is just a (flat) cell array of matches, and this can be passed directly to STR2DOUBLE:

 >> rms = str2double( regexp( filecontent, '(?<=OVERALL RMS:\s+)[\d\.]+', 'match' ))
 rms =
    0.0170    0.0170

Zachary Parra on 7 Oct 2017

Thank you for the comprehensive answer. This will be extremely helpful to look back to in the future.

Cedric on 7 Oct 2017

Edited: Cedric on 7 Oct 2017

My pleasure!

(Last edit @ 21:43 UTC)
