Australian Bureau of Meteorology (BOM): Extracting Data from historic Local Waters Forecast

Question

James McCarthy-Price on 2 Sep 2015

Edited: Cedric on 14 Sep 2015

Dear Matlab Forums,

As part of my thesis I'm investigating the accuracy of BOM's forecast data. I'm trying to see how accurate their maximum predictions are, to determine whether evasive action may need to be taken for fixed marine structures. I have the historic wavebuoy data, but want to know if BOM's forecasts have been accurate.

BOM is absolutely fantastic at providing data in erratic text format and seems not to care much about useable CSV formats - it's killing me.

What I want from the data:

The date
The most far reaching swell and seas size forecasts

The data comes in the following formats:

*11:30 28/05/2008* 100398833 CGWEB=11:30
*AIFS_ID=11400* CGWEB
IDW11400 
Australian Government Bureau of Meteorology 
Western Australia 
Local Waters Forecast 
Yanchep to Mandurah and Offshore to Rottnest Island 
Issued at 11:30 am WST on Wednesday 28 May 2008 
Valid until midnight Friday 
Please Be Aware 
Wind gusts can be 40 percent stronger than the averages given here, and maximum 
wave may be up to twice the height. 
Warnings 
Nil. 
Synoptic Situation 
A moderate cold front is currently passing over Perth. Fresh W/SW winds behind 
front will ease during the afternoon and evening. 
Forecasts: 
Wednesday until midnight: W/SW winds 18/23 knots easing to 13/18 knots during 
the afternoon and becoming SW'ly 10/15 knots in the evening. Seas 1.5m to 2.0m. 
Swell rising to 3.0m. 
Swell at Cottesloe:  rising to 1.0m. 
Winds on Melville Water: similar. 
Thursday: S/SW winds 8/13 knots tending S/SE 5/10 knots in the evening. Seas to 
1.0m. Swell to 3.0m  easing later. 
Friday: E/NE winds 8/13 knots tending N/NE 10/15 knots in the evening and 
increasing to N/NE 15/20 knots towards midnight. 
Current Swell Observations: 
Rottnest Waverider Buoy:  1.7m 
Cottesloe Waverider Buoy: 0.6m 
Current swell height information is supplied by the Department for Planning and 
Infrastructure and is current only at the time of issue of this forecast 
The next routine forecast will be issued at 4:30 pm WST Wednesday.
*16:30 28/05/2008* 100444518 CRAFA=16:30 CGFCS=16:30
XCH10SYD-0296501221=16:31 ENOVAFM=16:30 CGIDF=16:30
*AIFS_ID=11400* CRAFA CGFCS XCH10SYD ENOVAFM PROD CGIDF
IDW11400 
Australian Government Bureau of Meteorology 
Western Australia 
Local Waters Forecast 
Yanchep to Mandurah and Offshore to Rottnest Island 
Issued at 4:30 pm WST on Wednesday 28 May 2008 
Valid until midnight Saturday 
Please Be Aware 
Wind gusts can be 40 percent stronger than the averages given here, and maximum 
wave may be up to twice the height. 
Warnings 
Nil. 
Synoptic Situation 
A moderate cold front is currently passing over Perth. Fresh W/SW winds behind 
front will ease during the evening. 
Forecasts: 
Wednesday until midnight: W/SW winds 15/20 knots easing to SW'ly 10/15 knots 
during the evening. Seas 1.0m to 1.5m. Swell to 2.5m to 3.5m. 
Swell at Cottesloe:  to 1.0m. 
Winds on Melville Water: similar. 
Thursday: S'ly winds 8/13 knots tending SE'ly 8/13 knots in the evening. Inshore 
winds tending E/SE 5/10 knots for a period early to mid morning. Seas to 1.0m. 
Swell 2.5m to 3.0m 
Swell at Cottesloe:  to 1.0m. 
Winds on Melville Water: will be similar. 
Friday: E'ly winds 8/13 knots tending NE'ly 10/15 knots towards midnight. Seas 
to 1.0m. Swell to 2.0m, easing. 
Saturday: N'ly winds 13/18 knots increasing to NW'ly 20/25 knots during the 
morning. 
Current Swell Observations: 
Rottnest Waverider Buoy:  2.2m 
Cottesloe Waverider Buoy: 0.8m 
Current swell height information is supplied by the Department for Planning and 
Infrastructure and is current only at the time of issue of this forecast 
The next routine forecast will be issued at 11:30 pm WST Wednesday.

I've noticed the XCH10SYD code is used whenever a swell & seas forecast is produced, which is a great identifier of the information I'm looking for. Therefore I'm trying to find a way of getting my program to search through the 96,000 lines of ".txt" for the "XCH10SYD" classifier. When it's found, the program should save the relevant time and date (listed a few lines above), then save the furthest forecast's date (in this example it's Friday) and the associated maximum swell and seas figures.

Things to note:

Sometimes seas/swell are listed as "Seas 1.0m", other times as "Seas to 1.0m", and sometimes as "Seas 1.0m to 2.0m". In the latter case, I'm only interested in the maximum value.
When a particularly long forecast is produced, the number of lines of text changes, i.e. the code can't be hard-coded to extract data from a particular spot, but has to be flexible enough to actively search for the numerical data.
The HH:MM DD/MM/YYYY and AIFS identifiers seem to be consistent in their location and format (perhaps the only consistent aspect of the txt file).

If anyone can even provide advice on where to start, it would be much appreciated. I'm really not a pro in this field, but keen to learn. This is just way beyond my current skillset. Thank you!


Answer 1

Cedric on 2 Sep 2015


Edited: Cedric on 7 Sep 2015

This is a good candidate for using regular expressions and pattern matching. It would take an hour of your time reading section 2-23 of MATLAB Programming Fundamentals (available here) to get started with regular expressions (as well as quite a bit of experimenting). I develop an example below, which is probably not exactly what you need, because your sample is too short for me to experiment, but I can help you refine the approach.

Assuming that all relevant blocks start with a date/time with the following structure

*11:30 28/05/2008*

we first split the file content in blocks, using this structure as a separator:

 % - Read file content in one shot.
 content = fileread( 'data.txt' ) ;
 % - Split time/data blocks.
 pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
 [timeBlocks, dataBlocks] = regexp( content, pattern, 'match', 'split' ) ;
 dataBlocks(1) = [] ;

It outputs a cell array of time stamps and a matching cell array of block data:

 >> timeBlocks
 timeBlocks= 
    '*11:30 28/05/2008*'    '*16:30 28/05/2008*'
 >> dataBlocks
 dataBlocks = 
    [1x1459 char]    [1x1458 char]

Then we convert date/time data into whatever you need, using e.g. DATEVEC or DATENUM:

dateTime = datevec( timeBlocks, '*HH:MM dd/mm/yyyy' ) ;

which outputs

 >> dateTime 
 dateTime =
        2008           5          28          11          30           0
        2008           5          28          16          30           0

Finally, we iterate through data blocks and extract data

 nBlocks = numel( dataBlocks ) ;
 data    = cell( nBlocks, 1 ) ;
 for bId = 1 : nBlocks
    % - Extract days and distances.
    pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)' ;
    tokens  = regexp( dataBlocks{bId}, pattern, 'tokens' ) ;
    tokens  = vertcat( tokens{:} ) ;
    % - Convert distances to double, and cell array to struct array.
    if ~isempty( tokens )
        tokens(:,2) = num2cell( str2double( tokens(:,2) )) ;
        data{bId} = cell2struct( tokens, {'day', 'distance'}, 2 ) ;
    end
 end

With that, we get:

 >> data
 data = 
    [2x1 struct]
    [2x1 struct]
 >> data{1}
 ans = 
    2x1 struct array with fields:
    day
    distance
 >> data{1}(1)
 ans = 
         day: 'Wednesday'
    distance: 2
 >> data{1}(2)
 ans = 
         day: 'Thursday'
    distance: 1

This illustrates one way to do it. Pattern matching can be improved, and you will want to modify the structure of the output for it to fit with your needs.

Let me know if you have any question.

PS: the best that you can do to understand is to run the code step by step using the debugger, and see what happens each time a line is executed, e.g. when we get tokens it is a cell array of cell arrays, then we VERTCAT its content to transform it into a simple/flat cell array, then we convert to double its second column, etc.

To use the debugger, set a break point by clicking on the dash at the right of the line number, execute the code (a green arrow will appear, indicating the next line to execute), and click on the Step button. At each step, you can use the command window/workspace/editor (mouse over) to see the state/content of variables.

EDIT: just a few extra explanations about patterns.

The first pattern

pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*'

is fairly easy to understand:

\* means: the character *; it has to be escaped because the star otherwise has a special meaning: it is a quantifier that means "zero or more times the expression that precedes".
\d means: any digit 0-9
\d{4} means: four times \d
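
To see these three elements at work, here is a minimal sketch that applies the pattern to one header line copied from the sample in the question. The variable names `header`, `stamp` and `bare` are illustrative only.

```matlab
% One header line from the sample data in the question.
header  = '*16:30 28/05/2008* 100444518 CRAFA=16:30 CGFCS=16:30' ;

% Escaped stars: the delimiters are part of the match.
pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
stamp   = regexp( header, pattern, 'match', 'once' ) ;

% Same digits without the stars: the delimiters are left out.
bare    = regexp( header, '\d\d:\d\d \d\d/\d\d/\d{4}', 'match', 'once' ) ;

disp( stamp ) ;
disp( bare ) ;
```

The later `16:30` fragments in the header are not matched because the pattern requires a date right after the time.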

The second pattern

pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)' ;

is more complex. REGEXP matches the whole pattern but extracts only the parts in parentheses (called tokens).

|[\r

5 Comments

James McCarthy-Price on 4 Sep 2015

Local Waters Forecast_2008.txt

Dear Cedric,

Firstly - wow! Thank you so much for your incredibly detailed response, it has helped immensely.

I've progressed a little further and discovered a few issues:

I also need to extract the Swell data, and combine the maximum swell and seas predictions to get the total wave height using this formula: Combined sea and swell height = [(Wind Wave Height)^2 + (Swell Wave Height)^2]^(1/2). This combined height and the date are the only outputs I'm looking to get. Furthermore, the data I'm looking for only corresponds to the longest ~40 hour seas/swell forecasts that the "XCH10SYD" product produces. If you scan through the attached text file you'll see what I mean. I've tried to do so building on your code but I'm having trouble with it. See code below.
Sometimes the forecast predictions are listed as "Swell 2.5m", "Swell to 2.5m", "Swell to 2.5m, easing to 1.5m.", "Swell 1.5m rising to 2.5m", "Swell to 1.5m, rising to 2.5m.". There are a large number of these variations, and I'm only interested in the maximum figures, which I can use to get the maximum predicted wave heights.

Please find my modified code below:

%%%%%%%%Iterate through data blocks: %%%%%%%%
nBlocks = numel( dataBlocks ) ;
 data    = cell( nBlocks, 1 ) ;
    nBlocks = numel( dataBlocks ) ;
   data    = cell( nBlocks, 1 ) ;
   for bId = 1 : nBlocks
      % - Extract days & Swell and seas heights:
      %pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)\s+?Swell.*?to\s+([^m]+)' ;
      pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+).*?Swell.*?\s(\d+\.[0|5]+).*?\s(\d+\.[0|5]+).*?';
      tokens  = regexp( dataBlocks{bId}, pattern, 'tokens' ) ;
      tokens  = vertcat( tokens{:} ) ;
      disp(tokens);
      % - Convert distances to double, and cell array to struct array.
      if ~isempty( tokens )
          disp(tokens{:,2});
          tokens{:,2} =  str2double( tokens{:,2} ) ;
          tokens{:,3} =  str2double( tokens{:,3} ) ;
          tokens{:,4} =  str2double( tokens{:,4} ) ;
          disp(tokens{:,4});
          data{bId} = struct('day',tokens{:,1},'distance', tokens{:,2}, 'swell', tokens{:,3}, 'swell2', tokens{:,4});
         % data{bId} = cell2struct( tokens, {'day', 'distance', 'swell', 'swell2'}, 4) ;
          %data{bId} = cell2struct( tokens, {'day', 'distance'},2);
      end
   end

I've left some commented code in to show you where I've really hit a wall. Firstly, I've updated the pattern to try and hunt out both seas and swell. However, I cannot for the life of me get it to accurately scrape the forecasts data, or convert using the cell2struct. I've sought help from my professors here and we've still had a few problems with it.

I've uploaded a single "Local Waters Forecast_2008.txt" as an example to give you a better idea of the data set.

I hope that makes a bit better sense.

Note upload is purely for educational purposes, and shall not be used for commercial use *
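
For reference, the combined-height formula quoted in this comment can be sketched in a couple of lines. The numbers below are illustrative values, not results extracted from the file, and the variable names are assumptions.

```matlab
% Illustrative seas (wind wave) and swell heights in metres.
seas  = [1.0; 1.5; 0.5] ;
swell = [3.0; 2.5; 2.5] ;

% Combined sea and swell height = sqrt( seas^2 + swell^2 ), element-wise.
combined = sqrt( seas.^2 + swell.^2 ) ;

% hypot computes the same quantity element-wise.
combinedAlt = hypot( seas, swell ) ;

disp( [seas, swell, combined] ) ;
```

Working on column vectors means the same line applies to a whole year of forecasts once the seas and swell values have been extracted.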

Cedric on 6 Sep 2015

Edited: Cedric on 7 Sep 2015

EDIT 20150906: updated 2nd pattern for managing Swell entries like

Swell to 3.0m, rising to 4.0m later.

------------------------------------------------------------------------

Your attempts are good actually, but matching all possible cases in one call is tricky when cases differ too much. Also, regexp is very efficient speed-wise with small patterns, but the speed deteriorates quickly when patterns get large and complex. If you are treating a single string it's fine if it takes 1 or 2 seconds, but not when you are treating thousands of blocks. What we usually do in these cases is to implement a tiered approach that involves multiple calls to regexp with small patterns.

If we look at a few cases, we see situations like

 Friday: NE/NW winds 8/13 knots tending W/SW 8/13 knots during the afternoon. 
 Seas to 1.0m. Swell 2.5m easing to 1.5m later. 
 Swell at Cottesloe:  to 1.0m. 
 Winds on Melville Water: Similar.

where there is one 'Seas' item, but two 'Swell' items with different formats. Sometimes, there is just one 'Swell' item as in:

 Saturday: Variable winds to 10 knots tending S/SW 5/10 knots in the afternoon. 
 Seas to 0.5m. Swell to 2.5m. 
 Sunday: SE/SW winds 8/13 knots.

There are also current swell observations as in:

 Current Swell Observations: 
 Rottnest Waverider Buoy:  2.4m 
 Cottesloe Waverider Buoy: 0.8m

And sometimes it seems that a last partial day is aggregated with a last full day and with almost no information:

 Saturday: Variable winds to 10 knots tending S/SW 5/10 knots in the afternoon. 
 Seas to 0.5m. Swell to 2.5m. 
 Sunday: SE/SW winds 8/13 knots.

Actually, it looks like sometimes there is no space between the two days as above, but sometimes there is one as below:

 Monday: N/NE winds 13/18 knots, increasing to N/NW 18/23 knots towards midnight. 
 Seas 1.5m to 2.0m. Swell to 1.0m. 
 Tuesday: N/NW winds 25/33 knots shifting W/SW 25/33 knots early afternoon.

At this stage I assume that you need to extract max 'Seas' and max 'Swell', but not the current observation and not the 'Swell at Cottesloe'. Under this assumption, it makes sense to go on the way we did, but if you need to match more (e.g. the other two/three Swell data), we will add a tier and e.g. split on blocks and days, and then extract not from complex blocks, but from simpler days.

I updated your code so it matches days, max 'Seas' and max 'Swell'. Execute step by step and see if it works for you. I also added a test for the keyword 'XCH10SYD': if not found, it loops back (and skips the current block).

 content = fileread( 'Local Waters Forecast_2008.txt' ) ;
 pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
 [timeBlocks, dataBlocks] = regexp( content, pattern, 'match', 'split' ) ;
 dataBlocks(1) = [] ;
 dateTime = datevec( timeBlocks, '*HH:MM dd/mm/yyyy' ) ;
 nBlocks = numel( dataBlocks ) ;
 data    = cell( nBlocks, 1 ) ;
 dId     = 0 ;   % Data ID
 for bId = 1 : nBlocks
    % - Skip block if 'XCH10SYD' not found. Increment data ID and parse
    %  content otherwise.
    if isempty( regexp( dataBlocks{bId}, 'XCH10SYD', 'once' ))
        continue
    end
    dId = dId + 1 ;
    % - Extract days, seas distances, and swell.
    pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)m.*?Swell\s*(to )?([\d\.]+)?.*?to\s+([\d\.]+)' ;
    tokens  = regexp( dataBlocks{bId}, pattern, 'tokens' ) ;
    tokens  = vertcat( tokens{:} ) ;
    % - Build data entry. Create `time` field with the relevant row of the 
    %  `dateTime` array. Create `table` field with days and data. For this,
    %  Convert seas distances to double, swells to double, pick max swells
    %  because sometimes they come in reverse order (e.g. in "Swell 2.5m 
    %  easing to 1.5m later"), and convert a cell array with day, seas, and
    %  swell information to struct array.
    if ~isempty( tokens )
        data{dId}.time = dateTime(bId, :) ;
        item = [tokens(:,1), num2cell( [str2double( tokens(:,2) ), ...
             max( str2double( tokens(:,4:5) ), [], 2)] ) ] ;
        data{dId}.table = cell2struct( item, {'day', 'sea', 'swell'}, 2 ) ;
    end
 end
 % - Truncate data to non-empty cells.
 data = data(1:dId) ;

If you debug it and place a break point on the "if ~isempty" line, you will see that for the first block which contains 'XCH10SYD', tokens is as follows:

 K>> tokens
 tokens = 
    'Thursday'    '1.0'    ''       '2.5'
    'Friday'      '1.0'    '2.5'    '1.5'
    'Saturday'    '0.5'    ''       '2.5'

because the text for the second day is "Swell 2.5m easing to 1.5m later". This is why we take the max and we don't care about the order later when we build the item.

Otherwise, I did the following:

As not all blocks are relevant but only those with the XCH10SYD code somewhere, we use a `data` cell array with its own counter, different from the block counter. When the code is not found, the loop moves on to the next block and increments the block counter, but the data counter stays unchanged.
The pattern is close to what you have done. I changed a few things: we now match numbers associated with the swell with ([\d\.]+) (one or more characters in the set of all digits ( \d ) plus the decimal point \.). This is more specific than "all that is not 'm'".

Note that, depending on your version of MATLAB, the call to MAX may fail because old versions don't support NaN entries (which happen for cells of tokens with empty content). If it happens, we can easily update the code so that only non-NaN entries are taken into account.
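
One possible version-independent workaround is sketched below: NaN entries are replaced by -Inf before taking the maximum, so they can never win the comparison. The `swellTokens` array is an illustrative stand-in for columns 4 and 5 of `tokens`, not data read from the file.

```matlab
% Stand-in for tokens(:,4:5): empty strings where nothing was matched.
swellTokens = { '',    '2.5' ; ...
                '2.5', '1.5' ; ...
                '',    '2.5' } ;

swellValues = str2double( swellTokens ) ;        % '' becomes NaN

% Replace NaN by -Inf so that it never wins the row-wise max.
swellValues(isnan( swellValues )) = -Inf ;
maxSwell = max( swellValues, [], 2 ) ;

% Rows where both entries were missing end up as -Inf; mark them as NaN.
maxSwell(isinf( maxSwell )) = NaN ;

disp( maxSwell ) ;
```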

Finally, you may not want this cell array of struct arrays for the output. I did it this way because you can have multiple entries per block (as each block contains a forecast for a variable number of days), but it is not a flat data structure. It is hence a bit cumbersome for extracting e.g. all swell data in one shot.
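
If a flat structure is more convenient, the cell array can be unrolled afterwards. The sketch below assumes the `time`/`table` layout produced by the loop above; the small `data` array built at the top is a hypothetical stand-in so that the snippet runs on its own, and the output names `issued`, `days`, `sea` and `swell` are illustrative.

```matlab
% Hypothetical stand-in for the `data` cell array built by the loop.
data = cell( 2, 1 ) ;
data{1}.time  = [2008 5 28 16 30 0] ;
data{1}.table = struct( 'day',   {'Thursday'; 'Friday'}, ...
                        'sea',   {1.0; 1.0}, ...
                        'swell', {3.0; 2.0} ) ;
data{2}.time  = [2008 5 29 16 30 0] ;
data{2}.table = struct( 'day', {'Friday'}, 'sea', {1.0}, 'swell', {2.5} ) ;

% Unroll into one row per (issue time, forecast day).
issued = [] ;  days = {} ;  sea = [] ;  swell = [] ;
for dId = 1 : numel( data )
    if isempty( data{dId} )
        continue                       % block with XCH10SYD but no tokens
    end
    tab = data{dId}.table ;
    n   = numel( tab ) ;
    issued = [issued; repmat( datenum( data{dId}.time ), n, 1 )] ;
    days   = [days;   {tab.day}.'] ;
    sea    = [sea;    [tab.sea].'] ;
    swell  = [swell;  [tab.swell].'] ;
end

disp( [issued, sea, swell] ) ;
```

With this layout, all swell values are available in one numeric column, aligned with the issue time of the forecast they came from.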

James McCarthy-Price on 13 Sep 2015

Edited: James McCarthy-Price on 13 Sep 2015

CombSeaSwell.mat

Hi Cedric,

Thanks again for your help.

Firstly, yes I've been using the debugger function - very helpful to understand how your code works. Also your assumption about max height of seas and swell are correct - I'm trying to find out what is the maximum furthermost forecast from this data. This projected forecast will be measured against the measured wave data I have, to calculate accuracy of BOM's marine forecasts.

I've created another function that uses the 3 day forecasts from your program, then combines the sea and swell height to give a combined sea state based on BOM's formula. I've attached the output matrix for the 2008 data building upon your regexp script - see "CombSeaSwell.mat". You'll note there are some large outliers. By using the debugger to inspect the individual tokens that regexp was pulling, I noticed these were caused by incorrect tokens, either from wind readings (e.g. 18/20 knots) or from the wrong swell/seas readings (e.g. "Swell at Cottesloe: to 1.0m.").

In these cases, regexp seems to miss the defined pattern.

These misreading examples can be seen on these dates (check out the tokens using debugger):

17/05/2008
21/05/2008
04/06/2008
You can see more quite clearly in the .mat file attached.

Here is a sample result from an incorrect scan of the regexp program:

*16:30 07/11/2008* 100002365 CRAFA=16:30 CGFCS=16:30
XCH10SYD-0296501221=16:31 ENOVAFM=16:30 CGIDF=16:30
*AIFS_ID=11400* CRAFA CGFCS XCH10SYD ENOVAFM PROD CGIDF
IDW11400 
Australian Government Bureau of Meteorology 
Western Australia 
Local Waters Forecast 
Yanchep to Mandurah and Offshore to Rottnest Island 
Issued at 4:30 pm WDT on Friday 7 November 2008 
Valid until midnight Monday 
Please Be Aware 
Wind gusts can be 40 percent stronger than the averages given here, and maximum 
wave may be up to twice the height. 
Warnings 
Nil. 
Synoptic Situation 
Expect SE'ly winds in the mornings and moderate to fresh afternoon seabreezes 
over the next few days with a high to the west and a trough developing just 
inland from the west coast. 
Forecasts: 
Friday until midnight: S/SW winds 15/20 knots tending S/SE 13/18 knots towards 
midnight. Seas 1.0m to 1.5m. Swell 2.0m to 2.5m, easing. 
Swell at Cottesloe:  to1.0m. 
Winds on Melville Water: Similar. 
Saturday: E'ly winds 10/15 knots shifting S/SW 13/18 knots in the afternoon. 
Seas 1.0m to 1.5m. Swell to 2.0m. 
Swell at Cottesloe:  to 0.9m. 
Winds on Melville Water: Similar. 
Sunday: S/SE winds 13/18 knots tending S/SW late morning and increasing to 18/23 
knots in the afternoon. Seas 1.0m to 1.5m. Swell to 2.0m. 
Monday: SE/SW winds 10/15 knots tending S/SW 13/18 knots late morning and 
increasing to 18/23 knots in the afternoon. 
Current Swell Observations: 
Rottnest Waverider Buoy:  1.8m 
Cottesloe Waverider Buoy: 0.7m 
Current swell height information is supplied by the Department for Planning and 
Infrastructure and is current only at the time of issue of this forecast 
The next routine forecast will be issued at 11:30 pm WDT Friday.

For some reason the regexp script is picking up the "to 18" from the "to 18/23 knots in the afternoon" for the Sunday.

I've spent the past few days intensely researching the regexp function, character classes and quantifiers. I understand exactly what you've written, and how it works.

I've also been playing with regexp online debuggers to experiment with different patterns such as this:

[\r\n](\S+day).*?Seas.*?to\s+(\d\.\d)m.*Swell.*?to\s(\d\.\d)m.*?to\s(\d\.\d)m

This online debugger suggests regexp takes a logical approach to extracting the chosen tokens - however, when inputting the same pattern into MATLAB, I'm getting back some unexpected tokens and can't figure out what is causing regexp to behave this way.

I've got a few questions:

Whilst I haven't changed the (\S+day) token, changes to subsequent parts of the pattern (such as the (\d\.\d) wave height tokens) are causing days to be skipped. I don't understand why.
Could we implement a positive lookahead to deal with the "Swell X.Xm to X.Xm" or "Seas X.Xm rising to X.Xm" (and variations) types of expressions? I.e. if only a single "Swell to 2.5m" height is listed, don't try to scrape more data, which could otherwise lead to incorrect values.
Was there a reason you chose to name tokens later, rather than using 'named tokens' patterns?

Again, thanks for your help! I'm really having difficulty understanding why the code is behaving how it is. regexp seems like a bit of a beast of a function; your help is much appreciated!

Cedric on 14 Sep 2015

Edited: Cedric on 14 Sep 2015

Hi James,

I will start by answering your last comments/questions, and I will come back to the code afterwards.

Online tools for evaluating regular expressions are nice, but they can mislead you. Regex(p) engines were implemented for almost all major programming languages and editors; they have a common basis, but they come in various "flavors" with their own specific features and behaviors.
I usually don't use named tokens in my answers on the forum, because they make patterns more complicated. You either knew regular expressions already, or made a considerable effort to learn them lately, so you can read patterns now. Most people will however stick to the basics (which generally don't include tokens) or even use patterns without understanding what they do/represent, and adding named tokens and structs would just add a few extra layers of confusion. I also observed that they can slow down REGEXP in cases where there is no advantage in outputting a struct (for your own purpose, you may just use a cell array of strings for storing column headers and a numeric array for storing data).
Short patterns should be favored as a general rule with regexp. Long ones accumulate side effects and take time to process. It is often more efficient to perform multiple calls to REGEXP (the MATLAB function) with small patterns than to perform a single one with a very complex pattern.
UNLESS - and this is my criterion for splitting patterns - you need to synchronize one match to one or more previous match(es).
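
For reference, the named-token form asked about could look like the following sketch. The sample sentence is taken from the forecast text in the question; the token names `day` and `seas` are illustrative choices, not part of the code above.

```matlab
% One day entry from the sample forecast, written on a single line.
txt = ['Thursday: S/SW winds 8/13 knots tending S/SE 5/10 knots in the ', ...
       'evening. Seas to 1.0m. Swell to 3.0m easing later.'] ;

% (?<name>expr) stores the match of expr in field `name` of the output.
pattern = '(?<day>\S+day):.*?Seas\s+to\s+(?<seas>[\d\.]+)m' ;
names   = regexp( txt, pattern, 'names' ) ;

seasHeight = str2double( names.seas ) ;
fprintf( '%s: seas to %.1f m\n', names.day, seasHeight ) ;
```

The output is a struct instead of a cell array, which reads well for one entry but still has to be converted when a numeric array is what is needed downstream.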

To illustrate, if you need to extract distances and swells, it is likely that you can create a pattern that extracts both with all their variants in one shot. It may even be fun to do it once for the intellectual challenge. But then in practice, if you don't need to synchronize the match of swells as being right after the match of distances (and control/minimize what is in between), it is likely to be simpler and more efficient to implement two calls to regexp using simpler patterns.

In our case though, we have to synchronize with the match of the day and this is why I kept a complex pattern. Yet, we always have the option to match/extract blobs of text associated with days (located at the beginning of lines), and to iterate through these blobs to match deeper content, the way we currently do with blobs of data separated by date/time tags.

The error that you spotted could lead us in this direction. What happens is exactly what you describe, for the following reason: the Monday that follows the Sunday blob of text has no sea/swell information; it is therefore not matched by the pattern and its content becomes part of the Sunday blob. As we designed the pattern to match all variants of the swell information, and to be flexible about where the 'to' appears (it can be in two places), the first part of the pattern relevant to swell captures the first 'to 2.0m', but a second 'to' from the Monday information is also matched by the second part of the pattern.

You know enough about regexp now to guess that we could find a way to manage this situation, using for example ordinal token operators (match the second 'to' only if the first is not found), but at this stage I would advise you to evaluate what you need to achieve and to update the approach accordingly (if needed). I would also suggest that you implement tests that spot locations where it is obvious that the matching failed.
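
Such a test can be as simple as flagging values that are missing or physically implausible. The sketch below uses illustrative values (the 18 mimics a wind speed captured by mistake), and the 10 m threshold is an assumption to adjust for the site.

```matlab
% Illustrative extracted values; 18 mimics a wind speed caught by mistake.
dayName = {'Friday'; 'Saturday'; 'Sunday'} ;
seas    = [1.5; 1.5; 1.5] ;
swell   = [2.5; 2.0; 18] ;

maxPlausible = 10 ;   % metres; assumed threshold, not from the source

% Flag missing values and values above the threshold.
isSuspect = isnan( seas ) | isnan( swell ) | ...
            seas > maxPlausible | swell > maxPlausible ;

for k = find( isSuspect ).'
    fprintf( 'Suspect match: %s, seas = %g, swell = %g\n', ...
             dayName{k}, seas(k), swell(k) ) ;
end
```

Logging the block index or issue time alongside each flagged row makes it easy to jump to the offending text in the debugger.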

I won't solve the problem for you but just describe the thought process and some options. First, you identified a place where matching fails (the sample that you provide above). I saved it in file sample_001.txt and processed it the way we do in the loop (copy-pasted and just replaced content in the call to REGEXP):

 >> content = fileread( 'sample_001.txt' ) ;
 >> pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)m.*?Swell\s*(to )?([\d\.]+)?.*?to\s+([\d\.]+)' ;
    tokens  = regexp( content, pattern, 'tokens' ) ;
    tokens  = vertcat( tokens{:} )
 tokens = 
    'Friday'      '1.5'    ''       '2.0'    '2.5'
    'Saturday'    '1.5'    'to '    '2.0'    '0.9'
    'Sunday'      '1.5'    'to '    '2.0'    '18'

Here we see that there is no Monday, and we get the 18 instead of nothing for the reason explained above. Looking more closely, we also see that on Saturday it is matching the

Swell at Cottesloe: to 0.9m.

by the way. We can work on this with a test on the fact that the 3rd token 'to ' is matched (ordinal token operator):

 >> tokens = regexp( content, '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)m.*?Swell\s*(to )?([\d\.]+)?.*?(?(3)|to\s+[\d\.]+)', 'tokens' ) ;
 >> tokens = vertcat( tokens{:} )
 tokens = 
    'Friday'      '1.5'    ''       '2.0'    'to 2.5'
    'Saturday'    '1.5'    'to '    '2.0'    ''      
    'Sunday'      '1.5'    'to '    '2.0'    ''

We see that the matching is correct, but that we need a little post-processing to extract the value from the last token when present:

 >> sscanf( tokens{1,end}, 'to %f' )
 ans =
    2.5000

.. which is easy to do in the loop over days that converts strings to numbers and picks the max. So this would be an option.
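
A minimal sketch of that post-processing step follows. The `tokens` array is typed in by hand to match the output shown above, so the snippet runs without the data file; the variable names `swell` and `second` are illustrative.

```matlab
% Tokens as returned by the pattern with the conditional (see above).
tokens = { 'Friday',   '1.5', '',    '2.0', 'to 2.5' ; ...
           'Saturday', '1.5', 'to ', '2.0', ''       ; ...
           'Sunday',   '1.5', 'to ', '2.0', ''       } ;

nDays = size( tokens, 1 ) ;
swell = nan( nDays, 1 ) ;
for k = 1 : nDays
    % Second value is only present when the last token is not empty;
    % sscanf returns an empty array otherwise.
    second   = sscanf( tokens{k,end}, 'to %f' ) ;
    swell(k) = max( [str2double( tokens{k,4} ); second(:)] ) ;
end

disp( swell ) ;
```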

HOWEVER: we eliminate a case of mismatch, from a situation that should maybe not happen in the first place. Maybe you need to know that there is a Monday with no data. Maybe you need to extract only seas information when present, or only swell information when present.

To do so, we would have to split blocks into days, and record/test what is associated to each day:

 >> [match, split] = regexp( content, '[\r\n]\S+day.*?', 'match', 'split' )
 match = 
    [1x7 char]    [1x9 char]    [1x7 char]    [1x7 char]
split = 
    [1x790 char]    [1x202 char]    [1x176 char]    [1x136 char]    [1x441 char]

Now you see that we have four blocks:

 >> match{:}
 ans =
     Friday
 ans =
     Saturday
 ans =
     Sunday
 ans =
     Monday

with the corresponding blobs of text stored in cell array split. This may make "things" easier for managing all variants, avoiding interference, etc., especially when you see that the last blob (for Monday) contains everything up to the end of the block:

 >> split{end}
 ans =
 : SE/SW winds 10/15 knots tending S/SW 13/18 knots late morning and 
 increasing to 18/23 knots in the afternoon. 
 Current Swell Observations: 
 Rottnest Waverider Buoy:  1.8m 
 Cottesloe Waverider Buoy: 0.7m 
 Current swell height information is supplied by the Department for Planning and 
 Infrastructure and is current only at the time of issue of this forecast 
 The next routine forecast will be issued at 11:30 pm WDT Friday.

Here, maybe you want to extract the buoy information, maybe not, and maybe you want to be sure that it is not extracted by mistake.

So now you have two approaches: the current one, which in pseudo-code is:

 match data blocks based on time/date stamp 
 iterate through data blocks
    if no XCH10SYD in block -> loopback
    extract day/seas/swell information in one shot with complex pattern
    test/store relevant values

and a less concise but probably simpler and more robust

 match data blocks based on time/date stamp 
 iterate through data blocks
    if no XCH10SYD in block -> loopback
    match day blocks based on presence of day at beginning of line
    iterate through day blocks
        extract seas information if present with simple pattern
        extract swell information if present with simple pattern
        test/store relevant values
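
The inner part of this second approach (the loop over day blocks) could be sketched as follows. This is one possible reading of the pseudo-code, not a tested solution for the whole file: the `block` string is a shortened copy of the problematic sample above, and the choices to drop the "Swell at Cottesloe" lines and the current observations, and to keep the largest "X.Xm" value, are assumptions based on the discussion.

```matlab
% Shortened copy of the forecast part of the sample block (hypothetical input).
block = strjoin( { ...
    'Forecasts: ', ...
    'Friday until midnight: S/SW winds 15/20 knots tending S/SE 13/18 knots towards ', ...
    'midnight. Seas 1.0m to 1.5m. Swell 2.0m to 2.5m, easing. ', ...
    'Swell at Cottesloe:  to1.0m. ', ...
    'Saturday: E''ly winds 10/15 knots shifting S/SW 13/18 knots in the afternoon. ', ...
    'Seas 1.0m to 1.5m. Swell to 2.0m. ', ...
    'Swell at Cottesloe:  to 0.9m. ', ...
    'Sunday: S/SE winds 13/18 knots tending S/SW late morning and increasing to 18/23 ', ...
    'knots in the afternoon. Seas 1.0m to 1.5m. Swell to 2.0m. ', ...
    'Monday: SE/SW winds 10/15 knots tending S/SW 13/18 knots late morning and ', ...
    'increasing to 18/23 knots in the afternoon. ', ...
    'Current Swell Observations: ', ...
    'Rottnest Waverider Buoy:  1.8m' }, char( 10 ) ) ;

% Tier 1: split into day blocks (day name at the beginning of a line).
[dayNames, dayTexts] = regexp( block, '(?<=[\r\n])\S+day', 'match', 'split' ) ;
dayTexts(1) = [] ;                     % text before the first day

nDays = numel( dayNames ) ;
seas  = nan( nDays, 1 ) ;
swell = nan( nDays, 1 ) ;
for k = 1 : nDays
    txt = dayTexts{k} ;
    % Drop content that must not be mistaken for the open-water forecast.
    txt = regexprep( txt, 'Current Swell Observations.*', '' ) ;
    txt = regexprep( txt, 'Swell at Cottesloe:[^\r\n]*', '' ) ;
    % Tier 2a: seas = largest "X.Xm" between 'Seas' and 'Swell' (or the end).
    part = regexp( txt, 'Seas(.*?)(?=Swell|$)', 'tokens', 'once' ) ;
    if ~isempty( part )
        nums = regexp( part{1}, '\d+(\.\d+)?(?=m)', 'match' ) ;
        if ~isempty( nums ), seas(k) = max( str2double( nums ) ) ; end
    end
    % Tier 2b: swell = largest "X.Xm" after 'Swell'.
    part = regexp( txt, 'Swell(.*)', 'tokens', 'once' ) ;
    if ~isempty( part )
        nums = regexp( part{1}, '\d+(\.\d+)?(?=m)', 'match' ) ;
        if ~isempty( nums ), swell(k) = max( str2double( nums ) ) ; end
    end
end

for k = 1 : nDays
    fprintf( '%-9s seas = %g, swell = %g\n', dayNames{k}, seas(k), swell(k) ) ;
end
```

Days without seas or swell information, such as the Monday here, stay as NaN instead of borrowing numbers from neighbouring text, which is what makes this variant easier to check.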

Hope it helps!
