Australian Bureau of Meteorology (BOM): Extracting Data from historic Local Waters Forecast

Dear Matlab Forums,
As part of my thesis I'm investigating the accuracy of BOM's forecast data. I'm trying to see how accurate their maximum predictions are, to determine if evasive action may need to be taken for fixed marine structures. I have the historic wavebuoy data, but want to know if BOM's forecasts have been accurate.
BOM is absolutely fantastic at providing data in erratic text format and seems not to care much about useable CSV formats - it's killing me.
What I want from the data:
  • The date
  • The most far reaching swell and seas size forecasts
The data comes in the following formats:
*11:30 28/05/2008* 100398833 CGWEB=11:30
*AIFS_ID=11400* CGWEB
IDW11400
Australian Government Bureau of Meteorology
Western Australia
Local Waters Forecast
Yanchep to Mandurah and Offshore to Rottnest Island
Issued at 11:30 am WST on Wednesday 28 May 2008
Valid until midnight Friday
Please Be Aware
Wind gusts can be 40 percent stronger than the averages given here, and maximum
wave may be up to twice the height.
Warnings
Nil.
Synoptic Situation
A moderate cold front is currently passing over Perth. Fresh W/SW winds behind
front will ease during the afternoon and evening.
Forecasts:
Wednesday until midnight: W/SW winds 18/23 knots easing to 13/18 knots during
the afternoon and becoming SW'ly 10/15 knots in the evening. Seas 1.5m to 2.0m.
Swell rising to 3.0m.
Swell at Cottesloe: rising to 1.0m.
Winds on Melville Water: similar.
Thursday: S/SW winds 8/13 knots tending S/SE 5/10 knots in the evening. Seas to
1.0m. Swell to 3.0m easing later.
Friday: E/NE winds 8/13 knots tending N/NE 10/15 knots in the evening and
increasing to N/NE 15/20 knots towards midnight.
Current Swell Observations:
Rottnest Waverider Buoy: 1.7m
Cottesloe Waverider Buoy: 0.6m
Current swell height information is supplied by the Department for Planning and
Infrastructure and is current only at the time of issue of this forecast
The next routine forecast will be issued at 4:30 pm WST Wednesday.
*16:30 28/05/2008* 100444518 CRAFA=16:30 CGFCS=16:30
XCH10SYD-0296501221=16:31 ENOVAFM=16:30 CGIDF=16:30
*AIFS_ID=11400* CRAFA CGFCS XCH10SYD ENOVAFM PROD CGIDF
IDW11400
Australian Government Bureau of Meteorology
Western Australia
Local Waters Forecast
Yanchep to Mandurah and Offshore to Rottnest Island
Issued at 4:30 pm WST on Wednesday 28 May 2008
Valid until midnight Saturday
Please Be Aware
Wind gusts can be 40 percent stronger than the averages given here, and maximum
wave may be up to twice the height.
Warnings
Nil.
Synoptic Situation
A moderate cold front is currently passing over Perth. Fresh W/SW winds behind
front will ease during the evening.
Forecasts:
Wednesday until midnight: W/SW winds 15/20 knots easing to SW'ly 10/15 knots
during the evening. Seas 1.0m to 1.5m. Swell to 2.5m to 3.5m.
Swell at Cottesloe: to 1.0m.
Winds on Melville Water: similar.
Thursday: S'ly winds 8/13 knots tending SE'ly 8/13 knots in the evening. Inshore
winds tending E/SE 5/10 knots for a period early to mid morning. Seas to 1.0m.
Swell 2.5m to 3.0m
Swell at Cottesloe: to 1.0m.
Winds on Melville Water: will be similar.
Friday: E'ly winds 8/13 knots tending NE'ly 10/15 knots towards midnight. Seas
to 1.0m. Swell to 2.0m, easing.
Saturday: N'ly winds 13/18 knots increasing to NW'ly 20/25 knots during the
morning.
Current Swell Observations:
Rottnest Waverider Buoy: 2.2m
Cottesloe Waverider Buoy: 0.8m
Current swell height information is supplied by the Department for Planning and
Infrastructure and is current only at the time of issue of this forecast
The next routine forecast will be issued at 11:30 pm WST Wednesday.
I've noticed the XCH10SYD code is used whenever a swell & seas forecast is produced, which makes it a great identifier of the information I'm looking for. I'm therefore trying to get my program to search through the 96,000 lines of ".txt" for the "XCH10SYD" classifier. When it's found, the program should save the relevant time and date (listed a few lines above), then save the furthest forecast's date (in this example it's Friday) and the associated maximum swell and seas figures.
Things to note:
  • Sometimes seas/swell are listed as "Seas 1.0m", other times as "Seas to 1.0m", and sometimes as "Seas 1.0m to 2.0m". In the latter case, I'm only interested in the maximum value.
  • Sometimes, when a particularly long forecast is produced, the number of lines of text changes, i.e. the code can't be hard coded to extract data from a particular spot, but has to be flexible enough to actively search for the numerical data.
  • The HH:MM DD/MM/YYYY and AIFS identifiers seem to be consistent in their location and format (perhaps the only consistent aspect of the txt file).
If anyone can even provide advice on where to start, it would be much appreciated. I'm really not a pro in this field, but keen to learn. This is just way beyond my current skillset. Thank you!

Accepted Answer

Cedric on 2 Sep 2015
Edited: Cedric on 7 Sep 2015
This is a good candidate for using regular expressions and pattern matching. It would take an hour of your time reading section 2-23 of MATLAB Programming Fundamentals (available here) to get started with regular expressions (as well as quite a bit of experimenting). I develop an example below, which is probably not exactly what you need, because your sample is too short for me to experiment, but I can help you refine the approach.
Assuming that all relevant blocks start with a date/time with the following structure
*11:30 28/05/2008*
we first split the file content in blocks, using this structure as a separator:
% - Read file content in one shot.
content = fileread( 'data.txt' ) ;
% - Split time/data blocks.
pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
[timeBlocks, dataBlocks] = regexp( content, pattern, 'match', 'split' ) ;
dataBlocks(1) = [] ;
It outputs a cell array of time stamps and a matching cell array of block data:
>> timeBlocks
timeBlocks=
'*11:30 28/05/2008*' '*16:30 28/05/2008*'
>> dataBlocks
dataBlocks =
[1x1459 char] [1x1458 char]
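Since the question only cares about issues that carry the XCH10SYD code, one possible next step is to drop the blocks that lack it while keeping both cell arrays aligned. This is a minimal sketch, assuming a shortened hand-made sample in place of data.txt; the variable name keep is illustrative only.

```matlab
% Shortened sample (assumption): two headers, only the second has XCH10SYD.
content = sprintf( [ ...
    '*11:30 28/05/2008* 100398833 CGWEB=11:30\n', ...
    '*AIFS_ID=11400* CGWEB\n', ...
    'IDW11400\n', ...
    '*16:30 28/05/2008* 100444518 CRAFA=16:30 CGFCS=16:30\n', ...
    'XCH10SYD-0296501221=16:31 ENOVAFM=16:30 CGIDF=16:30\n', ...
    '*AIFS_ID=11400* CRAFA CGFCS XCH10SYD ENOVAFM PROD CGIDF\n' ] ) ;

% Same split as above.
pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
[timeBlocks, dataBlocks] = regexp( content, pattern, 'match', 'split' ) ;
dataBlocks(1) = [] ;

% Logical mask: true where the block text contains the code.
keep = ~cellfun( @isempty, strfind( dataBlocks, 'XCH10SYD' )) ;

% Filter both arrays with the same mask so that they stay in sync.
timeBlocks = timeBlocks(keep) ;
dataBlocks = dataBlocks(keep) ;
```

Only the 16:30 block survives here. Applying the mask before the date conversion keeps the rows of dateTime aligned with the remaining blocks.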
Then we convert date/time data into whatever you need, using e.g. DATEVEC or DATENUM:
dateTime = datevec( timeBlocks, '*HH:MM dd/mm/yyyy' ) ;
which outputs
>> dateTime
dateTime =
2008 5 28 11 30 0
2008 5 28 16 30 0
Finally, we iterate through data blocks and extract data
nBlocks = numel( dataBlocks ) ;
data = cell( nBlocks, 1 ) ;
for bId = 1 : nBlocks
    % - Extract days and distances.
    pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)' ;
    tokens = regexp( dataBlocks{bId}, pattern, 'tokens' ) ;
    tokens = vertcat( tokens{:} ) ;
    % - Convert distances to double, and cell array to struct array.
    if ~isempty( tokens )
        tokens(:,2) = num2cell( str2double( tokens(:,2) )) ;
        data{bId} = cell2struct( tokens, {'day', 'distance'}, 2 ) ;
    end
end
With that, we get:
>> data
data =
[2x1 struct]
[2x1 struct]
>> data{1}
ans =
2x1 struct array with fields:
day
distance
>> data{1}(1)
ans =
day: 'Wednesday'
distance: 2
>> data{1}(2)
ans =
day: 'Thursday'
distance: 1
This illustrates one way to do it. Pattern matching can be improved, and you will want to modify the structure of the output for it to fit with your needs.
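One possible refinement concerns the three wordings listed in the question ("Seas 1.0m", "Seas to 1.0m", "Seas 1.0m to 2.0m"). The pattern above captures the value that follows a "to", so with the first wording it would skip ahead to the next "to" in the text. A hedged alternative, assuming the seas sentence has already been isolated, is to grab every number directly followed by "m" and keep the largest. The sample strings and the name maxSeas are made up for illustration.

```matlab
% The three wordings mentioned in the question (hand-written samples).
samples = { 'Seas 1.0m.', 'Seas to 1.0m.', 'Seas 1.5m to 2.0m.' } ;
maxSeas = nan( size( samples )) ;

for k = 1 : numel( samples )
    % Every run of digits/dots immediately followed by 'm' is one token.
    tokens = regexp( samples{k}, '([\d.]+)m', 'tokens' ) ;
    if ~isempty( tokens )
        % Flatten the nested cell array, convert, and keep the maximum.
        maxSeas(k) = max( str2double( [tokens{:}] )) ;
    end
end
```

With these samples maxSeas holds 1, 1 and 2. Entries stay NaN when no height is found, which makes failed matches easy to spot afterwards.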
Let me know if you have any question.
PS: the best that you can do to understand is to run the code step by step using the debugger, and see what happens each time a line is executed, e.g. when we get tokens it is a cell array of cell arrays, then we VERTCAT its content to transform it into a simple/flat cell array, then we convert to double its second column, etc.
To use the debugger, set a break point by clicking on the dash at the right of the line number, execute the code (a green arrow will appear, indicating the next line to execute), and click on the Step button. At each step, you can use the command window/workspace/editor (mouse over) to see the state/content of variables.
EDIT: just a few extra explanations about patterns.
The first pattern
pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*'
is fairly easy to understand:
  • \* means: the character *; it has to be escaped because the star otherwise has a special meaning: it is a quantifier that means "zero or more times the expression that precedes".
  • \d means: any digit 0-9
  • \d{4} means: four times \d
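To experiment with these building blocks, the same pattern can be given parentheses so that each numeric field comes back as a token. This is only a sketch on one header line copied from the sample; the names parts, values and dateVector are illustrative.

```matlab
% One header line from the sample data.
header = '*11:30 28/05/2008* 100398833 CGWEB=11:30' ;

% Same pattern as above, with each field wrapped in a token.
pattern = '\*(\d\d):(\d\d) (\d\d)/(\d\d)/(\d{4})\*' ;

% 'once' returns the tokens of the first match as a 1x5 cell array.
parts  = regexp( header, pattern, 'tokens', 'once' ) ;
values = str2double( parts ) ;      % order: HH MM dd mm yyyy

% Reorder into the [yyyy mm dd HH MM SS] layout of a date vector.
dateVector = [ values(5) values(4) values(3) values(1) values(2) 0 ] ;
```

For this header, dateVector equals [2008 5 28 11 30 0], the same first row as the DATEVEC output shown earlier.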
The second pattern
pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)' ;
is more complex. REGEXP matches the whole pattern but extracts only the parts in parentheses (called tokens).
  • |[\r
  5 Comments
James McCarthy-Price on 13 Sep 2015
Edited: James McCarthy-Price on 13 Sep 2015
Hi Cedric,
Thanks again for your help.
Firstly, yes, I've been using the debugger function, which is very helpful for understanding how your code works. Also, your assumption about the max height of seas and swell is correct: I'm trying to find out what the maximum furthermost forecast is from this data. This projected forecast will be compared against the measured wave data I have, to calculate the accuracy of BOM's marine forecasts.
I've created another function that uses the 3 day forecasts from your program, then combines the sea and swell height to give a combined sea state based on BOM's formula. I've attached the output matrix for the 2008 data building upon your regexp script - see "CombSeaSwell.mat". You'll note there are some large outliers. By using the debugger to inspect the individual tokens that regexp was pulling, I noticed these were caused by incorrect tokens, either by taking wind readings (e.g. 18/20 knots) or by taking incorrect swell/seas readings (e.g. "Swell at Cottesloe: to 1.0m.").
In these cases, regexp seems to miss the defined pattern.
These misreading examples can be seen on these dates (check out the tokens using debugger):
  • 17/05/2008
  • 21/05/2008
  • 04/06/2008
  • You can see more quite clearly in the .mat file attached.
Here is a sample result from an incorrect scan of the regexp program:
*16:30 07/11/2008* 100002365 CRAFA=16:30 CGFCS=16:30
XCH10SYD-0296501221=16:31 ENOVAFM=16:30 CGIDF=16:30
*AIFS_ID=11400* CRAFA CGFCS XCH10SYD ENOVAFM PROD CGIDF
IDW11400
Australian Government Bureau of Meteorology
Western Australia
Local Waters Forecast
Yanchep to Mandurah and Offshore to Rottnest Island
Issued at 4:30 pm WDT on Friday 7 November 2008
Valid until midnight Monday
Please Be Aware
Wind gusts can be 40 percent stronger than the averages given here, and maximum
wave may be up to twice the height.
Warnings
Nil.
Synoptic Situation
Expect SE'ly winds in the mornings and moderate to fresh afternoon seabreezes
over the next few days with a high to the west and a trough developing just
inland from the west coast.
Forecasts:
Friday until midnight: S/SW winds 15/20 knots tending S/SE 13/18 knots towards
midnight. Seas 1.0m to 1.5m. Swell 2.0m to 2.5m, easing.
Swell at Cottesloe: to1.0m.
Winds on Melville Water: Similar.
Saturday: E'ly winds 10/15 knots shifting S/SW 13/18 knots in the afternoon.
Seas 1.0m to 1.5m. Swell to 2.0m.
Swell at Cottesloe: to 0.9m.
Winds on Melville Water: Similar.
Sunday: S/SE winds 13/18 knots tending S/SW late morning and increasing to 18/23
knots in the afternoon. Seas 1.0m to 1.5m. Swell to 2.0m.
Monday: SE/SW winds 10/15 knots tending S/SW 13/18 knots late morning and
increasing to 18/23 knots in the afternoon.
Current Swell Observations:
Rottnest Waverider Buoy: 1.8m
Cottesloe Waverider Buoy: 0.7m
Current swell height information is supplied by the Department for Planning and
Infrastructure and is current only at the time of issue of this forecast
The next routine forecast will be issued at 11:30 pm WDT Friday.
For some reason the regexp script is picking up the "to 18" from the "to 18/23 knots in the afternoon" for the Sunday.
I've spent the past few days intensely researching the regexp function, character classes and quantifiers. I understand exactly what you've written, and how it works.
I've also been playing with regexp online debuggers to experiment with different patterns such as this:
[\r\n](\S+day).*?Seas.*?to\s+(\d\.\d)m.*Swell.*?to\s(\d\.\d)m.*?to\s(\d\.\d)m
This online debugger suggests regexp takes a logical approach to extracting the chosen tokens. However, when I input the same pattern into MATLAB, I get back some unexpected tokens and can't figure out what is causing regexp to behave this way.
I've got a few questions:
  • Whilst I haven't changed the (\S+day) token, changes to subsequent parts of the pattern (such as the (\d\.\d) wave height tokens) are causing days to be skipped. I don't understand why.
  • Could we implement a positive lookahead to deal with the "Swell X.Xm to X.Xm" or "Seas X.Xm rising to X.Xm" (and variations) types of expressions? I.e. if only a single "Swell to 2.5m" height is listed, don't try to scrape more data, which could otherwise lead to incorrect values.
  • Was there a reason you chose to name tokens later, rather than using 'named tokens' patterns?
Again, thanks for your help! I'm really having difficulty understanding why the code is behaving how it is. regexp seems like a bit of a beast of a function, and your help is much appreciated!
Cedric on 14 Sep 2015
Edited: Cedric on 14 Sep 2015
Hi James,
I will start by answering your last comments/questions, and I will come back to the code afterwards.
  • Online tools for evaluating regular expressions are nice, but they can mislead you. Regex(p) engines were implemented for almost all major programming languages and editors; they have a common basis, but they come in various "flavors" with their own specific features and behaviors.
  • I usually don't use named tokens in my answers on the forum, because they make patterns more complicated. You either knew regular expressions already, or made a considerable effort to learn them lately, so you can read patterns now. Most people will however stick to the basics (which generally don't include tokens) or even use patterns without understanding what they do/represent, and adding named tokens and structs would just add a few extra layers of confusion. I also observed that they can slow down REGEXP in cases where there is no advantage in outputting a struct (for your own purpose, you may just use a cell array of strings for storing column headers and a numeric array for storing data).
  • Short patterns should be favored as a general rule with regexp. Long ones accumulate side effects and take time to process. It is often more efficient to perform multiple calls to REGEXP (the MATLAB function) with small patterns than to perform a single one with a very complex pattern.
  • UNLESS - and this is my criterion for splitting patterns - you need to synchronize one match to one or more previous match(es).
To illustrate, if you need to extract distances and swells, it is likely that you can create a pattern that extracts both with all their variants in one shot. It may even be fun to do it once for the intellectual challenge. But then in practice, if you don't need to synchronize the match of swells as being right after the match of distances (and control/minimize what is in between), it is likely to be simpler and more efficient to implement two calls to regexp using simpler patterns.
In our case though, we have to synchronize with the match of the day and this is why I kept a complex pattern. Yet, we always have the option to match/extract blobs of text associated with days (located at the beginning of lines), and to iterate through these blobs to match deeper content, the way we currently do with blobs of data separated by date/time tags.
The error that you spotted could lead us in this direction. What happens is exactly what you describe, for the following reason: the Monday entry that follows the Sunday blob of text has no seas/swell information; it is therefore not matched by the pattern, and its content becomes part of the Sunday blob. Because we designed the pattern to be flexible about where the 'to' appears in the swell information (it can be in two places), the first swell-related part of the pattern captures the first 'to 2.0m', but the second part then also matches a 'to' from the Monday text.
You know enough about regexp now to guess that we could find a way to manage this situation, using for example ordinal token operators (match the second 'to' only if the first was not found), but at this stage I would advise you to evaluate what you need to achieve and to update the approach accordingly (if needed). I would also suggest that you implement tests that spot locations where it is obvious that the matching failed.
I won't solve the problem for you but just describe the thought process and some options. First, you identified a place where matching fails (the sample that you provide above). I saved it in file sample_001.txt and processed it the way we do in the loop (copy-pasted and just replaced content in the call to REGEXP):
>> content = fileread( 'sample_001.txt' ) ;
>> pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)m.*?Swell\s*(to )?([\d\.]+)?.*?to\s+([\d\.]+)' ;
>> tokens = regexp( content, pattern, 'tokens' ) ;
>> tokens = vertcat( tokens{:} )
tokens =
'Friday' '1.5' '' '2.0' '2.5'
'Saturday' '1.5' 'to ' '2.0' '0.9'
'Sunday' '1.5' 'to ' '2.0' '18'
Here we see that there is no Monday, and that we get the 18 instead of nothing, for the reason explained above. Looking more closely, we also see that on Saturday the last token comes from the line
Swell at Cottesloe: to 0.9m.
We can deal with this by testing whether the 3rd token 'to ' was matched (ordinal token operator):
>> tokens = regexp( content, '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)m.*?Swell\s*(to )?([\d\.]+)?.*?(?(3)|to\s+[\d\.]+)', 'tokens' ) ;
>> tokens = vertcat( tokens{:} )
tokens =
'Friday' '1.5' '' '2.0' 'to 2.5'
'Saturday' '1.5' 'to ' '2.0' ''
'Sunday' '1.5' 'to ' '2.0' ''
We see that the matching is correct, but that we need a little post-processing to extract the value from the last token when present:
>> sscanf( tokens{1,end}, 'to %f' )
ans =
2.5000
which is easy to do in the loop over days that converts strings to numbers and picks the max. So this would be an option.
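That post-processing loop could look roughly like the sketch below. The tokens array is typed in by hand from the output shown above rather than produced by REGEXP, and the name maxSwell is illustrative only.

```matlab
% Tokens copied by hand from the REGEXP output above.
tokens = { 'Friday',   '1.5', '',    '2.0', 'to 2.5' ; ...
           'Saturday', '1.5', 'to ', '2.0', ''       ; ...
           'Sunday',   '1.5', 'to ', '2.0', ''       } ;

nDays    = size( tokens, 1 ) ;
maxSwell = nan( nDays, 1 ) ;

for dId = 1 : nDays
    % 4th column: first swell value, always a plain number here.
    first = str2double( tokens{dId,4} ) ;
    % 5th column: either empty or 'to X.X'; SSCANF returns [] when empty.
    second = sscanf( tokens{dId,5}, 'to %f' ) ;
    % Concatenating with [] leaves only the first value.
    maxSwell(dId) = max( [first ; second] ) ;
end
```

For these three rows maxSwell is [2.5; 2; 2].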
HOWEVER: we eliminate a case of mismatch, from a situation that should maybe not happen in the first place. Maybe you need to know that there is a Monday with no data. Maybe you need to extract only seas information when present, or only swell information when present.
To do so, we would have to split blocks into days, and record/test what is associated to each day:
>> [match, split] = regexp( content, '[\r\n]\S+day.*?', 'match', 'split' )
match =
[1x7 char] [1x9 char] [1x7 char] [1x7 char]
split =
[1x790 char] [1x202 char] [1x176 char] [1x136 char] [1x441 char]
Now you see that we have four blocks:
>> match{:}
ans =
Friday
ans =
Saturday
ans =
Sunday
ans =
Monday
with the corresponding blobs of text stored in cell array split. This may make "things" easier for managing all variants, avoiding interference, etc., especially when you see that the last blob (for Monday) is the whole remainder of the block:
>> split{end}
ans =
: SE/SW winds 10/15 knots tending S/SW 13/18 knots late morning and
increasing to 18/23 knots in the afternoon.
Current Swell Observations:
Rottnest Waverider Buoy: 1.8m
Cottesloe Waverider Buoy: 0.7m
Current swell height information is supplied by the Department for Planning and
Infrastructure and is current only at the time of issue of this forecast
The next routine forecast will be issued at 11:30 pm WDT Friday.
Here, maybe you want to extract the buoy information, maybe not, and maybe you want to be sure that it is not extracted by mistake.
So now you have two approaches: the current one, which in pseudo-code is
    match data blocks based on time/date stamp
    iterate through data blocks
        if no XCH10SYD in block -> skip to next block
        extract day/seas/swell information in one shot with complex pattern
        test/store relevant values
and a less concise but probably simpler and more robust one:
    match data blocks based on time/date stamp
    iterate through data blocks
        if no XCH10SYD in block -> skip to next block
        match day blocks based on presence of day at beginning of line
        iterate through day blocks
            extract seas information if present with simple pattern
            extract swell information if present with simple pattern
            test/store relevant values
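As a rough illustration of the second approach, here is a sketch on a shortened, hand-made version of the 07/11/2008 sample. Several choices are assumptions rather than part of the answer above: the results layout, cutting each block at "Current Swell Observations" so that buoy readings are never read as forecasts, and the simple phrase pattern, which only covers the wordings seen in the samples.

```matlab
% Shortened sample with a single block (assumption: real files hold many).
content = sprintf( [ ...
    '*16:30 07/11/2008* 100002365 CRAFA=16:30 CGFCS=16:30\n', ...
    'XCH10SYD-0296501221=16:31 ENOVAFM=16:30 CGIDF=16:30\n', ...
    'Forecasts:\n', ...
    'Friday until midnight: S/SW winds 15/20 knots tending S/SE 13/18 knots towards\n', ...
    'midnight. Seas 1.0m to 1.5m. Swell 2.0m to 2.5m, easing.\n', ...
    'Swell at Cottesloe: to1.0m.\n', ...
    'Sunday: S/SE winds 13/18 knots tending S/SW late morning and increasing to 18/23\n', ...
    'knots in the afternoon. Seas 1.0m to 1.5m. Swell to 2.0m.\n', ...
    'Monday: SE/SW winds 10/15 knots tending S/SW 13/18 knots late morning and\n', ...
    'increasing to 18/23 knots in the afternoon.\n', ...
    'Current Swell Observations:\n', ...
    'Rottnest Waverider Buoy: 1.8m\n' ] ) ;

% Match data blocks based on the time/date stamp.
stampPattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
[stamps, blocks] = regexp( content, stampPattern, 'match', 'split' ) ;
blocks(1) = [] ;

keywords = { 'Seas', 'Swell(?! at)' } ;   % (?! at) skips "Swell at Cottesloe"
results  = cell( 0, 4 ) ;                 % stamp, day, max seas, max swell

for bId = 1 : numel( blocks )
    block = blocks{bId} ;
    if isempty( strfind( block, 'XCH10SYD' ))
        continue ;                        % no seas/swell forecast in this block
    end
    % Drop the buoy observations at the end of the block.
    block = regexprep( block, 'Current Swell Observations.*', '' ) ;
    % Match day blocks based on a day name at the beginning of a line.
    [dayTokens, dayTexts] = regexp( block, '[\r\n](\S+day)', 'tokens', 'split' ) ;
    dayTexts(1) = [] ;
    for dId = 1 : numel( dayTexts )
        heights = nan( 1, 2 ) ;           % NaN marks "not present"
        for k = 1 : 2
            % Simple pattern: keyword, first height, optional "to" height.
            phrase = regexp( dayTexts{dId}, ...
                [ keywords{k}, '[^\d.]*[\d.]+m(?:\s+to\s+[\d.]+m)?' ], ...
                'match', 'once' ) ;
            if ~isempty( phrase )
                numbers = regexp( phrase, '([\d.]+)m', 'tokens' ) ;
                heights(k) = max( str2double( [numbers{:}] )) ;
            end
        end
        results(end+1,:) = { stamps{bId}, dayTokens{dId}{1}, heights(1), heights(2) } ;
    end
end
```

On this sample, results has three rows: Friday (1.5, 2.5), Sunday (1.5, 2.0) and Monday (NaN, NaN). Monday is kept with missing values instead of leaking into Sunday.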
Hope it helps!
