Best way to parse text file

Jesse Program on 7 Apr 2015
Edited: per isakson on 14 Apr 2015
I am trying to write some code to parse large text files of weather data. The text file is organized into blocks by state. The data I'm working with is here
Within each block (state), the .txt file formatting is consistent and should be fairly straightforward to parse. However, each block has a different number of rows (the number of reporting weather stations varies across states and over time), and it is not clear to me how the blocks are delimited from one another. Ultimately, I'm trying to arrive at an N x 3 matrix with columns for state, county, and total snowfall.
I'm not sure which functions are best for this purpose. Can someone point me in the right direction? I don't know where to start....
Thanks
  5 Comments
Jesse Program on 10 Apr 2015
Hi Per,
I saw you posted some code earlier, but now it's gone?
Thanks
per isakson on 10 Apr 2015
Edited: per isakson on 10 Apr 2015
Yes, but there was a major "issue" and no time to fix it. I have now posted new code.


Accepted Answer

per isakson on 10 Apr 2015
Edited: per isakson on 14 Apr 2015
NOAA's National Climatic Data Center (NCDC) (and others) publish important data in text files, which cannot be easily read with Matlab. Should I blame NOAA or The Mathworks?
The OP provided a sample file, which contains weather data for March 2015. I have counted columns and positions, and made mistakes doing so. Surely NOAA provides a description of the file format, but I didn't find it. It would have helped me write better code.
  • The file has fixed width columns and the values are padded with spaces.
  • The first column contains numerical data with leading spaces.
  • Missing numerical data are replaced by "-9999" and missing string data are replaced by spaces.
  • The sections have different numbers of columns.
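Because missing numbers are coded as -9999, they have to be masked before any total is computed. A minimal sketch with made-up values (not taken from the NOAA file) showing two ways to do that:

```matlab
% Made-up daily values for one station; -9999 marks a missing day.
data = [0 0 1.5 -9999 2 -9999];

% Option 1: drop the missing entries before summing.
total = sum( data(data ~= -9999) );

% Option 2: replace them with NaN so they cannot be mistaken for data.
data_nan = data;
data_nan(data_nan == -9999) = NaN;
n_missing = sum( isnan(data_nan) );
```

Whether the file uses other sentinel values besides -9999 is not known here, so any further codes would need the same treatment.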
The Matlab documentation isn't enough to write the code. One has to test and accept that Matlab will have the final word. Example: My interpretation of the documentation (R2013b) made me believe that
textscan( str, '%98c%*[^\n]' )
reads the first 98 characters of a string. However, that depends on whether there are leading spaces.
textscan( str, '%98c%*[^\n]', 'Whitespace','' )
does indeed read the first 98 characters of a string. But then I cannot use the spaces in the second part of the row. Issues like this one lead me down a trial-and-error path, which certainly does not end in the best code.
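One way to sidestep the whitespace question is to split the text into rows and slice each row by character position. This is a different technique from the textscan calls above, shown only as a sketch; the sample text and the width of 10 characters are made up for illustration.

```matlab
% Made-up sample: a 10-character fixed-width part followed by numbers.
str = sprintf('  12 ABCDE 1 2 3\n 345 FGHIJ 4 5 6\n');
width = 10;   % assumed width of the fixed part (98 in the real file)

% Split into rows and discard empty rows (e.g. after the last newline).
rows = regexp( str, '\r?\n', 'split' );
rows = rows( ~cellfun(@isempty, rows) );

% Slice by position: leading spaces are kept because nothing is skipped.
fixed = cellfun( @(r) r(1:width),     rows, 'UniformOutput', false );
rest  = cellfun( @(r) r(width+1:end), rows, 'UniformOutput', false );

% The remainder of each row is still whitespace-separated numbers.
nums = cellfun( @str2num, rest, 'UniformOutput', false );
```

Rows shorter than the assumed width would raise an indexing error, which at least fails loudly rather than silently misreading the row.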
"Best way to parse text file?" My criteria:
  • return the content of the file in a way that is suitable for further processing
  • easy to understand a year from now
  • usable the next time there is a need to read this type of file
  • throw some kind of message if it fails to read a file (beware of automagic)
  • the function shall read and parse and nothing else. Selection and sorting are left to a following step
"Something like a struct for each state and then a sub-variable for each county" ...
>> sas = snow('dlysnfl.txt')
sas =
    AL: [1x23 struct]
    AR: [1x119 struct]
    AZ: [1x60 struct]
    CA: [1x103 struct]
    ...
>> sas.AL
ans =
1x23 struct array with fields:
    State
    Lat
    Lon
    COOP
    StnID
    Station
    County
    Elev
    data
>> is_MADISON = strcmp('MADISON',{sas.('AL').County});
>> cat( 1, sas.('AL')(is_MADISON).data )
ans =
  Columns 1 through 8
     0     0     0     0     1     0     0 -9999
     0     0     0     0     0     0     0     0
    ...
>> sas.('AL')(is_MADISON).Station
ans =
HUNTSVILLE INTL AP
ans =
OWENS CROSS ROADS 3S
where
function sas = snow( filespec )
    % Read the whole file into one string
    str = fileread( filespec );
    spc = '[ ]*';
    nls = '[\r\n]+';
    xpr = [ '(?<=State:)',spc,'(\w+)'   ... match "State:" and capture following
          , spc, nls                    ... group of letters
          , '([^\r\n]+)', nls           ... capture next line
          , '(.+?)(?=(State:|$))'       ... capture everything up to "State:" or EOF
          ];
    cac = regexp( str, xpr, 'tokens' );
    len = length( cac );
    sas = struct();     % defined even if no section is matched
    for jj = 1 : len
        % Split each row into a 98-character fixed-width part and the rest
        buf = textscan( cac{jj}{3}          ...
                      , '%98c%s'            ...
                      , 'Whitespace', ''    ...
                      , 'Delimiter' , '\n'  ...
                      );
        sub_len = size( buf{1}, 1 );
        tmp = repmat( {''}, 1,sub_len );
        sub = struct( 'State' ,tmp, 'Lat'  , [], 'Lon'    , [] ...
                    , 'COOP'  , [], 'StnID', '', 'Station', '' ...
                    , 'County', '', 'Elev' , [], 'data'   , [] );
        for ss = 1 : sub_len
            sub(ss).Lat     = str2num( buf{1}(ss, 1: 6) );  %#ok<*ST2NM>
            sub(ss).Lon     = str2num( buf{1}(ss, 7:14) );  %#ok<*AGROW>
            sub(ss).COOP    = str2num( buf{1}(ss,15:21) );
            sub(ss).StnID   = strtrim( buf{1}(ss,23:31) );
            sub(ss).State   = strtrim( buf{1}(ss,32:34) );
            sub(ss).Station = strtrim( buf{1}(ss,35:66) );
            sub(ss).County  = strtrim( buf{1}(ss,67:91) );
            sub(ss).Elev    = str2num( buf{1}(ss,92:98) );
            sub(ss).data    = str2num( buf{2}{ss} );
        end
        sas.(cac{jj}{1}) = sub;
    end
    % Sanity check: one parsed section per occurrence of "State:"
    if not( length(fieldnames(sas)) == length(strfind(str,'State:')) )
        warning(['The number of appearances of "State:": %d\n'          ...
                , 'does not agree with the number of parsed sections: %d'] ...
                , length(strfind(str,'State:')), length(fieldnames(sas)) )
    end
end
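To see how the regular expression in snow splits the text into sections, it can be tried on a tiny string. The sample text here is made up and is far simpler than the real file; only the expression itself is copied from the function.

```matlab
% Made-up two-section sample: "State:" line, one header line, data rows.
str = sprintf('State: AL\nheader AL\nrow1\nrow2\nState: AR\nheader AR\nrow3\n');

spc = '[ ]*';
nls = '[\r\n]+';
xpr = [ '(?<=State:)', spc, '(\w+)', spc, nls ...
      , '([^\r\n]+)', nls                     ...
      , '(.+?)(?=(State:|$))' ];

% One cell per section; inside it: state code, header line, data rows.
cac = regexp( str, xpr, 'tokens' );

codes  = cellfun( @(c) c{1}, cac, 'UniformOutput', false );
header = cellfun( @(c) c{2}, cac, 'UniformOutput', false );
body   = cellfun( @(c) strtrim(c{3}), cac, 'UniformOutput', false );
```

The body of each section is trimmed here because whether the lazy match keeps the final newline may depend on how $ is treated at the end of the string.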
Comments on the function:
  • the entire file is read into a string variable, str
  • each section of the file is split into three subsections with regexp. A section is the string between State: and State: (or end of string). Hopefully, State: doesn't appear elsewhere in the string.
  • loop over all states and for each state loop over all stations. Each row in the file ends up in one element of a structure array. All data are included.
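The selection step that the function leaves out could look like the following sketch, which flattens a structure shaped like the output of snow into the N x 3 table of state, county, and total snowfall asked for in the question. The small structure here is a hypothetical stand-in with made-up values, and treating -9999 as the only missing-value code is an assumption.

```matlab
% Hypothetical stand-in for the output of snow(); values are made up.
sas.AL = struct( 'County', {'MADISON','MADISON'} ...
               , 'data'  , {[0 1 -9999], [0 0 2]} );
sas.AR = struct( 'County', {'BENTON'}, 'data', {[3 -9999 1]} );

states = fieldnames( sas );
out = cell( 0, 3 );                 % columns: state, county, total snowfall
for k = 1 : numel( states )
    stn = sas.(states{k});
    for s = 1 : numel( stn )
        d = stn(s).data;
        total = sum( d(d ~= -9999) );   % skip missing days (assumed code)
        out(end+1,:) = { states{k}, stn(s).County, total }; %#ok<AGROW>
    end
end
```

This gives one row per station, so a county with several stations appears several times; summing per county would be a further step.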
