Best way to parse text file

Question

Jesse Program on 7 Apr 2015


Edited: per isakson on 14 Apr 2015

I am trying to write some code to parse large text files of weather data. The text file is broken into blocks, one per state. The data I'm working with is here

Within each block (state), the .txt file formatting is consistent and should be fairly straightforward to parse. However, each block has an inconsistent number of rows (the number of reporting weather stations varies across states and over time), and it is not clear to me how each block is delimited from the next. Ultimately, I'm trying to arrive at an N x 3 matrix, with columns for state, county, and total snowfall.

I'm not sure which functions are best for this purpose. Can someone point me in the right direction? I don't know where to start....

Thanks

5 Comments

Jesse Program on 10 Apr 2015

Hi Per,

I saw you posted some code earlier, but now it's gone?

Thanks

per isakson on 10 Apr 2015

Edited: per isakson on 10 Apr 2015

Yes, but there was a major "issue" and no time to fix it. Now I have posted new code.


Answer 1

per isakson on 10 Apr 2015


Edited: per isakson on 14 Apr 2015


NOAA's National Climatic Data Center (NCDC) (and others) publish important data in text files, which cannot be easily read with Matlab. Should I blame NOAA or The Mathworks?

OP provided a sample file, which contains weather data for March 2015. I have counted columns and positions, and made mistakes along the way. Surely NOAA provides a description of the file format, but I didn't find it. It would have helped me write better code.

The file has fixed width columns and the values are padded with spaces.
The first column contains numerical data with leading spaces.
Missing numerical data are replaced by "-9999" and missing string data are replaced by spaces.
The sections have different numbers of columns.
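
To make the missing-value convention concrete, here is a minimal sketch of turning one space-padded row of daily values into numbers and masking the -9999 marker. The sample row is invented for illustration and is not taken from the NOAA file.

```matlab
% Hypothetical row of daily values, padded with spaces (not real NOAA data).
row = '    0    0    3 -9999    1';

% sscanf reads whitespace-separated integers and ignores the padding.
vals = sscanf( row, '%d' ).';       % 1x5 row vector of doubles

% Replace the missing-data marker with NaN so it cannot leak into sums.
vals( vals == -9999 ) = NaN;

% Sum only the days that were actually reported.
total = sum( vals( ~isnan(vals) ) );
```

Masking before any arithmetic matters here, because a single -9999 left in place would swamp a snowfall total.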

The Matlab documentation isn't enough to write the code. One has to test and accept that Matlab will have the final word. Example: My interpretation of the documentation (R2013b) made me believe that

    textscan( str, '%98c%*[^\n]' )

reads the first 98 characters of a string. However, that depends on whether there are leading spaces.

    textscan( str, '%98c%*[^\n]', 'Whitespace','' )

does indeed read the first 98 characters of a string. But then I cannot use the spaces in the second part of the row. Issues like this one lead me down a trial-and-error path, which certainly does not end in the best code.
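
A possible alternative that avoids textscan's whitespace handling altogether is to split the text into rows and index the columns by position. This is only a sketch: the two rows and their column widths are invented, and it is not the approach used in the function further down.

```matlab
% Two hypothetical fixed-width rows; the layout is made up for this sketch:
% cols 1-5 latitude, 7-8 state, 10-18 county, 19-end two data fields.
fmt   = '%5.1f %-2s %-9s%5d%6d';
block = sprintf( '%s\n%s\n', sprintf( fmt, 34.5, 'AL', 'MADISON'  , 0,     1 ) ...
                           , sprintf( fmt, 33.1, 'AL', 'JEFFERSON', 2, -9999 ) );

% Split on line breaks and drop the empty piece after the last newline.
rows = regexp( block, '\r?\n', 'split' );
rows = rows( ~cellfun( @isempty, rows ) );

% char() pads the rows to equal length, so leading spaces are preserved
% and every column can be addressed by position.
txt = char( rows );

lat    = str2double( cellstr( txt(:, 1: 5) ) );   % leading spaces are ignored
county = strtrim(    cellstr( txt(:,10:18) ) );   % padding removed

% The remaining columns hold whitespace-separated integers, two per row.
data = sscanf( txt(:,19:end).', '%d', [2 Inf] ).';
```

Because nothing is skipped while reading, the column positions stay valid whether or not a row starts with spaces.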


"Best way to parse text file?" My criteria:

return the content of the file in a way that is suitable for further processing
easy to understand a year from now
usable the next time there is a need to read this type of file
throw some kind of message if it fails to read a file (beware of automagic)
the function shall read and parse and nothing else. Selection and sorting is left to a following step

"Something like a struct for each state and then a sub-variable for each county" ...

    >> sas = snow('dlysnfl.txt')
    sas = 
        AL: [1x23 struct]
        AR: [1x119 struct]
        AZ: [1x60 struct]
        CA: [1x103 struct]
        ...
    >> sas.AL
    ans = 
    1x23 struct array with fields:
        State
        Lat
        Lon
        COOP
        StnID
        Station
        County
        Elev
        data
    >> is_MADISON = strcmp('MADISON',{sas.('AL').County});
    >> cat( 1, sas.('AL')(is_MADISON).data )
    ans =
      Columns 1 through 8
        0    0    0    0    1    0    0 -9999
        0    0    0    0    0    0    0     0
        ...
    >> sas.('AL')(is_MADISON).Station
    ans =
    HUNTSVILLE INTL AP
    ans =
    OWENS CROSS ROADS 3S

where

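    function    sas = snow( filespec )
    % SNOW  Read a fixed-width weather file into a struct with one field per state.
        str = fileread( filespec );
        sas = struct();     % defined even if no "State:" section is matched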
        spc = '[ ]*';
        nls = '[\r\n]+';
        xpr = [ '(?<=State:)',spc,'(\w+)' ... match "State:" and capture following  
            ,   spc, nls                  ... group of letters  
            ,   '([^\r\n]+)', nls         ... capture next line
            ,   '(.+?)(?=(State:|$))'     ... capture everything up to "State:" or EOF
            ];
        cac = regexp( str, xpr, 'tokens' );
        len = length( cac );
        for jj = 1 : len
            buf = textscan( cac{jj}{3}          ...
                        ,   '%98c%s'            ...
                        ,   'Whitespace', ''    ...
                        ,   'Delimiter' , '\n'  ...
                        ); 
            sub_len = size( buf{1}, 1 ); 
            tmp = repmat( {''}, 1,sub_len );    
            sub = struct(  'State' ,tmp, 'Lat'  , [], 'Lon'    , []  ...
                         , 'COOP'  , [], 'StnID', '', 'Station', ''  ...
                         , 'County', '', 'Elev' , [], 'data'   , []  );
            for ss = 1 : sub_len
                sub(ss).Lat     = str2num( buf{1}(ss, 1: 6) );  %#ok<*ST2NM>
                sub(ss).Lon     = str2num( buf{1}(ss, 7:14) );  %#ok<*AGROW>
                sub(ss).COOP    = str2num( buf{1}(ss,15:21) );
                sub(ss).StnID   = strtrim( buf{1}(ss,23:31) );
                sub(ss).State   = strtrim( buf{1}(ss,32:34) );
                sub(ss).Station = strtrim( buf{1}(ss,35:66) );
                sub(ss).County  = strtrim( buf{1}(ss,67:91) );
                sub(ss).Elev    = str2num( buf{1}(ss,92:98) ); 
                sub(ss).data    = str2num( buf{2}{ss} );
            end
            sas.(cac{jj}{1}) = sub;
        end
        if not( length(fieldnames(sas)) == length(strfind(str,'State:')) )
            warning(['The number of appearances of "State:": %d\n'              ...
                ,    'does not agree with the number of parsed sections: %d']   ...
                ,    length(strfind(str,'State:')), length(fieldnames(sas))     )
        end
    end
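
Since the function only reads and parses, the selection step asked for in the question (state, county, total snowfall) would be a separate piece of code. A minimal sketch, assuming -9999 is the only missing-data marker and using a small hand-built struct with invented values in place of the real output of snow(). Because state and county are text, the N x 3 result is a cell array rather than a numeric matrix.

```matlab
% Hand-built stand-in for the output of snow(); all values are invented.
sas.AL = struct( 'County', {'MADISON','MADISON','JEFFERSON'} ...
               , 'data'  , {[0 1 -9999], [2 0 0], [-9999 -9999 5]} );
sas.AR = struct( 'County', {'PULASKI'}, 'data', {[1 1 1]} );

result = cell( 0, 3 );                        % columns: state, county, total
states = fieldnames( sas );
for ii = 1 : numel( states )
    stn      = sas.(states{ii});
    counties = unique( {stn.County} );        % sorted, duplicates removed
    for cc = 1 : numel( counties )
        in_county = strcmp( counties{cc}, {stn.County} );
        vals      = [ stn(in_county).data ];  % all stations, all days
        vals      = vals( vals ~= -9999 );    % drop the missing-data marker
        result(end+1,:) = { states{ii}, counties{cc}, sum(vals) }; %#ok<AGROW>
    end
end
```

Keeping this step outside the parser means the same parsed struct can later be summarized per station or per day without touching the reading code.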

Comments on the function:

the entire file is read into a string variable, str
each section of the file is split into three subsections with regexp. A section is the string between State: and State: (or end of string). Hopefully, State: doesn't appear elsewhere in the string.
loop over all states and for each state loop over all stations. Each row in the file ends up in one element of a structure array. All data are included.
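
The sectioning step can be tried in isolation on a tiny string. The sample text below is invented; only the "State:" markers mirror the real file, and the expression is the one from the function above.

```matlab
% Invented two-section sample text.
str = sprintf( [ 'State: AL\n' ...
                 'header line\n' ...
                 'row 1\nrow 2\n' ...
                 'State: AR\n' ...
                 'header line\n' ...
                 'row 3\n' ] );

xpr = [ '(?<=State:)[ ]*(\w+)[ ]*[\r\n]+' ... token 1: state code
      , '([^\r\n]+)[\r\n]+'               ... token 2: header line
      , '(.+?)(?=(State:|$))' ];          %   token 3: body up to next "State:" or end
cac = regexp( str, xpr, 'tokens' );

% State codes, and the number of rows found in each section body.
names = cellfun( @(c) c{1}, cac, 'UniformOutput', false );
nrows = cellfun( @(c) numel( regexp( strtrim(c{3}), '[\r\n]+', 'split' ) ), cac );
```

Comparing numel(cac) with the number of "State:" occurrences, as the function does at the end, is a cheap way to notice a section that the expression failed to match.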

1 Comment

Jesse Program on 13 Apr 2015

Very helpful, thanks
