How can I extract relevant data from a text file using specific strings to identify relevant blocks of data?

I am attempting to extract data in the form of mean estimates and complementary cumulative distribution functions (CCDFs) for specific variables. The data reside in text files generated as output from a simulation code.
Each output text file contains several thousand records, many of which are not of interest. I need to be able to:
1. Search through the text file to identify specific records based on record type. Each record begins with a line that has the string 'BOR'. The second line specifies the record type (e.g., '0006 STAT_RESULTS' or '0007 CCDF_RESULTS' - which are the only record types I care about for my analysis).
2. Within the two record types of interest:
(1) Identify specific combinations of variables and associated qualifiers. The third line in the record specifies the variable as a tab-delimited string of two columns, in the form '12 :Health Effect Cases'. The fourth through nth lines identify the qualifiers associated with that variable (e.g., boundaries of spatial intervals, or values of other variables on which the variable depends). Each qualifier line is a tab-delimited string of three columns, in the form '2 "0.000000E+00" :QualifierDesc "Ldistance (mi)" QualifierValue;', where the first column (an integer) is the qualifier ID, the second column (the value in quotes) is the qualifier value, and the double-quoted string in the third column is the qualifier description. For example, I might want to identify the health effect cases predicted to occur in the spatial interval between 0 and 10 miles from a facility.
(2) Extract the relevant numeric results into MATLAB objects for subsequent manipulation and visualization. For example, in the '0006 STAT_RESULTS' record type, the mean value I want is exactly 3 lines below the nth line (the line describing the last qualifier); that line is a tab-delimited string of two columns, in the form '0 :Mean'. Likewise, in the '0007 CCDF_RESULTS' record type, the CCDF data occupy lines 1 through 46 below the nth line; each of the 46 lines is a tab-delimited string of three columns, in the form '1.00E-09 0.94579676 :X(1,1)X(1,2)', where the first two columns are one x,y pair of the CCDF.
Any help on how to perform these tasks would be much appreciated.
  2 Comments
Image Analyst on 30 May 2015
You forgot to attach your data file. You can attach a shortened version of it if you want. Otherwise, probably no one will try anything for you.
Dan Hudson on 30 May 2015
I originally did not attach the data file because it is so large (there are 170,000 lines of text), and hoped my description would be sufficient.
I have attached a modified version in which I deleted the vast majority of data and made changes to the results that are included. The intent is to illustrate the format of the data that I am not interested in and the format of the data that I am interested in. The last two records in the attachment provide illustrative (though modified) examples of the types of data that I need to extract for analysis.


Answers (1)

Walter Roberson on 30 May 2015
If you use textscan() then there is at least one trick:
If you can identify something unique about the first of each kind of line that you want to skip, and something unique about the last of each kind of line that you want to skip, then you can use those unique strings as a pair in a CommentStyle parameter. This does not work in every case, and it works best if the unique string is at the beginning of the line.
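To illustrate the CommentStyle pair idea, here is a minimal sketch. The markers SKIP_BEGIN and SKIP_END and the sample text are made up for illustration; a real file would need strings that actually open and close the unwanted blocks, and whether the trick applies depends on the file layout.

```matlab
% Made-up sample text: two numeric rows separated by an unwanted block.
% SKIP_BEGIN and SKIP_END are hypothetical markers, not strings from the real file.
txt = sprintf('%s\n', ...
    '1.00E-09 0.94579676', ...
    'SKIP_BEGIN unwanted header', ...
    'more unwanted text', ...
    'SKIP_END', ...
    '2.00E-09 0.91234567');

% A two-element cell array asks textscan to ignore the text between the markers.
C = textscan(txt, '%f %f', 'CommentStyle', {'SKIP_BEGIN', 'SKIP_END'});
x = C{1};   % first numeric column
y = C{2};   % second numeric column
```

With a file on disk, the same call would take a file identifier from fopen in place of txt.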
  2 Comments
Walter Roberson on 30 May 2015
If this is something that you need to do to a lot of files, then I would use completely different tools. I would use Lex and Yacc, which build C code that has really efficient Finite State Machines that even tolerate temporary ambiguity in state and nested definitions.
If it were fewer files, and if the parsing isn't quite as bad, then I'd probably program it in Perl, which can be called from MATLAB.
Often a good way to start for these kinds of tasks is to write a BNF (Backus-Naur Form) grammar, which involves building a structured representation of what goes into each different record type. When you have a structured representation it is much easier to write code that parses it correctly.
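As a rough illustration of going from a grammar to code, the sketch below writes a BNF-style outline as comments and then walks the lines with a small hand-written state machine in MATLAB (not Lex/Yacc or Perl). Everything here is an assumption based only on the description in the question: the sample lines, the value 3.5E-02, and the OTHER_RECORD type are made up, and the mean and CCDF lines are recognized by their ':Mean' and ':X(' labels rather than by fixed line offsets.

```matlab
% Rough grammar, inferred from the question text only (an assumption):
%   <record>    ::= <bor> <type> <variable> <qualifier>* <body>
%   <bor>       ::= line starting with 'BOR'
%   <type>      ::= 4-digit code, space, type name   (e.g. 0006 STAT_RESULTS)
%   <variable>  ::= id TAB ':' name
%   <qualifier> ::= id TAB '"' value '"' TAB ':QualifierDesc' '"' desc '"' ...
%   <body>      ::= a ':Mean' line (STAT) or ':X(' pair lines (CCDF)

% Made-up sample lines that imitate the layout; a real file will differ.
lines = { ...
    'BOR', '0001 OTHER_RECORD', 'ignored text', ...
    'BOR', '0006 STAT_RESULTS', sprintf('12\t:Health Effect Cases'), ...
    sprintf('2\t"0.000000E+00"\t:QualifierDesc "Ldistance (mi)" QualifierValue;'), ...
    sprintf('3.5E-02\t:Mean'), ...
    'BOR', '0007 CCDF_RESULTS', sprintf('12\t:Health Effect Cases'), ...
    sprintf('1.00E-09\t0.94579676\t:X(1,1)X(1,2)'), ...
    sprintf('1.00E-08\t0.90000000\t:X(1,1)X(1,2)')};

records = struct('type', {}, 'variable', {}, 'qualifiers', {}, ...
    'mean', {}, 'ccdf', {});
state = 'seek';                        % seek -> type -> variable -> body
for k = 1:numel(lines)
    ln = strtrim(lines{k});
    if strncmp(ln, 'BOR', 3)           % a new record starts, whatever the state
        state = 'type';
        continue
    end
    switch state
        case 'type'
            if ~isempty(regexp(ln, '^\d{4}\s+(STAT|CCDF)_RESULTS', 'once'))
                records(end+1) = struct('type', ln, 'variable', '', ...
                    'qualifiers', {{}}, 'mean', NaN, 'ccdf', zeros(0, 2));
                state = 'variable';
            else
                state = 'seek';        % record type not of interest, skip it
            end
        case 'variable'
            tok = regexp(ln, '^\d+\s*:(.*)$', 'tokens', 'once');
            if ~isempty(tok)
                records(end).variable = tok{1};
            end
            state = 'body';
        case 'body'
            q = regexp(ln, ...
                '^(\d+)\s+"([^"]*)"\s+:QualifierDesc\s+"([^"]*)"', ...
                'tokens', 'once');
            if ~isempty(q)             % {id, value, description}
                records(end).qualifiers{end+1} = q;
            elseif ~isempty(regexp(ln, ':Mean$', 'once'))
                records(end).mean = sscanf(ln, '%f', 1);
            elseif ~isempty(strfind(ln, ':X('))
                records(end).ccdf(end+1, :) = sscanf(ln, '%f', 2).';
            end
    end
end
```

After the loop, records holds one struct per STAT_RESULTS or CCDF_RESULTS record, so selecting a variable and qualifier combination becomes a matter of filtering on the variable and qualifiers fields. For a real file, the lines cell array would come from reading the file line by line (for example with fgetl in a loop).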
Dan Hudson on 30 May 2015
Walter,
Thanks so much for taking the time to respond to my question. I will have approximately 40 files, each consisting of approximately 170,000 lines of data, that I will need to do this for - and that is not accounting for the need to perform additional analyses with model changes and sensitivity analyses.
I will take your advice and work on developing a BNF as a useful starting point.
Thanks again!
