How to parse poorly formatted txt file?

Hello,
I'm working with modeling software that outputs data in badly formatted .txt files; a screenshot of the output is below. Each output file contains 10,000 data blocks, each of which begins with the highlighted "1tally" string and ends several lines later with a decimal number.
Ideally I need to be able to pull the 1tally string, the number following it (14 and 24 in the picture), and the remaining two values in each data set (6.47566E-07 0.0187 and 6.93514E-07 0.0181). I've tried using textscan options to locate the 1tally string, but I'm not familiar enough with MATLAB to write a loop that keeps seeking the remaining 9,999 data blocks. I can't use the 'HeaderLines' option because the entries are not on the same row in every file, and even within a single file the number of rows between data blocks varies anywhere between 1 and 500.
Any help or advice would be greatly appreciated.
Edit: I can't post the full output file, but I've attached a shortened version. The formatting is the same as what I need the code for. The only difference would be the number of rows between the start of the file and the first occurrence of 1tally.

2 Comments

@Jess: can you please edit your question and upload a sample file by clicking the paperclip button above the textbox. Then we can test code on a real file.
@Stephen Cobeldick: thanks for the advice, I've added an example of the file.


 Accepted Answer

I would probably use fileread() to read the entire file into a string, and then I would probably use regexp() with named tokens. It might not be bad... something like
regexp(S, '(?<=1tally\s+)(?<tallyno>\d+)(?:.*?)(?<last2>\S+)(?:\s+)(?<last1>\S+)(?=\s+=)', 'names')
This looks for 1tally followed by whitespace, then puts the decimal digits that follow into the field 'tallyno'. Then it skips as few characters as possible to satisfy what comes after. Then it captures a run of non-whitespace characters into a field named 'last2', skips whitespace, and captures another run of non-whitespace characters into a field named 'last1'. After that it requires whitespace followed by an "=" (checked with a lookahead, so those characters are not consumed).
I would need an extract of the file to test the expression to be certain.

10 Comments

In addition to what Walter suggested, I'd contact the author of the program that generates the poorly formatted file and ask them:
  1. If they have a function that is designed to read in the data in the format the program produces, and/or
  2. If they can modify the program to regularize the format, and/or
  3. If there's an option of which you are unaware that can regularize the format (possibly changing from a plain text file to something like XML.)
Some applications can output MATLAB ‘.mat’ files as well.
Jess on 30 Mar 2017
Edited: Jess on 30 Mar 2017
@Walter: Thanks, I'll play around with that and see if it makes a difference.
@Steven Lord: Unfortunately those options aren't really viable here. I'm using the MCNP6 code from Los Alamos. The group developing the code is well aware of how ugly and user-unfriendly the output files are, but they have no interest in changing a format that is decades old.
There are user-built macros in Excel and other codes (like MATLAB programs) for data processing. These tend to be specific to the user's problem, and are set up for a different text file than the one I'm trying to work with.
@Walter: I've been testing out the regexp option you posted, and it worked perfectly for the shortened files. After 30 minutes I'm still waiting for it to finish going through a complete file, but after S = fileread('output') MATLAB says S is < 1x59912805 char >, so I'm not surprised.
If you use
t = regexp(S, {'(?<=1tally\s+)(?<tallyno>\d+)', '(?<=^\s+cell.*$\r?\n)(?:\s+)(?<last2>\S+)(?:\s+)(?<last1>\S+)(?=\s+=)'}, 'tokens', 'lineanchors', 'dotexceptnewline');
t2 = [vertcat(t{1}{:}),vertcat(t{2}{:})];
then it might possibly be faster. The result will be an N x 3 cell array of strings, not a struct array.
The reason it might be faster is that in this version .* is confined to the same line and you look specifically for lines beginning with space followed by "cell"; these two together should cut down a lot on the amount of back-tracking the expressions need to do.
Thanks Walter, I'll give that a try.
I was also experimenting earlier with the number of data sets present in the output file, and it looks like the command you provided works perfectly until the output file contains around 1000 data sets or more. Once it reaches this point, MATLAB seems to get stuck in the "Busy" status and can't finish. Is there something else I could do to fix the issue?
The only thing that comes to mind about 1000 or so is to wonder whether the format changes slightly in that data, resulting in an extended need to backtrack. If backtracking is the problem, then I would expect my second version to work better.
I would suggest that you consider pre-parsing the data. For example if you have awk (OS-X or Linux) then
awk '/^1tally.*[0-9]/ {print $2}; /^ cell/ {getline; print $1; print $2}' ArrayData.txt
This would produce an output of
14
6.47566E-07
0.0187
24
6.93514E-07
0.0181
34
7.03193E-07
0.0180
The space between parts is not deliberate; there is a carriage return in there because you happen to have CR in your input. I had difficulty getting awk to handle CR.
Much the same thing can be done with perl, which is always installed with MATLAB.
awk and perl should be very fast at processing text like this. Then MATLAB should be able to handle the resulting text easily.
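On the carriage-return issue mentioned above: one way to deal with it (a minimal sketch, not tested against a real MCNP6 output file) is to strip a trailing CR from each record before the fields are used. The sample file below is made up for illustration and only mimics the layout described in this thread (a line starting with 1tally, later a line starting with " cell", then the two values on the next line); the file names sample_crlf.txt and parsed.txt are placeholders.

```shell
# Build a tiny made-up sample with Windows (CRLF) line endings.
# The layout is an assumption based on the description in this thread.
printf '1tally       14        nps =   1000\r\n' >  sample_crlf.txt
printf ' some other line\r\n'                    >> sample_crlf.txt
printf ' cell  10\r\n'                           >> sample_crlf.txt
printf '      6.47566E-07 0.0187\r\n'            >> sample_crlf.txt

# Same rules as the one-liner above, plus sub() calls that delete a trailing CR.
# sub() on the whole record makes awk re-split the fields, so $1 and $2 are clean.
awk '
  { sub(/\r$/, "") }
  /^1tally.*[0-9]/ { print $2 }
  /^ cell/ { if ((getline) > 0) { sub(/\r$/, ""); print $1; print $2 } }
' sample_crlf.txt > parsed.txt

cat parsed.txt
```

For this sample the result is 14, 6.47566E-07 and 0.0187 on three separate lines, with no CR left at the end of any line. The second sub() is needed because getline replaces the current record with the next raw line.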
Jess on 3 Apr 2017
Edited: Jess on 3 Apr 2017
Alright, I'm back again! I really do appreciate all of your help with this.
I installed GnuWin32 so I could just use the awk command and have been working with what you provided. I'm working on Windows 7, but can't get the command to move beyond printing the tally numbers. Removing the apostrophes, I can get your line to run, but only if I remove '{print $2}; /^ cell/ {getline; print $1; print $2}'. If I don't do this, the command line prints an error saying: "fatal: cannot open file '{print' for reading."
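The Windows command prompt does not treat apostrophes as quote characters, so the program text gets split into separate arguments. Try creating a file parsemcn.awk with content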
/^1tally.*[0-9]/ {print $2};
/^ cell/ {getline; print $1; print $2}
Then
gawk -f parsemcn.awk ArrayData.txt > SomeOutputFile.txt
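If one row per tally is easier to load afterwards, the same script-file approach can be adapted. This is only a sketch under the same assumptions about the layout; parsemcn_rows.awk, sample.txt and rows.txt are hypothetical names, and the sample data is made up (the values are taken from the output shown earlier).

```shell
# Made-up sample: two tally blocks separated by a varying number of other lines.
cat > sample.txt <<'EOF'
1tally       14        nps =   1000
 tally type 4
 cell  10
      6.47566E-07 0.0187
 unrelated line
1tally       24        nps =   1000
 cell  20
      6.93514E-07 0.0181
EOF

# Script file: remember the tally number, then print it together with the
# two values found on the line after " cell".
cat > parsemcn_rows.awk <<'EOF'
/^1tally.*[0-9]/ { tally = $2 }
/^ cell/ { if ((getline) > 0) print tally, $1, $2 }
EOF

awk -f parsemcn_rows.awk sample.txt > rows.txt
cat rows.txt
```

Each output row then holds the tally number, the value, and the relative error separated by single spaces. The heredocs are just a convenient way to create the files in a Unix-style shell; on Windows the script file can be written in any text editor and run with gawk -f as above.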
That did the trick! You have no idea just how much time you've saved me with your help and advice!


More Answers (0)

Asked: 29 Mar 2017

Commented: 4 Apr 2017
