MATLAB Answers

How do I parse this complex text file with textscan?

psprinks on 16 Nov 2016
Edited: per isakson on 30 Jan 2021
I have a text file that is in a rather funky format. The file comes out of a relational database (Antelope) and consists of earthquake location, dates, times, phase information, etc. I need to parse out and collect the 'data blocks' that are in between each header line. I need the header lines as well for each "block". I have edited the file to include an EOB (end of block) marker to make this task easier, but it's not as trivial as I thought. Here's an image of the first 68 or so lines (out of about 1 million).
I'd like to pull the 4 columns below each header.... for example the first section is:
2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1
AXCC1 0.843 1.00 P
AXAS2 1.263 1.00 P
AXEC1 0.923 1.00 P
AXEC2 1.103 1.00 P
AXEC3 1.088 1.00 P
AXCC1 1.873 0.25 S
AXAS1 2.728 0.06 S
AXAS2 2.168 0.25 S
AXEC1 1.708 0.33 S
AXEC2 2.043 0.25 S
AXEC3 2.113 0.25 S
and put those in an array. But I need to be able to associate the header line, specifically the last integer in the header line (1 in this case), with each code block.
So far my code looks like this, but obviously it is not working yet. I don't get any errors but it's missing and skipping data etc.
while (~feof(fid))
    InputText=textscan(fid, FormatString, 'delimiter','WhiteSpace','CollectOutput',1);
    Data{Block,1} = cell2mat(InputText{2});
    [NumRows,NumCols] = size(Data{Block});
    Block=Block +1;
end
Can anyone offer any suggestions? Let me know if I need to clarify anything further.
per isakson on 19 Nov 2016
"I'd like to pull the 4 columns below each header" Your script doesn't extract the third column. And what is the intent of MATDAY_ARV=datenum(...)?

Accepted Answer

per isakson on 18 Nov 2016
Edited: per isakson on 30 Jan 2021
Assumptions
  • Speed is important - "any ideas on faster method?"
  • The text file fits in memory - "The entire file is about 23 MB."
  • The station names are exactly five characters - "5" appears in the code as a magic number
  • The value of PHA is exactly one character
  • The line separator is "\n", i.e. char(10)
  • The header lines begin with 2014, 2015, 2016 or 2017 (and are the only lines to begin so).
Approach
  • Read the entire file into a character string.
  • Split the string into a cell array of strings, with one block in each cell
  • Pre-allocate output variables based on the size of the string and the cell array
  • Loop over all blocks and parse one block at a time
I tested with community_edit_2.txt, which is community_edit.txt with the # removed.
STA and PHA are character arrays rather than cell arrays of strings. That's somewhat faster
function [ ORG, ARV, STA, PHA, EVD ] = cssm( filespec )
    str = fileread( filespec );
    % One match per block: a header line (begins with 2014..2017) and its data lines
    xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';
    blocks = regexp( str, xpr, 'match' );
    nnl = length( strfind( str, char(10) ) );   % number of line separators
    len = length( blocks );                     % number of blocks (headers)
    % Pre-allocate; N leaves room for every non-header line
    ORG = nan(len,14);
    N = nnl - len + 1;
    STA = repmat( '-', [N,5] );
    ARV = nan(N,1);
    PHA = repmat( '-', [N,1] );
    EVD = nan(N,1);
    nextORG = 1;
    nextSTA = 1;
    for cac = blocks
        S0 = regexp( cac{1}, '\n', 'split', 'once' );   % {header line, data lines}
        S1 = textscan( S0{1}, '%f%f%f%f%f%f%f%f%f%f%f%f%f%f' ...
                     , 'CollectOutput',true );
        ORG( nextORG, : ) = S1{1};
        MATDAY_ARV = datenum( S1{1}(1:6) ); %#ok<NASGU>
        nextORG = nextORG + 1;
        S2 = textscan( S0{2}, '%5c%f%f%1c' );
        N2 = size( S2{1}, 1 );
        STA( nextSTA:nextSTA+N2-1, : ) = S2{1};
        ARV( nextSTA:nextSTA+N2-1, 1 ) = S2{2};
        PHA( nextSTA:nextSTA+N2-1, 1 ) = S2{4};
        EVD( nextSTA:nextSTA+N2-1, : ) = S1{1}(end);   % last integer of the header
        nextSTA = nextSTA + N2;
    end
    if N >= nextSTA % truncate the "memory", which isn't used.
        STA( STA == '-' ) = [];   % assumes no station name contains '-'
        STA = reshape( STA, [],5 );
        ARV( nextSTA : end ) = [];
        PHA( nextSTA : end ) = [];
        EVD( nextSTA : end ) = [];
    end
end
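A small sketch of how these outputs might be used afterwards, since the goal in the question was to tie each data row to the last integer of its header. The arrays below are hand-made stand-ins for what the function returns (the third row and event id 2 are invented for illustration), not real output:

```matlab
% Hand-made stand-ins for the outputs of cssm (illustrative values only).
STA = ['AXCC1'; 'AXAS2'; 'AXEC1'];   % N-by-5 char array, one station per row
ARV = [0.843; 1.263; 0.500];         % second column of the data lines
PHA = ['P'; 'P'; 'S'];               % N-by-1 char array
EVD = [1; 1; 2];                     % last integer of the header, repeated per row

% All rows that belong to the block whose header ends in 1
inEvent1 = (EVD == 1);
sta1 = cellstr( STA(inEvent1, :) );  % convert to a cell array only when needed
arv1 = ARV(inEvent1);

% P arrivals at one station, across all blocks
isP = (PHA == 'P');
isAXCC1 = strcmp( cellstr(STA), 'AXCC1' );
pArrivalsAXCC1 = ARV(isP & isAXCC1);
```

Keeping STA and PHA as char arrays and calling cellstr only for the comparison keeps the bulk storage compact; whether that matters depends on the file size.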
Error handling: This function has no error handling beyond MATLAB's own, e.g. fileread will report if the text file is missing. If the function is intended for routine use, it's important to handle errors, especially those caused by unexpected character strings in the input file.
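As a hedged illustration of the kind of check that could be added per block: the sketch below validates one block before its values are stored. The error identifiers (cssm:badHeader, cssm:badData) and the choice of checks are assumptions for illustration, not part of the function above, and it uses '%s' rather than '%5c' for the station name to keep the example simple.

```matlab
% One block as it would come out of the split (header line + data lines).
blockText = sprintf([ ...
    '2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1\n' ...
    'AXCC1 0.843 1.00 P\n' ...
    'AXAS2 1.263 1.00 P\n']);
parts = regexp(blockText, '\n', 'split', 'once');   % {header, data lines}

% Header: expect exactly 14 numbers; sscanf stops at the first token it cannot convert.
hdr = sscanf(parts{1}, '%f');
if numel(hdr) ~= 14
    error('cssm:badHeader', 'Expected 14 numbers in header, got %d: "%s"', ...
          numel(hdr), parts{1});
end

% Data rows: every column should end up with the same, non-zero number of entries.
C = textscan(parts{2}, '%s %f %f %s');
n = cellfun(@numel, C);
if any(n ~= n(1)) || n(1) == 0
    error('cssm:badData', 'Malformed data rows after header: "%s"', parts{1});
end
```

Including the offending header line in the message makes it easier to find the bad block in a file with a million lines.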
2016-11-18, Performance test
  • Computer: eight year old vanilla desktop with 8GB RAM.
  • System: Windows7,64bit, Matlab R2016a,64bit
  • Test file: community_edit_1M.txt is 27.6MB, 95200 blocks, 1097181 lines. It's created by concatenating copies of community_edit.txt and removing the #.
>> filespec = 'h:\m\cssm\community_edit_1M.txt';
>> tic,[ORG0,ARV0,STA0,PHA0,EVD0] = cssm( filespec ); toc
Elapsed time is 22.443859 seconds.
Caveat: The text file was probably available in the system cache, since this was not cleared before the test.
Comparison: This is about 4.5 times faster than the function asd (from Jan's answer):
>> filespec = 'h:\m\cssm\community_edit_1M_EOB.txt';
>> tic, [Data, HeaderLines] = asd( filespec ); toc
Elapsed time is 101.202009 seconds.
per isakson on 22 Nov 2016
I'm glad the function is useful and will be used!
You had already spotted the expression with the bug: "the characters 2014, 2015, 2016 or 2017 appear in the header line". I could have spared myself the details in my last comment. However, I was busy describing a complete debugging session, hopefully to the benefit of some other reader.

More Answers (2)

Jan on 17 Nov 2016
Edited: Jan on 18 Nov 2016
fscanf might be easier than textscan:
[EDITED: bugs removed]
function [Data, HeaderLines] = asd(FileName)
fid = fopen(FileName, 'r');
if fid == -1
    error('Cannot open file: %s', FileName);
end
maxBlocks = 10000; % Is this sufficient? Better too large.
HeaderLines = cell(1, maxBlocks);
Data = cell(1, maxBlocks);
iBlock = 0;
aBlock = cell(1, 20); % Or largest number of lines per block
while ~feof(fid)
    iBlock = iBlock + 1;
    Line = fgetl(fid);
    if ~ischar(Line)          % End of file reached, no header was read
        iBlock = iBlock - 1;
        break;
    end
    HeaderLines{iBlock} = Line;
    isEOB = false;
    iData = 0;
    while ~isEOB && ~feof(fid)
        Line = fgetl(fid);
        if ~ischar(Line) || strncmp(Line, 'EOB', 3)
            isEOB = true;
        else
            iData = iData + 1;
            len = length(Line);
            % Station name, then two numbers, then the phase
            [s1, num, err, ind1] = sscanf(Line, '%s', 1);
            [f, num, err, ind2] = sscanf(Line(ind1:len), '%f', 2);
            s2 = sscanf(Line(ind1+ind2:len), '%s');
            aBlock{iData} = {s1, f(1), f(2), s2};
            % Parse = textscan(Line, ' %s %f %f %s');
            % aBlock{iData} = {Parse{1}{1}, Parse{2:3}, Parse{4}{1}};
        end
    end
    Data{iBlock} = aBlock(1:iData); % Crop the data block
end
fclose(fid);
Data = Data(1:iBlock);
HeaderLines = strtrim(HeaderLines(1:iBlock));
end
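A short sketch of what one might do with the result, assuming the layout the function builds: each Data{k} is a cell of rows, and each row is a 4-element cell {station, number, number, phase}. The variables below are hand-made stand-ins for HeaderLines{k} and Data{k}, and the name col3 is just a placeholder for the third column:

```matlab
% Hand-made stand-ins for one header line and one parsed block.
headerLine = '2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1';
blk = {{'AXCC1', 0.843, 1.00, 'P'}, {'AXAS2', 1.263, 1.00, 'P'}};

% The last integer of the header identifies the block.
hdr  = sscanf(headerLine, '%f');   % column vector of the header numbers
evid = hdr(end);

% Flatten the nested cells into an N-by-4 cell, then pull out typed columns.
rows = vertcat(blk{:});
sta  = rows(:, 1);                 % N-by-1 cell of station names
arv  = cell2mat(rows(:, 2));       % N-by-1 double
col3 = cell2mat(rows(:, 3));       % N-by-1 double
pha  = char(rows(:, 4));           % N-by-1 char
```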
Jan on 9 Nov 2017
@FishermanJack: This would be very inefficient. I could spend hours mentioning all the details I know about these lines of code. Most of the commands are trivial, and I cannot guess which ones are unclear to you. So it is better to use the debugger to step through the code line by line, see what happens in which order, and read the documentation of any commands that are not clear. If any details remain unclear afterwards, ask a specific question.

dpb on 18 Nov 2016
Edited: dpb on 18 Nov 2016
OK, for your file I used a grep utility first to find the EOB markers and then computed the numbers for each group...within Matlab it looked like--
>> cmd='grep -n EOB community_edit.txt >blocks.txt';
>> system(cmd);
>> eob=textread('blocks.txt','%d:EOB');
>> neob=diff([0;eob])-2;
>> neob(1:10)' % see if looks ok...
ans =
11 10 11 12 10 12 13 12 8 8
That agrees with the number I get counting in editor.
Now, with that, read the first header and block, then repeat for the remaining blocks 2:length(neob), each of which starts with one extra line to skip before its header (the EOB marker, which the first group doesn't have in front of it).
fmt1=repmat('%f',1,14); % header line
fmt2='%s%f%f%s'; % block data
hdrs=zeros(length(neob),14); % room for the headers
for i=2:length(neob)
This should be quite a bit quicker than reading line by line with fgetl.
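For anyone without grep at hand, here is a rough sketch of the same idea done entirely inside MATLAB: find the EOB lines, derive the block sizes from their positions, and parse each block in one textscan call. It assumes every block is laid out as header, data rows, EOB; the second header and its data row are made up for illustration, and the lines would normally come from the file (e.g. by splitting the output of fileread) rather than being typed in:

```matlab
% Made-up two-block sample in the EOB-marked layout.
lines = { ...
    '2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1'
    'AXCC1 0.843 1.00 P'
    'AXAS2 1.263 1.00 P'
    'EOB'
    '2015 1 22 0 9 10.000 45.90000 -129.90000 1.000 0.0 1.000 3.000 0.030 2'
    'AXEC1 0.923 1.00 P'
    'EOB'};

eob   = find(strncmp(lines, 'EOB', 3));   % line numbers of the EOB markers
neob  = diff([0; eob]) - 2;               % data rows per block (minus header and EOB)
first = [1; eob(1:end-1) + 1];            % line number of each header

hdrs   = zeros(numel(neob), 14);          % room for the headers
blocks = cell(numel(neob), 1);
for k = 1:numel(neob)
    hdrs(k, :) = sscanf(lines{first(k)}, '%f').';
    rows = lines(first(k)+1 : first(k)+neob(k));
    % Parse all rows of the block at once instead of line by line
    blocks{k} = textscan(strjoin(rows.', char(10)), '%s%f%f%s');
end
```

Here hdrs(k,14) holds the last integer of header k, so it pairs directly with blocks{k}.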
dpb on 18 Nov 2016
Oh, yeah, I forgot when I used the "approved" textscan over the deprecated textread that I use for simple cases to wrap the RHS in cell2mat to convert the cell to double array. Or, of course, you can use {:} to dereference the cell. But, my solutions in preferred order are--
  1. eob=textread('blocks.txt','%d:EOB'); % returns double directly
  2. eob=cell2mat(textscan(fid,'%d:EOB')); % ditto but cast req'd to do so(*) plus fopen/fclose hoopla
  3. neob=diff(eob{:}); % pain to dereference needless cell array w/o 1 or 2
(*) Actually, may also need 'collectoutput',1 as well; I forget what textscan does by default for a single value, whether it's a cell of Nx1 or N cells of 1x1 (or if that even matters in dereferencing; I try to avoid cell arrays like the plague, so I always have to 'spearmint to remember the rulez).
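On the open point about what textscan returns: it gives one cell per conversion specifier, so a single specifier yields a 1x1 cell holding an Nx1 column, and 'CollectOutput' should not be needed in that case. A quick sketch on a made-up stand-in for blocks.txt, using %f rather than %d so the result is double (with %d, textscan returns int32):

```matlab
% Made-up stand-in for the grep output in blocks.txt.
txt = sprintf('4:EOB\n15:EOB\n27:EOB\n');

C   = textscan(txt, '%f:EOB');   % 1x1 cell holding an Nx1 double
eob = C{1};                      % cell2mat(C) gives the same column here
neob = diff([0; eob]) - 2;       % rows per group, as in the answer above
```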
