How do I parse this complex text file with textscan?

Question

0 votes

I have a text file that is in a rather funky format. The file comes out of a relational database (Antelope) and consists of earthquake location, dates, times, phase information, etc. I need to parse out and collect the 'data blocks' that are in between each header line. I need the header lines as well for each "block". I have edited the file to include an EOB (end of block) marker to make this task easier, but it's not as trivial as I thought. Here's an image of the first 68 or so lines (out of about 1 million).

I'd like to pull the 4 columns below each header.... for example the first section is:

 2015  1 22  0  8 58.537   45.97929 -129.98717  1.184  0.0  1.039  3.621  0.036      1
 AXCC1  0.843 1.00 P
 AXAS2  1.263 1.00 P
 AXEC1  0.923 1.00 P
 AXEC2  1.103 1.00 P
 AXEC3  1.088 1.00 P
 AXCC1  1.873 0.25 S
 AXAS1  2.728 0.06 S
 AXAS2  2.168 0.25 S
 AXEC1  1.708 0.33 S
 AXEC2  2.043 0.25 S
 AXEC3  2.113 0.25 S

and put those in an array. But I need to be able to associate the header line, specifically the last integer in the header line (1 in this case), with each code block.
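To make the goal concrete, here is a minimal sketch of pulling the last integer out of a header line and attaching it to one data row. It uses the header and first row shown above; the variable and field names (hdr, row, eventId, rec, col3) are illustrative only and do not come from any code in this thread.

```matlab
% Header line and one data row copied from the sample above.
hdr = ' 2015  1 22  0  8 58.537   45.97929 -129.98717  1.184  0.0  1.039  3.621  0.036      1';
row = ' AXCC1  0.843 1.00 P';

vals    = sscanf(hdr, '%f');    % column vector with the 14 header numbers
eventId = vals(end);            % last integer in the header line

c   = textscan(row, '%s%f%f%s');   % the four columns of a data row
rec = struct('event', eventId, 'sta', c{1}{1}, ...
             'arv', c{2}, 'col3', c{3}, 'pha', c{4}{1});
disp(rec)
```

Every row of a block would get the same eventId, which is what ties the block back to its header.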

So far my code looks like this, but obviously it is not working yet. I don't get any errors but it's missing and skipping data etc.

fid=fopen('ph2dt_catalog8_edit.dat');
Block=1;
while (~feof(fid))
      InputText=textscan(fid,'%s',1,'delimiter','\n');
      HeaderLines{Block,1}=InputText{1};
      disp(HeaderLines{Block});
      FormatString='%s%f%f%s'; 
      InputText=textscan(fid, FormatString, 'delimiter','WhiteSpace','CollectOutput',1);
      Data{Block,1} = cell2mat(InputText{2});    
      [NumRows,NumCols] = size(Data{Block}); 
      eob=textscan(fid,'%s',1,'delimiter','\n');
      Block=Block +1;
end

Can anyone offer any suggestions? Let me know if I need to clarify anything further.

14 Comments

psprinks on 18 Nov 2016

@Jan Wow...you are right. fgetl is not slow...I'm perpetuating false rumors.

my code took ~10 hours and yours took about 20 minutes...so that's pretty amazing.

My only concern now is working with the format of the output...cell arrays within cell arrays. Also, this code didn't write the header lines to an array, which I need (not a big deal, because I have code I can splice in that does that).

I'm just a geophysicist hack when it comes to coding.

per isakson on 19 Nov 2016

Edited: per isakson on 19 Nov 2016

"I'd like to pull the 4 columns below each header". Your script doesn't extract the third column. And what is the intent of MATDAY_ARV=datenum(...)?


Answer 1

per isakson on 18 Nov 2016

Edited: per isakson on 30 Jan 2021

2 votes

Assumptions

Speed is important - "any ideas on faster method?"
The text file fits in memory - "The entire file is about 23 MB."
The station names are exactly five characters - "5" appears in the code as a magic number
The value of PHA is exactly one character
The line separator is "\n", i.e. char(10)
The header lines begin with 2014,2015,2016 or 2017 (and are the only lines to begin so).

Approach

Read the entire file into a character string.
Split the string into a cell array of strings, with one block in each cell
Pre-allocate output variables based on the size of the string and the cell array
Loop over all blocks and parse one block at a time
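The four steps can be tried on a small in-memory string before touching the real file. This is only a sketch: the second header and its row are made-up sample data, rows are read with %s rather than the fixed-width %5c used in the function below, and the regular expression is the form with a \n in the look-ahead (the version arrived at in the comments under this answer).

```matlab
% Two blocks; the second header and its data row are made-up sample data.
str = sprintf([ ...
    ' 2015  1 22  0  8 58.537   45.97929 -129.98717  1.184  0.0  1.039  3.621  0.036      1\n' ...
    ' AXCC1  0.843 1.00 P\n' ...
    ' AXAS2  1.263 1.00 P\n' ...
    ' 2015  1 22  0  9 12.000   45.90000 -129.90000  1.000  0.0  1.000  3.000  0.030      2\n' ...
    ' AXEC1  0.923 1.00 P\n']);

% Step 2: one block (header plus its rows) per cell.
xpr    = '(?<=(^|\n))[ ]*201[4567].+?(?=($|(\n[ ]*201[4567])))';
blocks = regexp(str, xpr, 'match');

% Step 3: pre-allocate from the newline count and the block count.
nnl  = length(strfind(str, char(10)));
N    = nnl - length(blocks) + 1;      % upper bound on the number of data rows
ARV  = nan(N, 1);
EVD  = nan(N, 1);
next = 1;

% Step 4: parse one block at a time.
for cac = blocks
    S0  = regexp(cac{1}, '\n', 'split', 'once');   % {header line, data rows}
    hdr = sscanf(S0{1}, '%f');                     % 14 header numbers
    C   = textscan(S0{2}, '%s%f%f%s');
    n   = numel(C{2});
    ARV(next:next+n-1) = C{2};
    EVD(next:next+n-1) = hdr(end);                 % last integer of the header
    next = next + n;
end
ARV(next:end) = [];                                % drop unused pre-allocated rows
EVD(next:end) = [];
```

The pre-allocation deliberately overestimates by a row or so; trimming at the end is cheaper than growing the arrays inside the loop.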

I tested with community_edit_2.txt, which is community_edit.txt with the # removed.

STA and PHA are character arrays rather than cell arrays of strings. That's somewhat faster

function [ ORG, ARV, STA, PHA, EVD ] = cssm( filespec )
    str = fileread( filespec );
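    % NOTE: the look-ahead below lacks a \n and can truncate header lines;
    % see the corrected expression in the comments under this answer.
    xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';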
    blocks = regexp( str, xpr, 'match' );
    nnl = length( strfind( str, char(10) ) ); 
    len = length( blocks );
    ORG = nan(len,14); 
    %
    N   = nnl - len + 1;
    STA = repmat( '-', [N,5] ); 
    ARV = nan(N,1); 
    PHA = repmat( '-', [N,1] );  
    EVD = nan(N,1);
    nextORG = 1;
    nextSTA = 1;
    for cac = blocks
        S0  = regexp( cac{1}, '\n', 'split', 'once' );
        S1  = textscan( S0{1}, '%f%f%f%f%f%f%f%f%f%f%f%f%f%f' ...
                    ,   'CollectOutput',true                  );
        ORG( nextORG, : ) = S1{1};
        MATDAY_ARV = datenum( S1{1}(1:6) ); %#ok<NASGU> 
        nextORG = nextORG + 1;
        %
        S2  = textscan( S0{2}, '%5c%f%f%1c' );
        N2  = size( S2{1}, 1 );
        STA( nextSTA:nextSTA+N2-1, : ) = S2{1}; 
        ARV( nextSTA:nextSTA+N2-1, 1 ) = S2{2}; 
        PHA( nextSTA:nextSTA+N2-1, 1 ) = S2{4}; 
        EVD( nextSTA:nextSTA+N2-1, : ) = S1{1}(end); 
        nextSTA = nextSTA + N2;
    end
    %
    if  N >= nextSTA % truncate the "memory", which isn't used.  
        STA( STA == '-' ) = [];
        STA = reshape( STA, [],5 );
        ARV( nextSTA : end ) = [];
        PHA( nextSTA : end ) = [];
        EVD( nextSTA : end ) = [];
    end
end
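The outputs are row-aligned: row k of STA, ARV, PHA and EVD describes the same pick, and EVD carries the last integer of that pick's header line. A small sketch of selecting rows, using made-up stand-in arrays with the same shapes as the function's outputs (the values mirror the sample block in the question):

```matlab
% Stand-ins for the outputs of cssm (made-up, three picks from event 1).
STA = ['AXCC1'; 'AXAS2'; 'AXCC1'];   % N-by-5 char array
PHA = ['P'; 'P'; 'S'];               % N-by-1 char array
ARV = [0.843; 1.263; 1.873];
EVD = [1; 1; 1];

isSta  = strcmp(cellstr(STA), 'AXCC1');   % rows for one station
isP    = PHA == 'P';                      % rows with phase P
pick   = isSta & isP & EVD == 1;          % station, phase and event combined
arvSel = ARV(pick);
```

cellstr is used for the station comparison because it works the same way in older releases; it costs a little of the speed gained by keeping STA as a character array.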

Error handling: This function has no error handling beyond what Matlab provides, e.g. fileread will report if the text file is missing. If the function is intended for routine use, it's important to handle errors, especially those caused by unexpected character strings in the input file.
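As an illustration of the kind of check meant here, the sketch below validates one block before it is parsed. It is not part of the function above; the message text and the identifier cssm:badBlock are made up, and the sample block is a deliberately truncated header with no data rows.

```matlab
% A block that was cut short: part of a header line and no data rows.
block = ' 2015  1 22  0  8 58.537   45.97929 -129.98717';

S0  = regexp(block, '\n', 'split', 'once');   % 1x1 cell when there is no newline
msg = '';
if numel(S0) < 2
    msg = 'block has a header line but no data rows';
else
    hdr = sscanf(S0{1}, '%f');
    if numel(hdr) ~= 14
        msg = sprintf('header has %d numeric fields, expected 14', numel(hdr));
    end
end
if ~isempty(msg)
    % Hypothetical identifier; a real function might call error() instead.
    warning('cssm:badBlock', 'Skipping block: %s', msg);
end
```

Placed at the top of the loop body, a check like this turns an obscure indexing error on S0{2} into a message that points at the offending block.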

2016-11-18, Performance test

Computer: eight year old vanilla desktop with 8GB RAM.
System: Windows7,64bit, Matlab R2016a,64bit
Test file: community_edit_1M.txt is 27.6MB, 95200 blocks, 1097181 lines. It's created by concatenating copies of community_edit.txt and removing the #.

>> filespec = 'h:\m\cssm\community_edit_1M.txt';
>> tic,[ORG0,ARV0,STA0,PHA0,EVD0] = cssm( filespec ); toc
Elapsed time is 22.443859 seconds.

Caveat: The text file was probably available in the system cache, since this was not cleared before the test.

Comparison: This is about 4.5 times faster than the function asd from Jan's answer:

>> filespec = 'h:\m\cssm\community_edit_1M_EOB.txt';
>> tic, [Data, HeaderLines] = asd( filespec ); toc
Elapsed time is 101.202009 seconds.

11 Comments

per isakson on 20 Nov 2016

Edited: per isakson on 21 Nov 2016

Matlab has good debugging features, see Debug a MATLAB Program.

No, I cannot tell why you encounter this error. Instead of guessing, I'll try to help you find out.

set a breakpoint at line 18
start the function. Execution will halt at line 18
hover over the variable str. The tooltip will show the value of str, which should be "identical" to the content of the text file. There should not be any # or EOB.
double line spacing in the tooltip would indicate that the line separator is "\r\n", i.e. char(13)+char(10). If so, execute double(str(90:145)) and look for the pair 13 10 in the output.
hover over the variable blocks. (The exact content of the tooltip may differ between Matlab releases. I use R2016a.)
click Step three(?) times
hover over the variable S0
click Step until line 27
select the expression S0{2}
right-click and select Evaluate Selection
click Step and hover over S2
click Quit Debugging in the toolstrip

Now, I hope that you either were able to reproduce these steps or that you identified a difference, which explains why it went wrong.

If you were able to reproduce these steps, the next steps are

click Breakpoints in the toolstrip (/toolbar)
click Clear All
click Stop on Errors
run the function and it will halt just in advance of throwing the error
hover over str, blocks, cac and S1
select the expression S0{2}, right-click and select Evaluate Selection

By now you should know more about the cause of the problem
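The line-separator check from the first list can also be done from the command line instead of reading the tooltip. A minimal sketch, assuming the file content is already in a string; the two-row sample with \r\n endings is made up:

```matlab
% Made-up sample with Windows line endings (char(13) followed by char(10)).
str = sprintf(' AXCC1  0.843 1.00 P\r\n AXAS2  1.263 1.00 P\r\n');

hasCR = any(str == char(13));            % true if carriage returns are present
pos   = find(str == char(13), 1);        % position of the first one
codes = double(str(pos:pos+1));          % the pair 13 10 marks a \r\n ending

% Normalise to \n only, which is what the function above assumes.
strLF = strrep(str, sprintf('\r\n'), sprintf('\n'));
```

For the second list, typing dbstop if error at the prompt has the same effect as clicking Stop on Errors, and dbclear all removes all breakpoints.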

/over

psprinks on 21 Nov 2016

Edited: psprinks on 21 Nov 2016

community.txt

Per,

I can't thank you enough for your effort. You are correct this hasn't been the most effective way to communicate, and I'm forgetting that English isn't everyone's first language. (my bad)

So, I am familiar with debugging and I have isolated the problem.

When the code reaches the 997th header line it doesn't read the full line. It stops after it reads the longitude value. So S0 becomes a 1x1 cell instead of the 1x2 that it should be. Therefore, when Matlab gets to S2 it can't evaluate S0{2} because it doesn't exist.

Further, the issue is with:

xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';

What's happening is that anytime the characters 2014, 2015, 2016 or 2017 appear in the header line after the beginning of the line the code is cutting the rest of the header line. This will happen at several points throughout the rest of the code. For example the last characters in the 12015th header line are 2015 and the xpr assignment says to drop those characters and so cac, blocks, S0 aren't correct. I hope this makes sense.

I am attaching a much larger portion of the text file so you can see what I'm seeing.

Also I've put an image of my workspace so you can see that S0{2} doesn't exist.

per isakson on 21 Nov 2016

Edited: per isakson on 22 Nov 2016

You found a bug in my code and you spotted the erroneous expression: "the characters 2014, 2015, 2016 or 2017 appear in the header line". However, let me show you how I would track it down.

set Stop on Errors and run
execution halted at line 32
select cac{1} and evaluate. The block is truncated in the header line as you already found "reaches the 997th header line it doesn't read the full line."
search for the value 45.93929 in the file. Hopefully there are only a few occurrences. I use Notepad++ to inspect data files.

The block is truncated just before 2017. And that is done by

xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';
blocks = regexp( str, xpr, 'match' );

The error is in the look-ahead part, (?=($|[ ]*201[4567])). It matches 2017 in any position, not only at the beginning of a line. A \n before 2017 is missing. Replace the expression by

xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|(\n[ ]*201[4567])))';

which has an extra pair of parentheses for readability. Now the "look ahead" looks for either the end of the entire string or a new line followed by zero or more spaces followed by 201 followed by one of 4567.
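The difference between the two expressions can be reproduced on a single block. In this sketch the header is the one from the question except that its last integer is changed to 12015 (a made-up value chosen because it contains 2015, like the 12015th header mentioned above); the variable names are illustrative.

```matlab
% One block whose header ends in 12015; the rows are from the question.
str = sprintf([ ...
    ' 2015  1 22  0  8 58.537   45.97929 -129.98717  1.184  0.0  1.039  3.621  0.036  12015\n' ...
    ' AXCC1  0.843 1.00 P\n' ...
    ' AXAS2  1.263 1.00 P\n']);

xprOld = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';
xprNew = '(?<=(^|\n))[ ]*201[4567].+?(?=($|(\n[ ]*201[4567])))';

blkOld = regexp(str, xprOld, 'match');
blkNew = regexp(str, xprNew, 'match');

% The old expression stops in front of the 2015 inside 12015, so the block
% holds only part of the header and the split yields a single piece.
S0old = regexp(blkOld{1}, '\n', 'split', 'once');
% The new expression keeps the header and its rows together.
S0new = regexp(blkNew{1}, '\n', 'split', 'once');

fprintf('old: %d part(s), new: %d part(s)\n', numel(S0old), numel(S0new));
```

A single-piece S0 is exactly the state in which S0{2} cannot be evaluated.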

Now the function reads the current data file

>> filespec = 'h:\m\cssm\community_20161121.txt';
>> tic,[ORG0,ARV0,STA0,PHA0,EVD0] = cssm( filespec ); toc
Elapsed time is 0.199797 seconds.
>> whos ORG0
  Name        Size            Bytes  Class     Attributes
  ORG0      826x14            92512  double

psprinks on 21 Nov 2016

AWESOME! This literally saved me days of processing time!!!

per isakson on 22 Nov 2016

Edited: per isakson on 22 Nov 2016

I'm glad the function is useful and will be used!

You had already spotted the expression with the bug: "the characters 2014, 2015, 2016 or 2017 appear in the header line". I could have spared myself the details in my last comment. However, I was kind of occupied with describing a complete debugging session, hopefully to the benefit of some other reader.

Answer 2

Jan on 17 Nov 2016

Edited: Jan on 18 Nov 2016

3 votes

fscanf might be easier than textscan:

[EDITED: bugs removed]

function [Data, HeaderLines] = asd(FileName)
fid = fopen(FileName, 'r');
if fid == -1
   error('Cannot open file: %s', FileName);
end
maxBlocks    = 10000;        % Is this sufficient? Better too large.
HeaderLines  = cell(1, maxBlocks);
Data         = cell(1, maxBlocks);
iBlock       = 0;
aBlock       = cell(1, 20);  % Or largest number of lines per block
while ~feof(fid)
   iBlock = iBlock + 1;
   Line = fgetl(fid);
   if ~ischar(Line)
      break;
   end
   HeaderLines{iBlock} = Line;
     isEOB = false;
     iData = 0;
     while ~isEOB && ~feof(fid)
        Line = fgetl(fid);
        if ~ischar(Line) || strncmp(Line, 'EOB', 3)
           isEOB = true;
        else
           iData = iData + 1;
           len                  = length(Line);
           [s1, num, err, ind1] = sscanf(Line, '%s', 1);
           [f,  num, err, ind2] = sscanf(Line(ind1:len), '%f', 2);
           s2                   = sscanf(Line(ind1+ind2:len), '%s');
           aBlock{iData}        = {s1, f(1), f(2), s2};
           % Parse = textscan(Line, ' %s %f %f %s');
           % aBlock{iData} = {Parse{1}{1}, Parse{2:3}, Parse{4}{1}};
        end
     end
     Data{iBlock} = aBlock(1:iData);  % Crop the data block
  end
fclose(fid);
Data        = Data(1:iBlock);
HeaderLines = strtrim(HeaderLines(1:iBlock));
end
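The function returns Data as cells within cells: one cell per block, holding one 1x4 cell per row. If flat arrays are easier to work with, the nesting can be removed afterwards. A sketch under the assumption that Data has exactly that shape; the sample below is made up in place of real output, and repelem needs R2015a or later.

```matlab
% Stand-in for the Data output: two blocks with 2 and 1 rows (made-up).
Data = { { {'AXCC1', 0.843, 1.00, 'P'}, {'AXAS2', 1.263, 1.00, 'P'} }, ...
         { {'AXEC1', 0.923, 1.00, 'S'} } };

rows = [Data{:}];                 % 1-by-M cell, one 1x4 cell per data row
tbl  = vertcat(rows{:});          % M-by-4 cell: station, number, number, phase
STA  = tbl(:, 1);                 % cell array of station names
ARV  = cell2mat(tbl(:, 2));       % M-by-1 double
PHA  = tbl(:, 4);

% Block number of every row, so rows can be matched to HeaderLines.
counts     = cellfun(@numel, Data);
blockOfRow = repelem(1:numel(Data), counts).';
```

With blockOfRow in hand, HeaderLines{blockOfRow(k)} is the header that belongs to row k.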

6 Comments

FishermanJack on 9 Nov 2017

@Jan Simon... I am pretty new to Matlab, and because I have a similar problem and it seems that your code should work, could you comment the lines for easier understanding? Thanks.

Jan on 9 Nov 2017

@FishermanJack: This would be very inefficient. I could spend hours mentioning all the details I know about the code lines. Most of the commands are trivial and I cannot guess which commands are not clear to you. So better use the debugger to step through the code line by line, see what happens in which order, and read the documentation of the commands which are not clear. If any details are not clear afterwards, ask a specific question.

Answer 3

dpb on 18 Nov 2016

Edited: dpb on 18 Nov 2016

1 vote

OK, for your file I used a grep utility first to find the EOB markers and then computed the numbers for each group...within Matlab it looked like--

>> cmd='grep -n EOB community_edit.txt >blocks.txt';
>> system(cmd);
>> eob=textread('blocks.txt','%d:EOB');
>> neob=diff([0;eob])-2;
>> neob(1:10)'  % see if looks ok...
ans =
  11    10    11    12    10    12    13    12     8     8

That agrees with the number I get counting in editor.
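If grep is not at hand, the same line numbers could be found inside Matlab. A sketch, assuming the file fits in memory so that fileread could supply the text; here a tiny made-up string with placeholder hdr and row lines stands in for the file.

```matlab
% Made-up stand-in for fileread('community_edit.txt'): two blocks, each
% closed by an EOB line.
txt = sprintf('hdr\nrow\nrow\nEOB\nhdr\nrow\nEOB\n');

lines = regexp(txt, '\n', 'split');        % one cell per line
eob   = find(strncmp(lines, 'EOB', 3));    % line numbers of the EOB markers
eob   = eob(:);                            % column vector, as textread returns
neob  = diff([0; eob]) - 2;                % rows per block: minus header and EOB
```

For this sample the two blocks have 2 and 1 data rows, which is what neob contains.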

Now, with that, read the first header and block, then repeat for the remaining blocks 2:length(neob), each time skipping one header line (the EOB marker, which the first group doesn't have in front of it).

fmt1=repmat('%f',1,14); % header line
fmt2='%s%f%f%s';        % block data
fid=fopen(...
n=length(neob);
hdrs=zeros(n,14);       % room for the headers
blks=cell(n,1);         % one cell per data block
hdrs(1,:)=cell2mat(textscan(fid,fmt1,1,'collectoutput',1));
blks{1}=textscan(fid,fmt2,neob(1),'collectoutput',1);
for i=2:n
    % 'headerlines',1 is meant to skip the EOB line in front of each later header
    hdrs(i,:)=cell2mat(textscan(fid,fmt1,1,'headerlines',1,'collectoutput',1));
    blks{i}=textscan(fid,fmt2,neob(i),'collectoutput',1);
end
fid=fclose(fid);

Should be quite a bit quicker than reading line by line with fgetl.

2 Comments

psprinks on 18 Nov 2016

thanks dpb

I'm trying to implement your code but it's throwing this:

Error using diff
Function 'diff' is not supported for class 'cell'.

The output from

   eob=textscan('blocks.txt','%d:EOB');
is a cell.

dpb on 18 Nov 2016

Edited: dpb on 18 Nov 2016

Oh, yeah, I forgot when I used the "approved" textscan over the deprecated textread that I use for simple cases to wrap the RHS in cell2mat to convert the cell to double array. Or, of course, you can use {:} to dereference the cell. But, my solutions in preferred order are--

eob=textread('blocks.txt','%d:EOB'); % returns double directly
eob=cell2mat(textscan(fid,'%d:EOB')); % ditto but cast req'd to do so(*) plus fopen/fclose hoopla
neob=diff(eob{:}); % pain to dereference needless cell array w/o 1 or 2

(*) Actually, may also need 'collectoutput',1 as well; I forget what textscan does by default for a single value, whether it's a cell of Nx1 or N cells of 1x1 (or if that even matters in dereferencing). I try to avoid cell arrays like the plague, so I always have to 'spearmint to remember the rulez.
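A quick experiment along those lines, as a sketch only: textscan is given text directly instead of a file identifier, and the two grep-style lines are made up.

```matlab
% Made-up stand-in for the content of blocks.txt.
txt = sprintf('13:EOB\n25:EOB\n');

C      = textscan(txt, '%d:EOB');   % a cell array, even for a single column
eobMat = cell2mat(C);               % numeric column (int32 because of %d)
eobRef = C{:};                      % same values by dereferencing the cell

neob = diff([0; eobMat]) - 2;       % diff accepts the numeric vector
```

Passing C itself to diff is what produces the "not supported for class 'cell'" error quoted above.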
