Parsing a text file in MATLAB and accessing the contents of each section

Question

yashvin on 10 Jun 2015

https://www.mathworks.com/matlabcentral/answers/223200-parsing-a-text-file-in-matlab-and-accessing-contents-of-each-sections

Commented: yashvin on 12 Jun 2015

Hi, I want to separate a text file, which is quite big, into different sections in MATLAB.

 - Ignore the first set of lines
 - Then the data set is repeated
 - Access its content for a particular set of conditions

For example, for a drag factor of 1.0 and fuel factor of 1.2, I want to find the corresponding alt for a particular weight.

The text file is attached.

Thanks, Yashvin

2 Comments

per isakson on 10 Jun 2015

Edited: per isakson on 10 Jun 2015

"quite big": how big compared to available memory?
"different sections": what defines the beginning of a section? Is "V2500_A5" a fixed string that defines the beginning of a new section?

yashvin on 10 Jun 2015

It is a 60 MB text file. As an example, I am attaching one full section from the text file. The initial part, up to "Cruise at a given cost index", is unimportant.

Each section begins with "CLEAN CONFIGURATION" followed by a table.

For example, for drag factor = 1, fuel factor = 1.2 and ISA = 13, I want to access the table and get the corresponding weight.

I want to treat all the parameters under 'CLEAN CONFIGURATION' as fields so that I can select on different conditions.

Answer 1

per isakson on 10 Jun 2015

https://www.mathworks.com/matlabcentral/answers/223200-parsing-a-text-file-in-matlab-and-accessing-contents-of-each-sections#answer_182311

Edited: per isakson on 11 Jun 2015

Here is a function, which reads question2.txt and returns a struct vector. It might serve as a starting point.

>> out = cssm()
out = 
1x2 struct array with fields:
    DRAG_FACTOR
    FUEL_FACTOR
    Table
>> out(abs([out.DRAG_FACTOR]-1)<1e-6 & abs([out.FUEL_FACTOR]-1)<1e-6).Table(1:5,1:3)
ans =
   1.0e+04 *
    4.0000    0.0000    0.0211
    4.0500    0.0000    0.0212
    4.1000    0.0000    0.0213
    4.1500    0.0000    0.0214
    4.2000    0.0000    0.0215

where

function    out = cssm() 
      str = fileread( 'question2.txt' );
      section_separator = 'CLEAN CONFIGURATION';
      cac = strsplit( str, section_separator );
      len = length( cac );
      out = struct( 'DRAG_FACTOR',nan(1,len-1),  'FUEL_FACTOR',[], 'Table',[] );
      for jj = 2 : len
          out(jj-1) = handle_one_section_( cac{jj} );
      end
  end
  function    sas = handle_one_section_( str )
      sas = struct( 'DRAG_FACTOR',[],  'FUEL_FACTOR',[], 'Table',[] );
      sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
      sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
      sas.Table = excerpt_table_( str );
  end
  function  val = excerpt_num_( str, name )
      buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
      val = str2double( buf );
  end
  function  val = excerpt_table_( str )
      %   Q&D, quick and dirty, search a numerical sequence, which is at least 100 characters 
      %   long. PROBLEM: requires that the preceding line ends with a "non-numerical" 
      %   character and that the following line begins with a "non-numerical" character. 
      buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
      val = str2num( buf );
  end
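
As a small illustration of how excerpt_num_ works: the lookbehind `(?<=FUEL FACTOR)` requires the label to be present but keeps it out of the match, so only the blanks and digits that follow it are returned. The sample string below is made up for illustration; the layout of the real file may differ.

```matlab
% Hypothetical one-line sample; not taken from the real file
str  = 'DRAG FACTOR   1.000     FUEL FACTOR   1.200';
name = 'FUEL FACTOR';

% The lookbehind (?<=...) requires the label but keeps it out of the match
buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );

% str2double tolerates the leading blanks that are part of the match
val = str2double( buf );
```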

Modified function based on comment

>> cssm
ans = 
1x2 struct array with fields:
    DRAG_FACTOR
    FUEL_FACTOR
    Table
    COST_INDEX
    ALTITUDE
    ISA

where

function    out = cssm() 
      str = fileread( 'question2.txt' );
      section_separator = 'CLEAN CONFIGURATION';
      cac = strsplit( str, section_separator );
      len = length( cac );
      out = struct( 'DRAG_FACTOR',nan(1,len-1), 'FUEL_FACTOR',[], 'Table',[]  ...
                  , 'COST_INDEX' ,[]          , 'ALTITUDE'   ,[], 'ISA'  ,[]  );
      for jj = 2 : len
          out(jj-1) = handle_one_section_( cac{jj} );
      end
  end
  function    sas = handle_one_section_( str )
      sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[]   ...
                  , 'COST_INDEX' ,[], 'ALTITUDE'   ,[], 'ISA'  ,[]   );
      sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
      sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
      sas.COST_INDEX = excerpt_colon_separated_num_( str, 'COST INDEX' );
      sas.ALTITUDE   = excerpt_colon_separated_num_( str, 'ALTITUDE' );
      sas.ISA        = excerpt_colon_separated_num_( str, 'ISA' );
      sas.Table = excerpt_table_( str );
  end
  function  val = excerpt_num_( str, name )
      buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
      val = str2double( buf );
  end
  function  val = excerpt_table_( str )
      %   Q&D, quick and dirty, search a numeric sequence, which is at least 100 characters 
      %   long. PROBLEM: requires that the preceding line ends with a "non-numeric" 
      %   character and that the following line begins with a "non-numeric" character. 
      buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
      val = str2num( buf );
  end
  function  val = excerpt_colon_separated_num_( str, name )
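      %   The quantifier sits inside the group so that the token holds the whole number;
      %   "([\d\.])+" would keep only the last character of it.
      %   NOTE: the separator class also consumes a "-", so a negative value loses its sign.
      buf = regexp( str, [ '(?<=', name, ')', '(?:[ \:\-]+)([\d\.]+)' ], 'tokens', 'once' );
      val = str2double( buf{:} );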
  end
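
One way to use the returned struct array for the lookup asked about in the question is sketched below. It assumes that the first table column holds the weight (the thread says the table starts with WGHT) and that the requested weight appears exactly in the table. The struct array here is synthetic stand-in data with made-up numbers, not values from the real file.

```matlab
% Synthetic stand-in for the struct array returned by cssm(); numbers are made up
out = struct( 'DRAG_FACTOR', {1.0, 1.0}, 'FUEL_FACTOR', {1.0, 1.2}, ...
              'Table', {[40000 210; 40500 212], [40000 215; 40500 217]} );

drag_wanted   = 1.0;
fuel_wanted   = 1.2;
weight_wanted = 40500;   % assumed to be listed in column 1 of the table
tol           = 1e-6;    % tolerance for comparing parsed floating-point values

% Logical mask over the sections, same idea as the one-liner in the answer
is_match = abs( [out.DRAG_FACTOR] - drag_wanted ) < tol ...
         & abs( [out.FUEL_FACTOR] - fuel_wanted ) < tol;
assert( nnz( is_match ) == 1, 'Expected exactly one matching section' );

% Pick the table of the matching section, then the row for the wanted weight
tbl = out(is_match).Table;
row = tbl( abs( tbl(:,1) - weight_wanted ) < tol, : );
```

If the requested weight falls between two rows, interp1 over the first column could replace the exact match.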

9 Comments

per isakson on 11 Jun 2015

@Guillaume, yes the two text files differed. The first is a stripped down version of the second. I attach the copies I used.

yashvin on 12 Jun 2015

Hi! Do you still have the file? Yes! Now it's clearer to me! Thanks so much! Yes, both of your answers were very helpful! I am getting used to it now. The first answer was at a higher level! Thank you both for your contributions!

Answer 2

Guillaume on 10 Jun 2015

https://www.mathworks.com/matlabcentral/answers/223200-parsing-a-text-file-in-matlab-and-accessing-contents-of-each-sections#answer_182203

Your text file is not really designed to be read by a computer. It's not very consistent (variable number of blank lines, variable number of spaces, inconsistent number format, etc.) which makes it difficult to parse efficiently.

So the first thing to look at is whether you can get the same data in a format designed to be parsed by a computer: binary, JSON, XML, etc.

Failing that, the following works on the attached file but, because of the inconsistencies, may not work on a larger file:

dragwanted = 1.0;
fuelwanted = 1.2;
content = fileread('question.txt'); %get whole content of file
sections = regexp(content, 'DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)\s+([A-Z .]+\r\n[A-Z() ]+\r\n\s*\r\n([0-9. ]+\r\n)+)', 'tokens');
%sections is a cell array of 1x3 cell arrays of {drag factor, fuel factor, table}
dragfactors = cellfun(@(s) str2double(s{1}), sections);
fuelfactors = cellfun(@(s) str2double(s{2}), sections);
wanted = dragfactors == dragwanted & fuelfactors == fuelwanted;
assert(sum(wanted) > 0, 'No section match criteria');
assert(sum(wanted) == 1, 'More than one section match criteria');
section = sections{wanted}{3};
%parse the section:
sectionlines = strsplit(section, {'\n', '\r'});
sectionheader = strsplit(strtrim(sectionlines{1}))
sectionunits = strtrim(regexp(sectionlines{2}, '(?<=\().*?(?=\))', 'match'))
sectiontable = str2num(strjoin(sectionlines(4:end-1), '\n'))

6 Comments

yashvin on 10 Jun 2015

Now I understand it better, thanks to you! So, in fact, the conditions listed before the table can be any of them. They can also be the CG location percentage, altitude value, ISA number (positive or negative), cost index value or % of MCR thrust.

In the file, in each section, we only care about the part from CLEAN CONFIGURATION to the last value of the table. The rest can be discarded.

The table always starts with WGHT and the header stays the same. Yes, the units should be kept.

Thanks, Yashvin

Guillaume on 10 Jun 2015

Your file is a real mess: sometimes the empty lines contain just one space, sometimes none; the header line starts with three spaces, the unit line with only two; the parameter section sometimes has one parameter on a line, sometimes two. You may be better off parsing the file line by line.
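
A line-by-line parse could be sketched roughly as follows. The sample text, including the ALT column and the units, is invented for illustration, and the rule that a purely numeric line is a table row is an assumption that would need checking against the real file. Rows with differing numbers of columns would make the table assignment fail.

```matlab
% Synthetic two-section sample; the layout is a guess, not the real file
sample = sprintf( [ ...
    'Some preamble that should be ignored\n' ...
    'CLEAN CONFIGURATION\n' ...
    'DRAG FACTOR 1.000   FUEL FACTOR 1.200\n' ...
    ' \n' ...
    '   WGHT   ALT\n' ...
    '  (KG)   (FT)\n' ...
    ' 40000  35000\n' ...
    ' 40500  34800\n' ...
    'CLEAN CONFIGURATION\n' ...
    'DRAG FACTOR 1.000   FUEL FACTOR 1.000\n' ...
    '   WGHT   ALT\n' ...
    '  (KG)   (FT)\n' ...
    ' 40000  36000\n' ] );

lines    = regexp( sample, '\r?\n', 'split' );   % tolerate \n and \r\n endings
sections = struct( 'DRAG_FACTOR', {}, 'FUEL_FACTOR', {}, 'Table', {} );

for k = 1 : numel( lines )
    txt = strtrim( lines{k} );
    if strcmp( txt, 'CLEAN CONFIGURATION' )
        % Start a new, empty section
        sections(end+1) = struct( 'DRAG_FACTOR', NaN, 'FUEL_FACTOR', NaN, 'Table', [] );
    elseif isempty( sections ) || isempty( txt )
        continue   % preamble before the first section, or a blank line
    elseif ~isempty( regexp( txt, '^[\d.\s]+$', 'once' ) )
        % Assumption: a purely numeric line is one row of the table
        sections(end).Table(end+1,:) = sscanf( txt, '%f' ).';
    else
        % Parameter lines; header and unit lines match nothing and are skipped
        tok = regexp( txt, 'DRAG FACTOR\s+([\d.]+)', 'tokens', 'once' );
        if ~isempty( tok ), sections(end).DRAG_FACTOR = str2double( tok{1} ); end
        tok = regexp( txt, 'FUEL FACTOR\s+([\d.]+)', 'tokens', 'once' );
        if ~isempty( tok ), sections(end).FUEL_FACTOR = str2double( tok{1} ); end
    end
end
```

For a very large file the same loop could read from fgetl instead of splitting a string held in memory.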

Otherwise, the following will get you the table and the criteria section, but will not parse the criteria:

sections = regexp(content, ...
    'CLEAN CONFIGURATION\r\n((.*\r\n)+?)(\s+WGHT.*\r\n.*\r\n.*\r\n([0-9. ]+\r\n)+)', ...
    'tokens', 'dotexceptnewline');

sections is a 1 x n cell array (n = number of sections) of cell arrays whose first elements are the criteria part and whose second elements are the table part. You can parse the table with the same code as before. For reference, the above regular expression can be decoded as:

match 'CLEAN CONFIGURATION' followed by '\r\n' (newline)
starts the first token (at the first '(')
match any character but a newline followed by '\r' (the |(.*\r
