Reading in data with spaces

Question

Art Hartzell on 5 Aug 2015

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/232500-reading-in-data-with-spaces

Edited: Cedric on 6 Aug 2015

I have a bunch of fixed-width data that contains spaces. It is formatted such that each "column" is a fixed number of characters wide (including space characters). I've been trying to use "textscan" with a format like '%6c %6c %6c %6c %6c', but it seems to (A) ignore spaces at the beginning of the string (which throws off the count when, for example, the first number goes from 9 to 10) and (B) not recognize blank fields in the middle or at the ends. E.g.: "___8.5|___9.2|______|______|___7.6", where "_" represents one space and the "|" are added for clarity. I need this to read in with 8.5 in the first cell, 9.2 in the second, and 7.6 in the fifth. I can tolerate the values being strings (and converting later) as long as the column placement is preserved.

1 Comment

dpb on 5 Aug 2015

Edited: dpb on 5 Aug 2015

You've stumbled into the black hole of C (and hence Matlab) fixed-width, non-delimited input; there's no way to deal with a missing field other than by reading the data as a character array and parsing the lines using counted string indexing. No input scanning format will honor the blanks.
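A minimal sketch of that counted-indexing approach, assuming the five 6-character fields from the question. The variable names (`widths`, `starts`, `stops`, `values`) are illustrative only, and the sample line is built in memory rather than read from a file:

```matlab
% Sketch: parse one fixed-width line by counted character indexing.
% Assumption: field widths are known in advance (here five fields of 6 chars).
widths = [6 6 6 6 6];
line   = ['   8.5', '   9.2', blanks(6), blanks(6), '   7.6'];

line(end+1:sum(widths)) = ' ';      % pad short lines so indexing cannot overrun
stops  = cumsum(widths);            % last character of each field
starts = stops - widths + 1;        % first character of each field

values = nan(1, numel(widths));
for k = 1:numel(widths)
    field     = strtrim(line(starts(k):stops(k)));
    values(k) = str2double(field);  % an all-blank field becomes NaN
end
disp(values)
```

Because the widths are held in a vector, the same loop handles unequal field widths; for a whole file, the loop body would be applied to each line in turn.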

See the preceding discussion on the subject for more than you really want to know of the issues that can arise...

http://www.mathworks.com/matlabcentral/answers/120549-extract-value-from-txt-weird-lay-out

As noted many times before, I've begged for TMW to provide a solution for 20+ yr to no avail.

Oh, I see you also did a search and came across the old (ca. 2009) TMW post apologizing for the limitation and demonstrating the fixed-width parsing.

I'd urge you to submit an enhancement request and add another "voice in the wilderness" to try to get something done (but I wouldn't hold my breath waiting... :( )

Answer 1

Cedric on 5 Aug 2015

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/232500-reading-in-data-with-spaces#answer_188370

Edited: Cedric on 5 Aug 2015

Here is a quick fix following a different path from what is mentioned in Dpb's thread. Assume that we have test.txt with the following content (attached):

     1   9.3           8     7
        12.4     2

Then running

 buffer = fileread( 'test.txt' ) ;
 buffer = regexprep( buffer, ' {6}', '   NaN' ) ;
 data   = reshape( sscanf( buffer, '%f' ), 5, [] )' ;

produces

 >> data
 data =
    1.0000    9.3000       NaN    8.0000    7.0000
       NaN   12.4000    2.0000       NaN       NaN
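To try the same steps without the attachment, the buffer can be built in memory, as in the sketch below. It rests on two assumptions worth stating: every non-blank field is right-justified in its 6 characters, and trailing blank fields are actually present as spaces in each line (otherwise the element count is no longer a multiple of 5 and RESHAPE errors out).

```matlab
% Sketch: same regexprep/sscanf/reshape steps on an in-memory buffer.
% Assumptions: 5 fields of 6 chars per line, values right-justified,
% trailing blank fields present as spaces.
row1   = ['     1', '   9.3', blanks(6), '     8', '     7'];
row2   = [blanks(6), '  12.4', '     2', blanks(6), blanks(6)];
buffer = sprintf('%s\n', row1, row2);

% Each run of 6 spaces is one empty field; replace it with a 6-char NaN token.
buffer = regexprep(buffer, ' {6}', '   NaN');

% Read all numbers as one column vector, then fold into rows of 5.
data = reshape(sscanf(buffer, '%f'), 5, [])';
disp(data)
```

A left-justified value or a field with trailing spaces could shift where the 6-space runs fall, in which case indexing by column position is the safer route.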

7 Comments

Cedric on 5 Aug 2015

Edited: Cedric on 5 Aug 2015

My pleasure! Actually, using RESHAPE is not mandatory; we could use TEXTSCAN with a format spec that defines columns:

textscan( buffer, '%f %f %f %f %f' ) ;

and immediately get a cell array of columns. I have timed multiple approaches in cases that I had to deal with, though, and reading all floats in one shot and then reshaping the output into an array has often been the most efficient. This would have to be profiled in your specific case (use tic-toc or the profiler for that).
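A rough way to compare the two routes is sketched below. The buffer is a small stand-in in which the NaN tokens are already in place, and the 'CollectOutput' option is used so that TEXTSCAN returns one numeric matrix instead of five separate columns. Timings on such a tiny buffer mean little; a realistic file would be needed for a fair comparison.

```matlab
% Sketch: time TEXTSCAN against SSCANF + RESHAPE on the same buffer.
% Assumption: the NaN substitution has already been applied to the buffer.
buffer = sprintf('%s\n', '     1   9.3   NaN     8     7', ...
                         '   NaN  12.4     2   NaN   NaN');

tic
C     = textscan(buffer, '%f %f %f %f %f', 'CollectOutput', true);
dataA = C{1};                                    % N-by-5 numeric matrix
tA    = toc;

tic
dataB = reshape(sscanf(buffer, '%f'), 5, [])';   % same matrix via reshape
tB    = toc;

fprintf('textscan: %.6f s, sscanf+reshape: %.6f s\n', tA, tB);
```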

What we do here is first to obtain a column vector with all the floats out of SSCANF, and then to reshape it into the numeric array that we need.

RESHAPE works based on the column-first MATLAB approach, which means that it splits its first input arg. into blocks of length "the number of rows defined by its second input arg.", gives these blocks a column shape, and concatenates them horizontally. Well, a simple example will make it clear:

 >> reshape( 1:10, 2, [] )
 ans =
     1     3     5     7     9
     2     4     6     8    10

Observe that the 1, 2 define the first column (it doesn't "go horizontally"). Here, we pass an empty array as third input arg., which tells reshape that it has to compute by itself the number of columns (based on the total number of elements and the number of rows = 2 that we passed as second input arg.). All that to explain that we cannot organize directly the reshape in a row-first fashion, so we build it column-first and then we transpose the array.
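Applied to the case at hand, the column-first build followed by a transpose can be sketched as follows, with `1:10` standing in for the ten values that SSCANF returns from a two-line, five-field file:

```matlab
% Sketch: fold a column vector into rows of 5 values each.
v        = (1:10)';              % stand-in for the sscanf output
colFirst = reshape(v, 5, []);    % 5-by-2: each COLUMN holds one line of the file
rowFirst = colFirst';            % 2-by-5: each ROW holds one line of the file
disp(rowFirst)
```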

I'm in a rush.. hopefully I didn't make it less clear than before ;-)

dpb on 5 Aug 2015

I'm such a klutz w/ regexp I let the whizards with it that know the syntax make such suggestions.

Cedric, that way of substituting/inserting the delimiter is a new one to me; I'll save it in my goodies file for future use, as it is a very useful trick. I presume it can be expanded to handle an arbitrary set of field widths and number of fields?

In the (now getting to be distant) past and previous life I had written a Fortran mex file that took a FORMAT string and passed it to the Fortran i/o library to read the formatted file/buffer. At one point back in the early days when I first requested the enhancement I supplied that as a model to TMW for their consideration. Unfortunately, with the retirement and move back to the farm the (turns out only) source for that was inadvertently left on the machine at the former employer's site and by the time I discovered same it (the machine) had been repurposed and wiped clean.

I started some months ago to try to recreate it and got so far as to finally get a working environment to build Fortran mex files again, but then spring intervened and that's had to be put on hold for this farming season until after fall harvest now before will have time to try to delve into it again. The idea was to make a FEX submittal that would solve a large fraction of these cases; to the best of my recollection, the previous incarnation was relatively flexible but had some significant "issues" on totally general FORMAT statements when it came to such things as format recursion and the like in parsing what the output actually was for returning to Matlab workspace. But, it did handle the above case and most of those like it that can be written as a simple set of fixed-width fields.

I still just cannot fathom why this area seems to have been such anathema to TMW that it wasn't solved from the get-go. It's such a shame they didn't use the Fortran FORMAT model instead of going with the C model to begin with, altho I understand it was simpler to keep with the tools that went with the primary development language.

Cedric on 6 Aug 2015

Edited: Cedric on 6 Aug 2015

Hi dpb,

To be honest, I started that kind of approach as a joke in another thread, a while ago. The question was: how to replace the ~ in a file, e.g.

3|2|~|7
1|5|~|12
2|28|~|137

with data, e.g. data={'John','Tim','Jordan'}. I proposed a one-liner for fun:

 >> sprintf( strrep( fileread( 'myFile.txt' ), '~', '%s' ), data{:} )
 ans =
   3|2|John|7
   1|5|Tim|12
   2|28|Jordan|137

that made me laugh because we are using the file content as a format spec, and just replace ~ with the formatting operator %s:

3|2|%s|7
1|5|%s|12
2|28|%s|137

so the content of the comma-separated list data{:} is inserted into the file content. But then I realized that the alternative is to parse the whole file content and rebuild it after inserting the data, which is not cool if the content is complex. So all in all it happens to be a very practical approach, but I didn't profile it with large files (SPRINTF parsing a 500MB format spec may not be that efficient.. who knows). Anyhow, that led me to thinking more about "reverse" approaches.
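A self-contained sketch of this "file content as format spec" idea, using an in-memory template instead of myFile.txt. The two escaping steps are an addition not in the one-liner above: SPRINTF would otherwise interpret any literal % or \ in the content as formatting or escape characters.

```matlab
% Sketch: use text as a SPRINTF format spec, with ~ marking the insertion points.
% Assumption: the template is built in memory here instead of read with fileread.
template = sprintf('%s\n', '3|2|~|7', '1|5|~|12', '2|28|~|137');
data     = {'John', 'Tim', 'Jordan'};

fmt = strrep(template, '%', '%%');   % protect literal percent signs
fmt = strrep(fmt, '\', '\\');        % protect literal backslashes
fmt = strrep(fmt, '~', '%s');        % turn each placeholder into %s

filled = sprintf(fmt, data{:});      % data{:} expands to one argument per name
disp(filled)
```

The number of placeholders should match the number of elements in `data`; with a mismatch SPRINTF either recycles the format or stops early.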

The reason I am using a regexp here is that the behavior of STRREP doesn't allow us to replace blocks of 6 white spaces easily (which may sound strange), whereas the regexp does. To illustrate:

 >> regexprep( 'aaaa', 'aa', 'b' )
 ans =
      bb
 >> strrep( 'aaaa', 'aa', 'b' )
 ans =
      bbb

Surprising, isn't it? The regexp "eats" the string, whereas STRREP finds all possible matches before replacement.

Speaking of regexp, if you are interested, I highly recommend section 2.26 (or 2.23, depending on your version of MATLAB) of the MATLAB Programming Fundamentals booklet (available here). It is a concise and very good introduction to regexp that covers most of the fundamental principles. It would probably take you an hour to get most of it, and then, if you click with the tool, you can probably say goodbye to your social life for a moment ;-)
