Reading in data with spaces

Art Hartzell on 5 Aug 2015
Edited: Cedric on 6 Aug 2015
I have a bunch of fixed-width data that contains spaces. Each "column" is a fixed number of characters wide (including space characters). I've been trying to use textscan with a format like '%6c %6c %6c %6c %6c', but it seems to (A) ignore spaces at the beginning of the string, which throws off the character count when, for example, the first number goes from 9 to 10, and (B) not recognize blank fields in the middle or at the end of a line. For example: "___8.5|___9.2|______|______|___7.6", where "_" represents one space and the "|" characters are added for clarity. I need this to read in with 8.5 in the first cell, 9.2 in the second, and 7.6 in the fifth. I can tolerate the values being strings (and converting later) as long as the column placement is preserved.
  1 Comment
dpb on 5 Aug 2015
Edited: dpb on 5 Aug 2015
You've stumbled into the black hole of C (and hence MATLAB) fixed-width, non-delimited input: there's no way to handle a missing field other than by reading the data as a character array and parsing the lines using counted string indexing. No input scanning format will honor the blanks.
See the preceding discussion on the subject for more than you really want to know of the issues that can arise...
As noted many times before, I've begged for TMW to provide a solution for 20+ yr to no avail.
Oh, I see you also did a search and came across the old (ca. 2009) TMW post apologizing for the limitation and demonstrating the fixed-width parsing.
I'd urge you to submit an enhancement request and add another "voice in the wilderness" to try to get something done (but I wouldn't hold my breath waiting... :( )
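The "counted string indexing" mentioned above can be sketched roughly as follows. This is only an illustration, not the code from the older TMW post: the sample rows, the field width of 6, and the variable names are assumptions made for the example.

```matlab
% Sketch: parse fixed-width fields by character position.
% Assumption: every line holds 5 fields of 6 characters each.
width   = 6 ;
nFields = 5 ;

% Hypothetical sample rows, built with sprintf/blanks so the widths are exact.
row1 = [ sprintf('%6.1f', 8.5),  sprintf('%6.1f', 9.2), blanks(12), sprintf('%6.1f', 7.6) ] ;
row2 = [ sprintf('%6.1f', 10.1), blanks(12), sprintf('%6.1f', 3.4), sprintf('%6.1f', 2) ] ;
rows = { row1 ; row2 } ;

data = nan( numel(rows), nFields ) ;
for r = 1 : numel( rows )
    % Pad short lines so that trailing blank fields are not lost.
    txt = [ rows{r}, blanks( max( 0, width*nFields - numel(rows{r}) ) ) ] ;
    % Reshape into one field per row of a char matrix.
    fields = reshape( txt(1:width*nFields), width, nFields )' ;
    % Blank fields become empty strings, which str2double maps to NaN.
    data(r,:) = str2double( cellstr( fields ) )' ;
end
disp( data )
```

Because each field is cut out by position, a leading blank or an entirely empty field cannot shift the columns.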


Accepted Answer

Cedric on 5 Aug 2015
Edited: Cedric on 5 Aug 2015
Here is a quick fix following a different path from what is mentioned in dpb's thread. Assume that we have test.txt with the following content (attached), where each field is 6 characters wide and a missing field is 6 blanks:
     1   9.3           8     7
        12.4     2            
Then running
buffer = fileread( 'test.txt' ) ;
buffer = regexprep( buffer, ' {6}', ' NaN' ) ;
data = reshape( sscanf( buffer, '%f' ), 5, [] )' ;
produces
>> data
data =
    1.0000    9.3000       NaN    8.0000    7.0000
       NaN   12.4000    2.0000       NaN       NaN
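To make the intermediate step visible, here is a self-contained variant that builds the buffer in memory instead of reading test.txt. The sample buffer and the variable name patched are assumptions for the illustration. It also assumes that values are right-justified within their fields and that trailing blanks are still present at the end of each line, since otherwise a run of 6 blanks would no longer line up with an empty field.

```matlab
% Build a two-line fixed-width sample in memory (stand-in for fileread('test.txt')).
nl = sprintf( '\n' ) ;
buffer = [ sprintf('%6d%6.1f', 1, 9.3), blanks(6), sprintf('%6d%6d', 8, 7), nl, ...
           blanks(6), sprintf('%6.1f%6d', 12.4, 2), blanks(12), nl ] ;

% Each run of exactly 6 blanks is an empty field; replace it with a
% 6-character NaN token so the field widths are preserved.
patched = regexprep( buffer, ' {6}', '   NaN' ) ;
disp( patched )

% sscanf skips whitespace and reads 'NaN' as a number, so the values
% arrive in file order and can be reshaped into 5 columns.
data = reshape( sscanf( patched, '%f' ), 5, [] )' ;
disp( data )
```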
  7 Comments
dpb on 5 Aug 2015
I'm such a klutz w/ regexp that I leave such suggestions to the whizards who know the syntax.
Cedric, that way of substituting/inserting the delimiter is a new one to me; I'll save it in my goodies file for future use, as it is a very useful trick. I presume it can be expanded to handle an arbitrary set of field widths and number of fields?
In the (now getting to be distant) past and a previous life, I had written a Fortran mex file that took a FORMAT string and passed it to the Fortran i/o library to read the formatted file/buffer. Back in the early days, when I first requested the enhancement, I supplied that as a model to TMW for their consideration. Unfortunately, with the retirement and move back to the farm, the (as it turns out, only) source for it was inadvertently left on the machine at the former employer's site, and by the time I discovered this the machine had been repurposed and wiped clean.
I started some months ago to try to recreate it and got as far as finally having a working environment to build Fortran mex files again, but then spring intervened, so it has had to be put on hold for this farming season; I won't have time to delve into it again until after fall harvest. The idea was to make a FEX submittal that would solve a large fraction of these cases. To the best of my recollection, the previous incarnation was relatively flexible but had some significant "issues" with totally general FORMAT statements, such as format recursion and the like, when working out what the output actually was in order to return it to the MATLAB workspace. But it did handle the above case and most of those like it that can be written as a simple set of fixed-width fields.
I still just cannot fathom why this area seems to have been such anathema to TMW that it wasn't solved from the get-go. It's such a shame they didn't use the Fortran FORMAT model instead of going with the C model to begin with, although I understand it was simpler to stay with the tools that came with the primary development language.
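On the question of arbitrary field widths raised in this comment: one possible sketch, which swaps the NaN substitution for regexp token extraction, builds one capture group per field from a vector of widths. The widths, the sample rows, and the variable names below are hypothetical, and the sketch assumes every line contains at least sum(widths) characters.

```matlab
% Hypothetical layout: three fields of 4, 6 and 3 characters.
widths = [4 6 3] ;

% Two sample lines with blank fields, built so the widths are exact.
nl = sprintf( '\n' ) ;
buffer = [ sprintf('%4d', 12), blanks(6), sprintf('%3d', 7), nl, ...
           blanks(4), sprintf('%6.1f', 3.5), blanks(3), nl ] ;

% One capture group per field: '(.{4})(.{6})(.{3})'.
pattern = sprintf( '(.{%d})', widths ) ;

% 'dotexceptnewline' keeps a match from running across a line break.
tok = regexp( buffer, pattern, 'tokens', 'dotexceptnewline' ) ;

% tok is a cell of per-line token cells; stack them and convert.
% Blank fields convert to NaN.
data = str2double( vertcat( tok{:} ) ) ;
disp( data )
```

Unlike the substitution trick, this does not depend on the values being right-justified, because each field is captured by position rather than inferred from runs of blanks.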
Cedric on 6 Aug 2015
Edited: Cedric on 6 Aug 2015
Hi dpb,
To be honest, I started that kind of approach as a joke in another thread, a while ago. The question was: how to replace the ~ in a file, e.g.
1|3|2|~|7
2|1|5|~|12
3|2|28|~|137
with data, e.g. data={'John','Tim','Jordan'}. I proposed a one-liner for fun:
>> sprintf( strrep( fileread( 'myFile.txt' ), '~', '%s' ), data{:} )
ans =
1|3|2|John|7
2|1|5|Tim|12
3|2|28|Jordan|137
that made me laugh because we are using the file content as a format spec, and just replace ~ with the formatting operator %s:
1|3|2|%s|7
2|1|5|%s|12
3|2|28|%s|137
so the content of the comma-separated list (CSL) data{:} is inserted into the file content. But then I realized that the alternative is to parse the whole file content and rebuild it after inserting the data, which is not cool if the content is complex. So all in all it happens to be a very practical approach, but I didn't profile it with large files (SPRINTF parsing a 500MB format spec may not be that efficient... who knows). Anyhow, that led me to think more about "reverse" approaches.
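One caveat with using file content as a format spec is that any literal % or \ already in the file would also be interpreted by SPRINTF. A minimal sketch of a more defensive variant, using an in-memory string as a stand-in for the file and a made-up "100%" entry to show the problem, escapes those characters before inserting the %s operators:

```matlab
% Stand-in for fileread('myFile.txt'); the last line contains a literal
% percent sign (hypothetical content, added to show the escaping).
txt  = sprintf( '1|3|2|~|7\n2|1|5|~|12\n3|2|28|~|100%%\n' ) ;
data = { 'John', 'Tim', 'Jordan' } ;

% Escape characters that sprintf would otherwise interpret, then turn
% each ~ into a %s operator. The order matters: escape first.
fmt = strrep( txt, '\', '\\' ) ;
fmt = strrep( fmt, '%', '%%' ) ;
fmt = strrep( fmt, '~', '%s' ) ;

out = sprintf( fmt, data{:} ) ;
disp( out )
```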
The reason I am using a regexp here is that the behavior of STRREP doesn't allow us to replace blocks of 6 white spaces easily (which may sound strange), whereas the regexp does. To illustrate:
>> regexprep( 'aaaa', 'aa', 'b' )
ans =
bb
>> strrep( 'aaaa', 'aa', 'b' )
ans =
bbb
Surprising, isn't it? The regexp "eats" the string, whereas STRREP finds all possible matches before replacement.
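Applied to the blanks in the fixed-width file, the same difference can be sketched as follows, assuming STRREP handles overlapping matches the way the 'aaaa' example above shows. Two adjacent empty fields are 12 blanks, and only the regexp yields one replacement per field; the placeholder 'N' is just for counting.

```matlab
gap = blanks( 12 ) ;                          % two adjacent empty 6-character fields

viaRegexp = regexprep( gap, ' {6}', 'N' ) ;   % matches do not overlap
viaStrrep = strrep( gap, blanks(6), 'N' ) ;   % overlapping matches are replaced too

fprintf( 'regexprep result has %d characters\n', numel( viaRegexp ) ) ;
fprintf( 'strrep result has %d characters\n',    numel( viaStrrep ) ) ;
```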
Speaking of regexp, if you are interested, I highly recommend section 2.26 (or 2.23, depending on your version of MATLAB) of the MATLAB Programming Fundamentals booklet (available here). It is a very good, concise introduction to regexp that covers most of the fundamental principles. It would probably take an hour for you to get most of it, and then, if you click on the tool, you can probably say goodbye to your social life for a moment ;-)
