Reading in ascii files with white space as delimiter.

Question

James Russell on 9 Nov 2015


https://www.mathworks.com/matlabcentral/answers/253940-reading-in-ascii-files-with-white-space-as-delimiter

Edited: dpb on 13 Nov 2015

I am trying to read in a very simple ascii file that looks like the following:

   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV
    hPa     m      C      C      %    g/kg    deg   knot     K      K      K 
-----------------------------------------------------------------------------
  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6
  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2
  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1
  ...

There seem to be a dozen functions that I can read this in with but I'm struggling with all of them.

The simplest seems to be dlmread. I'm currently using the command:

M = dlmread('radiosonde.ascii',' ',3,1)

However this seems to register a single space as the delimiter instead of all the white space. If I use:

M = dlmread('radiosonde.ascii')

It registers the white space as the delimiter, but then I cannot tell it to skip the header lines. Is there some way to specify white space as the delimiter while also ignoring the headers?

Is there a better way to do this? Why hasn't Mathworks streamlined reading text files to be one universal function?


Answer 1

Star Strider on 9 Nov 2015


https://www.mathworks.com/matlabcentral/answers/253940-reading-in-ascii-files-with-white-space-as-delimiter#answer_199103

Edited: Star Strider on 9 Nov 2015


The dlmread function digests only numeric data so it will have problems with the strings.

I would use the textscan function:

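fidi = fopen('radiosonde.ascii','rt');
D = textscan(fidi, repmat('%f',1,11), 'Delimiter',' ', 'MultipleDelimsAsOne',true, 'HeaderLines',3, 'CollectOutput',true);
fclose(fidi);   % close the file once it has been read
M = D{1};       % with 'CollectOutput' the numeric columns come back as one matrix in the first cell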

You might need other name-value pair arguments, but this should get you started. The repmat call creates the input format string for the numerical data.
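For anyone who wants to try this without the original data, here is a minimal sketch that writes the three sample rows from the question to a temporary file and reads them back with the same textscan options. The temporary file name and the variable names (fname, hdr, M) are made up for the illustration; only the textscan call itself comes from the answer above.

```matlab
% Build a small test file that mimics the layout shown in the question.
% (fname is a throwaway temporary file, not the poster's real data file.)
hdr = {
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV'
    '    hPa     m      C      C      %    g/kg    deg   knot     K      K      K '
    repmat('-', 1, 77)
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6'
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2'
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'
    };
fname = [tempname '.ascii'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', hdr{:});
fclose(fid);

% Read it back: 11 numeric fields per row, runs of blanks treated as one
% delimiter, first three lines skipped as header.
fidi = fopen(fname, 'rt');
D = textscan(fidi, repmat('%f',1,11), 'Delimiter',' ', ...
    'MultipleDelimsAsOne',true, 'HeaderLines',3, 'CollectOutput',true);
fclose(fidi);
delete(fname);

M = D{1};   % numeric matrix, one row per sounding level
```

The repmat('%f',1,11) call simply expands to eleven '%f' specifiers, one per column, so the count has to be changed by hand if the file has a different number of columns.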

2 Comments

dpb on 13 Nov 2015

Edited: dpb on 13 Nov 2015


Actually, as noted in the follow on from the other thread of same subject http://www.mathworks.com/matlabcentral/answers/253939-reading-in-ascii-files-with-white-space-as-delimiter#comment_322672, if one uses an empty field for the format string, apparently textscan internally counts fields per record and automagically returns the right shape (at least for regular files such as this). And, neither specific 'Delimiter' nor the 'MultipleDelimsAsOne' fields are needed for the default white space. From the doc--"White space can be any combination of space (' '), backspace ('\b'), or tab ('\t') characters. If you do not specify a delimiter, textscan interprets repeated white-space characters as a single delimiter."

Consequently, all that's really needed for this specific file is

D = textscan(fid, '', 'HeaderLines',3, 'CollectOutput',true);

if you're satisfied with the cell array returned (which is why I almost always wrap the textscan call inside cell2mat or use textread instead).
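As a rough sketch of that shorter form, the snippet below writes the sample rows to a temporary file, reads them with an empty format string, and unwraps the cell with cell2mat. Bear in mind that the empty-format behavior is described in this thread as undocumented, so treat it as an observation rather than a guarantee; the file name and variable names are invented for the example.

```matlab
% Throwaway test file with the same layout as the question's data.
rows = {
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV'
    '    hPa     m      C      C      %    g/kg    deg   knot     K      K      K '
    repmat('-', 1, 77)
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6'
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2'
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'
    };
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', rows{:});
fclose(fid);

% Empty format string: textscan is left to work out the number of
% fields per record itself (undocumented behavior, per this thread).
fid = fopen(fname, 'rt');
D = textscan(fid, '', 'HeaderLines',3, 'CollectOutput',true);
fclose(fid);
delete(fname);

M = cell2mat(D);   % unwrap the 1x1 cell into an ordinary numeric matrix
```

Because 'CollectOutput' already gathers everything into a single cell, D{1} would give the same matrix as cell2mat(D) here.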

I spent some time this morning following up on my observation from yesterday and so far I find the behavior with an empty format string mentioned nowhere in the documentation. It's a key piece of knowledge that can help a bunch but isn't made known.

PS. I followed up from the result of the other thread by submitting an enhancement request for dlmread and friends and provided the suggested patch that doesn't change the interface at all. That it will get accepted I have little hope, but it is at least on TMW radar. I just provided the Tech Support rep who responded to the request with the observation here that the "feature" appears undocumented regarding the behavior with empty formatting string; also whether that'll make it into future doc remains to be seen. It seems to be a supported behavior given TMW relies on it in dlmread (and csvread is simply a wrapper around dlmread) and likely elsewhere.

dpb on 13 Nov 2015

Edited: dpb on 13 Nov 2015


"The dlmread function digests only numeric data so it will have problems with the strings."

It's not documented to work, correct, but it's also not guar-on-teed to fail...

>> type jr.txt
     PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV
      hPa     m      C      C      %    g/kg    deg   knot     K      K      K 
  -----------------------------------------------------------------------------
    994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6
    989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2
    972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1
>> dlmread('jr.txt',' ',3,1)  % explicit blank delimiter
ans =
  Columns 1 through 9
         0  994.0000         0         0         0  270.0000         0         0         0
         0  989.0000         0         0         0  312.0000         0         0         0
         0  972.0000         0         0         0  455.0000         0         0         0
 ...

This gets the data, but with the problem of the blank fields. That wouldn't be so bad if it did NaN infill instead of zero; then one could at least remove every column that is entirely NaN and get the desired result.
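To illustrate the column-removal idea, here is a small sketch that starts from the first nine columns of the dlmread output shown above (typed in by hand) and drops every column that contains nothing but the fill value. The variable names are made up. With zero infill this is only a workaround, since a genuine data column that happened to be all zeros would be thrown away too, which is the reason NaN infill would be preferable.

```matlab
% First nine columns of the dlmread(...,' ',3,1) result from above.
M = [0 994.0 0 0 0 270 0 0 0
     0 989.0 0 0 0 312 0 0 0
     0 972.0 0 0 0 455 0 0 0];

% Zero infill: keep only columns with at least one nonzero entry.
keep = any(M ~= 0, 1);
Mclean = M(:, keep);

% If the infill were NaN instead, the equivalent filter would be:
Mnan = M;
Mnan(:, ~keep) = NaN;                  % simulate NaN infill for the demo
MnanClean = Mnan(:, ~all(isnan(Mnan), 1));
```

Both Mclean and MnanClean end up holding just the pressure and height columns from this nine-column excerpt.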

The text header didn't seem to cause any problem; apparently because there are no embedded blanks or other oddities in the first header line so the count of possible delimiters isn't fouled up. Again, it's a bonus it works and shouldn't be relied on in general but worth noting the behavior methinks.

I went ahead here and did the unrecommended thing of making a patch in the TMW-supplied dlmread function as it seems harmless at worst and beneficial in general and I'm willing to accept that if it breaks it's my fault...

>> dlmread('jr.txt',[],3,0)  % the modified version
s =
Columns 1 through 9
994.0000  270.0000    7.0000    6.0000   93.0000    5.9300   40.0000   10.0000  280.6000
989.0000  312.0000    6.2000    5.2000   93.0000    5.6400   42.0000   12.0000  280.2000
972.0000  455.0000    4.8000    4.0000   95.0000    5.2700   48.0000   18.0000  280.2000
...
>>

And we see "magic has occurred"...

Looks to me like dlmread needs a redesign/rewrite -- why shouldn't the offset row count be used in the preliminary stage of delimiter auto-detection? That seems only reasonable that if one asks to ignore an area of the file to do so. The change from the traditional interface to named parameters here would be a.good.thing(tm)
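For what it's worth, later MATLAB releases (R2019a onward) include readmatrix, which does take named parameters and was not available when this thread was written. The sketch below is one way it might be applied to a file laid out like the one in the question; the temporary file and variable names are invented, and the option values are my own choice rather than anything recommended in this thread.

```matlab
% Throwaway test file with the same layout as the question's data.
rows = {
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV'
    '    hPa     m      C      C      %    g/kg    deg   knot     K      K      K '
    repmat('-', 1, 77)
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6'
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2'
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'
    };
fname = [tempname '.ascii'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', rows{:});
fclose(fid);

% 'FileType','text' is needed because .ascii is not a recognized extension.
% Runs of blanks are joined into one delimiter and leading blanks ignored.
M = readmatrix(fname, 'FileType','text', 'NumHeaderLines',3, ...
    'Delimiter',' ', 'ConsecutiveDelimitersRule','join', ...
    'LeadingDelimitersRule','ignore');
delete(fname);
```

Here the header skip and the delimiter handling are both stated up front by name, which is roughly the interface asked for above.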
