Reading from a variable length text file

sushmita das on 21 Feb 2017
Commented: dpb on 4 Apr 2017
I have a text file with a variable number of comma-separated columns in each row. How can I read the data using MATLAB?
newDataLine = textscan(fid, '%s %d%*[^\n]', 'Delimiter',',');
This is what I have tried so far; I can't figure out a solution. Thank you. Best regards, Sushmita
Comments
dpb on 4 Apr 2017
Edited: dpb on 4 Apr 2017
Why can't you use one of the previous examples, which either read the data in groups by the actual number of fields in each group (if that's what the data really are) or, as the last one shows, use the longest record length and fill in with a missing value (NaN by default)?
I've shown above how to read the file and determine the number of delimiters; the number of fields is that plus one.
If you don't have the information outside the file and won't (or for some reason can't) write it as the first record in the file, reading the file to determine what it is is about the only choice left(*), short of the crystal-ball route of automagically reading the file without user intervention with something like uiimport or the like, which just hides the fact that the above, or something very similar, is what it has done to discover the answer.
As the last Answer posted shows, the actual code required is almost trivial once you know the magic number but the magic number isn't really magic.
(*) I suppose there is one other alternative, albeit a kinda' klunky and error-prone one: write a companion file that holds the record-size info for the main file. This path is rife with the obvious issues, of course, but if you're adamant about not otherwise making the information known, "any port in a storm".
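(For reference, a minimal sketch of the delimiter-count scan described above, using the same textread/strfind idiom as the Answers below; the file name is a placeholder:)
s=textread('yourfile.csv','%s','delimiter','\n','whitespace','');  % one cellstr element per line
nDelim=cellfun(@length,strfind(s,','));  % number of delimiters on each line
nFldMax=max(nDelim)+1;                   % maximum number of fields is delimiters plus one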
dpb on 4 Apr 2017
"...the no. of fields are 16/17."
??? If you did eliminate the trailing delimiter on the two offending records, then, since those two were the only records with 16 delimiters, the new maximum would be 15 delimiters. That implies a maximum of 16 fields, not 17. Is that not the intended result? If not, we're still indeterminate on the rules to be used.


Answers (4)

dpb on 27 Feb 2017
Edited: dpb on 28 Feb 2017
importdata is almost smart enough--
>> dat=importdata('sushim.csv');
dat =
data: [8x14 double]
textdata: {9x1 cell}
>> dat.data
ans =
1.0e+04 *
2.0000 2.0000 0 0.0034 0.0163 0.0092 0.0461 0.0095 0.0373 0.0284 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0095 0.0461 0.0094 0.0373 0.0287 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0094 0.0461 0.0091 0.0373 0.0279 0.0350 0.0067 0 0
2.0000 2.0000 0.0000 0.0034 0.0163 0.0092 0.0461 0.0093 0.0373 0.0274 0.0350 0.0082 0.0045 0.0061
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2.0000 2.0000 0.0000 0.0034 0.0163 0.0094 0.0461 0.0094 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1.8182 2.0000 0.0001 0.0034 0.0163 0.0094 0.0461 0.0095 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
>> data=dat.data(~any(isnan(dat.data),2),:)
data =
1.0e+04 *
2.0000 2.0000 0 0.0034 0.0163 0.0092 0.0461 0.0095 0.0373 0.0284 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0095 0.0461 0.0094 0.0373 0.0287 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0094 0.0461 0.0091 0.0373 0.0279 0.0350 0.0067 0 0
2.0000 2.0000 0.0000 0.0034 0.0163 0.0092 0.0461 0.0093 0.0373 0.0274 0.0350 0.0082 0.0045 0.0061
2.0000 2.0000 0.0000 0.0034 0.0163 0.0094 0.0461 0.0094 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
1.8182 2.0000 0.0001 0.0034 0.0163 0.0094 0.0461 0.0095 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
>> dat.textdata
ans =
'09:45:06'
'09:48:11'
'09:51:16'
'09:54:26'
'61'
'09:57:33'
'51'
'10:00:47'
'51'
>>
What you see is that the final fields of the longer records are interpreted as the next time field; when they fail to match the format established for the first column earlier, the rest of the corresponding record is filled with NaN.
One could reconstruct the correct data by reattaching those values from the text field to the record just before the corresponding NaN record.
I've got another commitment; gotta' run just now, but that looks to be a workable solution, if not exactly a clean one.
The alternative, I think, would be to read each record as text and count the delimiters before converting. Or, read the whole file as a cellstr array, augment the shorter records to match, then convert the whole thing.
The better solution to either would probably be to fix the file-generation process to produce a consistent file format to begin with.
ADDENDUM To the last suggestion--or at least write the number of records and the record size for each group to the file before each set of differing-length records. Then you could read that header record, build the proper format string dynamically, and read the subsequent group. This could easily be encapsulated in a loop enclosing that sequence to read the entire file; a sketch of the idea follows.
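(A minimal sketch of that idea, assuming a hypothetical file 'grouped.csv' in which each group is preceded by a header record of the form nRecords,nFields; the file name and header layout are illustrative only:)
fid=fopen('grouped.csv','r');
c={};                                            % accumulate one group per cell
while ~feof(fid)
  hdr=textscan(fid,'%f%f',1,'Delimiter',',');    % read the group header: nRecords,nFields
  if isempty(hdr{1}), break, end                 % nothing left to read
  nRec=hdr{1}; nFld=hdr{2};
  fmt=['%s' repmat('%f',1,nFld-1)];              % time string plus the numeric fields
  c{end+1}=textscan(fid,fmt,nRec,'Delimiter',',','CollectOutput',1);
end
fid=fclose(fid);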

dpb on 1 Mar 2017
Edited: dpb on 2 Mar 2017
Will show another approach based on previous comments...
>> data=textread('sushim.csv','%s','delimiter','\n','whitespace',''); % return as cellstr array
>> n=cellfun(@length,strfind(data,',')).' % count delimiters/line
n =
14 14 14 15 15 15
>> N=find(diff(n)) % number per group same
N =
3
>> u=unique(n); % the distinct delimiter counts, one per group
>> fid=fopen('sushim.csv'); % preliminary info collected; now we can read
>> for i=1:length(u) % number of groups
fmt=['%s' repmat('%f',1,u(i))]; % build column-specific format string for this group
c(i,:)=textscan(fid,fmt,N,'delimiter',',','collectoutput',1); % and read N records of the group
end
>> fid=fclose(fid); % done with file handle/close file
>> c % and what did we get??? A group for each length of columns
c =
{3x1 cell} [3x14 double]
{3x1 cell} [3x15 double]
>> c{:,1} % date column just for grins...comes out as comma-separated list
ans =
'09:45:06'
'09:48:11'
'09:51:16'
ans =
'09:54:26'
'09:57:33'
'10:00:47'
>>
This reads the file twice, unfortunately, but one needs to know the format before parsing. One could use the data already in memory from the first read, but textscan isn't cellstr-literate, so to do that one would have to loop over each line. For large files this still might be faster, but it will be left as an "exercise for the student"...although it's basically just a loop over size(data,1). A sketch is below.
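(For what it's worth, a minimal sketch of that loop, reusing the data and n variables built above--textscan does accept a single character vector as its first argument; treat this as an illustration rather than tested code:)
vals=cell(size(data));                       % numeric fields, one cell per line
t=cell(size(data));                          % time-of-day string per line
for i=1:size(data,1)
  fmt=['%s' repmat('%f',1,n(i))];            % format matching this line's field count
  tmp=textscan(data{i},fmt,'Delimiter',',','CollectOutput',1);
  t(i)=tmp{1};                               % the time string
  vals{i}=tmp{2};                            % the numeric values (length varies by line)
end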

Rob Jacobs on 3 Apr 2017
Another option is to open this file in the Import Tool. It detects the extra columns and fills in the missing data with NaNs.
>> uiimport('test.txt')
You can import from there directly, or select to generate a script or function to do the import.

dpb on 3 Apr 2017
Edited: dpb on 3 Apr 2017
Yet another...
nColTot=16; % total number of columns--either known a priori or from scanning the file first
fmt=['%s' repmat('%f',1,nColTot-1)]; % format string--1 string column, rest floats
fid=fopen('yourfile.csv','r'); % open file
data=textscan(fid,fmt,'Delimiter',',','CollectOutput',1);
fid=fclose(fid);
Alternatively, if you have a recent release with the '%D' datetime conversion specifier supported, go ahead and convert the times on input--
>> fmt=['%{HH:mm:ss}D' repmat('%f',1,15)];
>> fid=fopen('yourfile.csv','r'); % reopen the file (it was closed above; frewind(fid) would do if it were still open)
>> dat=textscan(fid, fmt, 'Delimiter', ',','collectoutput',true);
>> dat(1)
ans =
[6x1 datetime]
>> dat{1}
ans =
09:45:06
09:48:11
09:51:16
09:54:26
09:57:33
10:00:47
>>
which may be more convenient.
The key is that somewhere you need to know the maximum number of fields so textscan can figure out which fields are missing and fill them with the missing value, if the file doesn't have the explicit delimiters to indicate same...
ADDENDUM
I see there's a date as well as a time in the actual file; in that case using %D really does have an advantage, although the format should then include the date:
fmt=['%{dd-MMM-yyyy HH:mm:ss}D' repmat('%f',1,15)];
NB: in contrast to the other conversion characters, the %D descriptor takes its datetime format enclosed in curly braces within the specifier, as shown above. This is required; it will error otherwise.
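(Putting the pieces together, a minimal end-to-end sketch that scans for the maximum field count as in the earlier Answer and then reads with the %D format; the file name and date format are placeholders to adjust for the real file:)
lines=textread('yourfile.csv','%s','delimiter','\n','whitespace','');  % one cell per line
nMax=max(cellfun(@(s)sum(s==','),lines));                              % most delimiters on any line
fmt=['%{dd-MMM-yyyy HH:mm:ss}D' repmat('%f',1,nMax)];                  % datetime plus nMax numeric fields
fid=fopen('yourfile.csv','r');
dat=textscan(fid,fmt,'Delimiter',',','CollectOutput',true);
fid=fclose(fid);
t=dat{1};                                                              % datetime column
x=dat{2};                                                              % numeric matrix, NaN where fields were missing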
