Reading from a variable length text file

sushmita das on 21 Feb 2017
Commented: dpb on 4 Apr 2017
I have a text file with a variable number of comma-separated columns in each row. How can I read the data using MATLAB?
newDataLine = textscan(fid, '%s %d%*[^\n]', 'Delimiter',',');
This is what I have tried so far; I can't figure out a solution. Thank you. Best regards, Sushmita
Comments
dpb on 4 Apr 2017
Edited: dpb on 4 Apr 2017
Why can't you use one of the previous examples, which either read the data in groups by the actual number of fields in each group (if that's what the data really are) or, as the last one shows, use the longest record length and fill in with a missing value (NaN by default)?
I've shown above how to read the file and determine the number of delimiters; the number of fields is that plus one.
If you don't have the information outside the file and won't (or for some reason can't) write it as the first record in the file, reading the file to determine what it is is about the only choice left(*), short of the crystal-ball route of automagically reading the file without user intervention with something like uiimport or the like, which just hides the fact that the above, or something very similar, is what it has done to discover the answer.
As the last Answer posted shows, the actual code required is almost trivial once you know the magic number but the magic number isn't really magic.
(*) I suppose there is one other alternative, albeit a kinda' klunky and error-prone one: write a companion file that holds the record-size info for the main file. This path is rife with the obvious issues, of course, but if you're adamant about not otherwise making the information known, "any port in a storm".
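(For reference, a minimal sketch of the delimiter-count scan described above, using the same textread/strfind idiom as the Answers below; the file name is a placeholder:)
s=textread('yourfile.csv','%s','delimiter','\n','whitespace','');  % one cellstr element per line
nDelim=cellfun(@length,strfind(s,','));  % number of delimiters on each line
nFldMax=max(nDelim)+1;                   % maximum number of fields is delimiters plus one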
dpb on 4 Apr 2017
"...the no. of fields are 16/17."
??? If you did eliminate the trailing delimiter on the two offending records, then, since those two were the only records with 16 delimiters, the new maximum would be 15 delimiters. That implies a maximum of 16 fields, not 17. Is that not the intended result? If not, we're still indeterminate on the rules to be used.


Answers (4)

dpb on 27 Feb 2017
Edited: dpb on 28 Feb 2017
importdata is almost smart enough--
>> dat=importdata('sushim.csv');
dat =
data: [8x14 double]
textdata: {9x1 cell}
>> dat.data
ans =
1.0e+04 *
2.0000 2.0000 0 0.0034 0.0163 0.0092 0.0461 0.0095 0.0373 0.0284 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0095 0.0461 0.0094 0.0373 0.0287 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0094 0.0461 0.0091 0.0373 0.0279 0.0350 0.0067 0 0
2.0000 2.0000 0.0000 0.0034 0.0163 0.0092 0.0461 0.0093 0.0373 0.0274 0.0350 0.0082 0.0045 0.0061
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2.0000 2.0000 0.0000 0.0034 0.0163 0.0094 0.0461 0.0094 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1.8182 2.0000 0.0001 0.0034 0.0163 0.0094 0.0461 0.0095 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
>> data=dat.data(~any(isnan(dat.data),2),:)
data =
1.0e+04 *
2.0000 2.0000 0 0.0034 0.0163 0.0092 0.0461 0.0095 0.0373 0.0284 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0095 0.0461 0.0094 0.0373 0.0287 0.0350 0.0067 0 0
2.0000 2.0000 0 0.0034 0.0163 0.0094 0.0461 0.0091 0.0373 0.0279 0.0350 0.0067 0 0
2.0000 2.0000 0.0000 0.0034 0.0163 0.0092 0.0461 0.0093 0.0373 0.0274 0.0350 0.0082 0.0045 0.0061
2.0000 2.0000 0.0000 0.0034 0.0163 0.0094 0.0461 0.0094 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
1.8182 2.0000 0.0001 0.0034 0.0163 0.0094 0.0461 0.0095 0.0373 0.0278 0.0350 0.0076 0.0045 0.0061
>> dat.textdata
ans =
'09:45:06'
'09:48:11'
'09:51:16'
'09:54:26'
'61'
'09:57:33'
'51'
'10:00:47'
'51'
>>
What you see is that the final fields of the longer records are interpreted as the next time field; when they fail to match the format established for the first column earlier, the rest of the corresponding record is filled with NaN.
One could reconstruct the correct data by reattaching those values from the text field to the record just before the corresponding NaN record.
I've got another commitment; gotta' run just now, but that looks to be a workable solution, if not exactly a clean one.
The alternative, I think, would be to read each record as text and count the delimiters before converting. Or, read the whole file as a cellstr array, augment the shorter records to match, then convert the whole thing.
The better solution to either would probably be to fix the file-generation process to produce a consistent file format to begin with.
ADDENDUM To the last suggestion--or at least write the number of records and the record size for each group to the file before each set of differing-length records. Then you could read that header record, build the proper format string dynamically, and read the subsequent group. This could easily be encapsulated in a loop enclosing that sequence to read the entire file; a sketch of the idea follows.
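(A minimal sketch of that idea, assuming a hypothetical file 'grouped.csv' in which each group is preceded by a header record of the form nRecords,nFields; the file name and header layout are illustrative only:)
fid=fopen('grouped.csv','r');
c={};                                            % accumulate one group per cell
while ~feof(fid)
  hdr=textscan(fid,'%f%f',1,'Delimiter',',');    % read the group header: nRecords,nFields
  if isempty(hdr{1}), break, end                 % nothing left to read
  nRec=hdr{1}; nFld=hdr{2};
  fmt=['%s' repmat('%f',1,nFld-1)];              % time string plus the numeric fields
  c{end+1}=textscan(fid,fmt,nRec,'Delimiter',',','CollectOutput',1);
end
fid=fclose(fid);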

dpb on 1 Mar 2017
Edited: dpb on 2 Mar 2017
Will show another approach based on previous comments...
>> data=textread('sushim.csv','%s','delimiter','\n','whitespace',''); % return as cellstr array
>> n=cellfun(@length,strfind(data,',')).' % count delimiters/line
n =
14 14 14 15 15 15
>> N=find(diff(n)) % number per group same
N =
3
>> u=unique(n); % the distinct delimiter counts, one per group
>> fid=fopen('sushim.csv'); % preliminary info collected; now we can read
>> for i=1:length(u) % number of groups
fmt=['%s' repmat('%f',1,u(i))]; % build column-specific format string for this group
c(i,:)=textscan(fid,fmt,N,'delimiter',',','collectoutput',1); % and read N records of the group
end
>> fid=fclose(fid); % done with file handle/close file
>> c % and what did we get??? A group for each length of columns
c =
{3x1 cell} [3x14 double]
{3x1 cell} [3x15 double]
>> c{:,1} % date column just for grins...comes out as comma-separated list
ans =
'09:45:06'
'09:48:11'
'09:51:16'
ans =
'09:54:26'
'09:57:33'
'10:00:47'
>>
This reads the file twice, unfortunately, but one needs to know the format before parsing. One could use the data already in memory from the first read, but textscan isn't cellstr-literate, so to do that one would have to loop over each line. For large files this still might be faster, but it will be left as an "exercise for the student"...although it's basically just a loop over size(data,1). A sketch is below.
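(For what it's worth, a minimal sketch of that loop, reusing the data and n variables built above--textscan does accept a single character vector as its first argument; treat this as an illustration rather than tested code:)
vals=cell(size(data));                       % numeric fields, one cell per line
t=cell(size(data));                          % time-of-day string per line
for i=1:size(data,1)
  fmt=['%s' repmat('%f',1,n(i))];            % format matching this line's field count
  tmp=textscan(data{i},fmt,'Delimiter',',','CollectOutput',1);
  t(i)=tmp{1};                               % the time string
  vals{i}=tmp{2};                            % the numeric values (length varies by line)
end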

Rob Jacobs on 3 Apr 2017
Another option is to open this file in the Import Tool. It detects the extra columns and fills in the missing data with NaNs.
>> uiimport('test.txt')
You can import from there directly, or select to generate a script or function to do the import.

dpb on 3 Apr 2017
Edited: dpb on 3 Apr 2017
Yet another...
nColTot=16; % total number of columns--either known a priori or from scanning the file first
fmt=['%s' repmat('%f',1,nColTot-1)]; % format string--1 string column, rest floats
fid=fopen('yourfile.csv','r'); % open file
data=textscan(fid,fmt,'Delimiter',',','CollectOutput',1);
fid=fclose(fid);
Alternatively, if you have a recent release with the '%D' datetime conversion specifier supported, go ahead and convert the times on input--
>> fmt=['%{HH:mm:ss}D' repmat('%f',1,15)];
>> fid=fopen('yourfile.csv','r'); % reopen the file (it was closed above; frewind(fid) would do if it were still open)
>> dat=textscan(fid, fmt, 'Delimiter', ',','collectoutput',true);
>> dat(1)
ans =
[6x1 datetime]
>> dat{1}
ans =
09:45:06
09:48:11
09:51:16
09:54:26
09:57:33
10:00:47
>>
which may be more convenient.
The key is that somewhere you need to know the maximum number of fields so textscan can figure out which fields are missing and fill them with the missing value, if the file doesn't have the explicit delimiters to indicate same...
ADDENDUM
I see there's a date as well as a time in the actual file; in that case using %D really does have an advantage, although the format should then include the date:
fmt=['%{dd-MMM-yyyy HH:mm:ss}D' repmat('%f',1,15)];
NB: in contrast to the other conversion characters, the %D descriptor takes its datetime format enclosed in curly braces within the specifier, as shown above. This is required; it will error otherwise.
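(Putting the pieces together, a minimal end-to-end sketch that scans for the maximum field count as in the earlier Answer and then reads with the %D format; the file name and date format are placeholders to adjust for the real file:)
lines=textread('yourfile.csv','%s','delimiter','\n','whitespace','');  % one cell per line
nMax=max(cellfun(@(s)sum(s==','),lines));                              % most delimiters on any line
fmt=['%{dd-MMM-yyyy HH:mm:ss}D' repmat('%f',1,nMax)];                  % datetime plus nMax numeric fields
fid=fopen('yourfile.csv','r');
dat=textscan(fid,fmt,'Delimiter',',','CollectOutput',true);
fid=fclose(fid);
t=dat{1};                                                              % datetime column
x=dat{2};                                                              % numeric matrix, NaN where fields were missing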
