READ TABLE WITH DIFFERENT SETTING INSIDE
Show older comments
hello everyone, i have a problem with a dataset, in the same dataset, the data in the rows are writen in different way, in particular, have different space in the table (first and second screen) and matlab, read only the data written in the first way (with a lot of space from tcoluomns), someone knows the way to read both ?
with small data i'm able to load the file in excell and save, but this is bigger for excell



Accepted Answer
More Answers (2)
Emanuele De Astis
on 21 Feb 2021
12 Comments
opt.VariableNames={'Date','Sensor','Value','Activity'};
is only four variable names for five input variables for starters.
Without knowing what is what, what are we supposed to do?
What is wrong with the preceding timetable that was created for a file, specifically? Need to resolve what, if anything, is wrong there first before try something new and different.
>> head(MTA)
ans =
8×3 timetable
Time Sensor Value Activity
____________________ ________ _________ _________________
08-Sep-2009 08:47:23 {'M025'} {'ON' } {'R2_Leave_Home'}
08-Sep-2009 08:47:31 {'D001'} {'CLOSE'} {'R2_Leave_Home'}
08-Sep-2009 08:57:50 {'M024'} {'ON' } {'R1_Leave_Home'}
08-Sep-2009 08:57:56 {'M024'} {'OFF' } {'R1_Leave_Home'}
08-Sep-2009 22:59:58 {'M046'} {'ON' } {'R1_Sleep' }
08-Sep-2009 23:05:46 {'M050'} {'ON' } {'R1_Sleep' }
08-Sep-2009 23:32:28 {'M034'} {'ON' } {'R2_Sleep' }
08-Sep-2009 23:38:13 {'M047'} {'ON' } {'R1_Sleep' }
>>
There's nothing numeric in the file for most of those records...there are some half-dozen or so that do have a number in the field, but only a handful in comparison.
That's in the file itself, though, not in reading the file.
>> Values=str2double(MTA.Value);
>> sum(isfinite(Values))
ans =
4
>> MTA(isfinite(Values),:)
ans =
4×3 timetable
Time Sensor Value Activity
____________________ ________ _______ ____________
10-Sep-2009 23:04:13 {'P001'} {'418'} {'R2_Sleep'}
26-Sep-2009 09:00:24 {'P001'} {'457'} {'R2_Sleep'}
27-Sep-2009 08:55:28 {'P001'} {'394'} {'R2_Sleep'}
27-Sep-2009 08:56:18 {'P001'} {'384'} {'R1_Sleep'}
>>
Emanuele De Astis
on 21 Feb 2021
Emanuele De Astis
on 21 Feb 2021
dpb
on 21 Feb 2021
As noted above, that appears to be inherent in the file the way it was built, not the fault of the reading.
Upload a SMALL section of the file that illustrates what you think is a problem, specifically; don't make us have to try to fight through some 20K records looking for what we don't know we're looking for...then explain clearly and succinctly what it is that you think should be versus what is.
You could go a long way towards solving such issues if you would create the files as comma or semicolon delimited and with quoted strings instead of blank/tab-delimited with unquoted strings.
That's the biggest part of your problem--a poorly constructed file format.
Emanuele De Astis
on 21 Feb 2021
dpb
on 21 Feb 2021
Well, it may be "small" in comparison to the real data set, but I don't consider 138,022 records small for we volunteers in the forum to have to wade through to try to help, sorry.
I went to the trouble to pare the file down to some 25 records or so that showed the disparity in spacing you referred to in the orginal question and showed how to read it consistently -- now you've apparently changed the ground rules of what the problem may be.
If so, provide a similar SMALL file that illustrates precisely what it is that you're after -- although from what I can gather from what you've said, the problem is inherent in the way the file was written and can't be solved trivially if you mean that the records which have a numeric value as shown above are supposed to have been written with a missing value where a sensor ID belongs and another column for whatever that numeric field represents.
The file simply wasn't written that way, from what I can tell easily with what have to work with without having to spend my time parsing out what it is that is intended to look at from the file.
"Help us help you!!!"
OK, here's the proof of the problem--
>> fid=fopen('dataprova.txt','r');
>> fmt=['#tabs=%d, Record: %s \n'];
>> for i=1:12315,fgetl(fid);end % skip over beginning records
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end % read next 15, analyze content
#tabs=2, Record: 2009-09-09 01:47:16.062345 M046 ON R1_Sleep begin
#tabs=2, Record: 2009-09-09 01:47:17.073655 M047 ON
#tabs=2, Record: 2009-09-09 01:47:19.086398 M047 OFF
#tabs=2, Record: 2009-09-09 01:47:23.007103 M046 OFF
#tabs=2, Record: 2009-09-09 01:48:14.008333 M047 ON
#tabs=2, Record: 2009-09-09 01:48:18.052422 M047 OFF
#tabs=2, Record: 2009-09-09 01:48:36.043365 M047 ON
#tabs=2, Record: 2009-09-09 01:48:37.073454 M047 OFF
#tabs=2, Record: 2009-09-09 01:48:41.088045 M047 ON
#tabs=2, Record: 2009-09-09 01:48:44.008428 M047 OFF
#tabs=2, Record: 2009-09-09 01:48:49.077298 P001 1001.6
#tabs=2, Record: 2009-09-09 01:48:49.082215 P001 370
#tabs=2, Record: 2009-09-09 01:48:54.053609 M047 ON
#tabs=2, Record: 2009-09-09 01:49:00.060148 M047 OFF
#tabs=2, Record: 2009-09-09 01:51:11.006981 T004 24.5
>> fid=fclose(fid);
>>
The above shows the records such as these have only two tabs regardless what the data type of the columns are; there's no way a priori on reading the file to put those two types of data into a different variable.
This isn't the fault of reading the data, it's that the data file is poorly constructed if it is intended/needed to separate those two out.
Of course, I showed you can easily-enough find those records--string2double() will return NaN for those that aren't valid numerals.
Expecting anything more is going to have to happen by either fixing the input file format or writing code to do whatever it is that is wanted to be done with the malformed files you have.
Emanuele De Astis
on 21 Feb 2021
You're expecting the impossible...of creating information where it doesn't exist.
As the above shows, there SIMPLY IS NO DIFFERENCE in the two records of whether there is/is not a numeric value in the third variable position. PERIOD. END OF STORY. WISHING IT WERE OTHERWISE WON'T CHANGE FACTS.
If you want/need those values as some other variable, you'll have to create that as I showed how to find where they are and make a new variable of those values.
Or, more better, get whoever created these files to fix their output formatting.
However, I'll note that the routine I used AS I WROTE IT WITHOUT THE CHANGES YOU MADE will and does read the comment field separately from the action field reliably. See the prova timetable at the Answer above that shows Var3, Var4, Var5 are all separated correctly regardless of the content of Var4 column.
ADDUNDUM:
OBTW, it's even worse--I had seen the following section first:
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end
#tabs=3, Record: 2009-09-06 04:19:31.032325 M049 ON R1_Bed_Toilet_Transition begin,
#tabs=2, Record: 2009-09-06 04:19:33.050316 M049 OFF,
#tabs=2, Record: 2009-09-08 08:12:56.088385 M025 ON,
#tabs=2, Record: 2009-09-08 08:12:58.088943 M024 ON,
#tabs=2, Record: 2009-09-08 08:12:58.091278 M026 OFF,
#tabs=2, Record: 2009-09-08 08:12:59.098453 M021 ON,
#tabs=3, Record: 2009-09-08 08:13:00.082885 M019 ON R2_Meal_Preparation begin,
...
in which you'll note there are 3 tabs for the fields with remarks after the action(?) column. Those are parsed correctly as noted above.
However, when I did another section that has more numbers, then we see:
>> for i=1:12315,fgetl(fid);end
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end
#tabs=2, Record: 2009-09-09 01:47:16.062345 M046 ON R1_Sleep begin,
#tabs=2, Record: 2009-09-09 01:47:17.073655 M047 ON,
and the first output record there is missing the other delimiter as well. You'll have to also find records with longer string there and split it out.
(NB: The trailing comma above is a figment of the format string used--it had a spurious trailing comma in it; it's not real)
Emanuele De Astis
on 21 Feb 2021
dpb
on 21 Feb 2021
I showed that
>> opt=detectImportOptions('dataprova.txt',"ExpectedNumVariables",5,'ReadVariableNames',0,"NumHeaderLines",0);
>> prova=readtable('dataprova.txt',opt);
>> prova=table2timetable(prova(:,3:end),'RowTimes',prova.Var1+prova.Var2)
prova =
23×3 timetable
Time Var3 Var4 Var5
____________________ ________ ________ ____________________________
06-Sep-2009 04:19:31 {'M049'} {'ON' } {'R1_Bed_Toilet_Transition'}
06-Sep-2009 04:19:33 {'M049'} {'OFF' } {0×0 char }
08-Sep-2009 08:12:56 {'M025'} {'ON' } {0×0 char }
08-Sep-2009 08:12:58 {'M024'} {'ON' } {0×0 char }
08-Sep-2009 08:12:58 {'M026'} {'OFF' } {0×0 char }
08-Sep-2009 08:12:59 {'M021'} {'ON' } {0×0 char }
08-Sep-2009 08:13:00 {'M019'} {'ON' } {'R2_Meal_Preparation' }
08-Sep-2009 08:13:01 {'M021'} {'OFF' } {0×0 char }
08-Sep-2009 08:13:01 {'M025'} {'OFF' } {0×0 char }
08-Sep-2009 08:13:02 {'M018'} {'ON' } {0×0 char }
15-Sep-2009 13:57:58 {'P001'} {'1619'} {0×0 char }
15-Sep-2009 13:57:58 {'M017'} {'OFF' } {0×0 char }
15-Sep-2009 13:58:00 {'M018'} {'OFF' } {0×0 char }
15-Sep-2009 13:58:01 {'P001'} {'1633'} {0×0 char }
15-Sep-2009 13:58:11 {'M018'} {'ON' } {0×0 char }
15-Sep-2009 13:58:15 {'P001'} {'1623'} {0×0 char }
15-Sep-2009 13:58:16 {'M019'} {'OFF' } {0×0 char }
16-Sep-2009 07:10:18 {'M047'} {'OFF' } {0×0 char }
16-Sep-2009 07:10:20 {'M048'} {'ON' } {'R1_Work' }
16-Sep-2009 07:10:20 {'M046'} {'OFF' } {0×0 char }
04-Oct-2009 01:17:02 {'M029'} {'ON' } {0×0 char }
04-Oct-2009 01:17:04 {'M037'} {'ON' } {'R2_Personal_Hygiene' }
04-Oct-2009 01:17:04 {'M036'} {'OFF' } {0×0 char }
>>
Continuing on from there is straightforward-enough...
prova.Properties.VariableNames={'Sensor','Value','Activity'}; % set variable names
prova.State=prova.Value; % create new "State" column
prova.Value=str2double(prova.Value); % convert Value to numeric
prova.State(isfinite(prova.Value))={''}; % set State numeric-->missing
prova.State=categorical(prova.State); % convert to categorical
prova.Sensor=categorical(prova.Sensor);
prova.Activity=categorical(prova.Activity);
prova=prova(:,[1 2 4 3]); % rearrange order for convenience
Above results in:
>> prova
prova =
23×4 timetable
Time Sensor Value State Activity
____________________ ______ _____ ___________ ________________________
06-Sep-2009 04:19:31 M049 NaN ON R1_Bed_Toilet_Transition
06-Sep-2009 04:19:33 M049 NaN OFF <undefined>
08-Sep-2009 08:12:56 M025 NaN ON <undefined>
08-Sep-2009 08:12:58 M024 NaN ON <undefined>
08-Sep-2009 08:12:58 M026 NaN OFF <undefined>
08-Sep-2009 08:12:59 M021 NaN ON <undefined>
08-Sep-2009 08:13:00 M019 NaN ON R2_Meal_Preparation
08-Sep-2009 08:13:01 M021 NaN OFF <undefined>
08-Sep-2009 08:13:01 M025 NaN OFF <undefined>
08-Sep-2009 08:13:02 M018 NaN ON <undefined>
15-Sep-2009 13:57:58 P001 1619 <undefined> <undefined>
15-Sep-2009 13:57:58 M017 NaN OFF <undefined>
15-Sep-2009 13:58:00 M018 NaN OFF <undefined>
15-Sep-2009 13:58:01 P001 1633 <undefined> <undefined>
15-Sep-2009 13:58:11 M018 NaN ON <undefined>
15-Sep-2009 13:58:15 P001 1623 <undefined> <undefined>
15-Sep-2009 13:58:16 M019 NaN OFF <undefined>
16-Sep-2009 07:10:18 M047 NaN OFF <undefined>
16-Sep-2009 07:10:20 M048 NaN ON R1_Work
16-Sep-2009 07:10:20 M046 NaN OFF <undefined>
04-Oct-2009 01:17:02 M029 NaN ON <undefined>
04-Oct-2009 01:17:04 M037 NaN ON R2_Personal_Hygiene
04-Oct-2009 01:17:04 M036 NaN OFF <undefined>
>>
As noted, you'll want to find the 'State' records that also contain the Activity comment and split those off as well; it's just more of similar kind of edit operations to find "who's who in the zoo!" of containing something other than what is supposed to be in the field.
It still would be far better to get the data written correctly to begin with; but if it can't be regenerated from some other format in which it has been save, then something like the above is the option available to you.
Emanuele De Astis
on 21 Feb 2021
0 votes
Categories
Find more on Tables in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!