READ TABLE WITH DIFFERENT SETTING INSIDE

hello everyone, i have a problem with a dataset, in the same dataset, the data in the rows are writen in different way, in particular, have different space in the table (first and second screen) and matlab, read only the data written in the first way (with a lot of space from tcoluomns), someone knows the way to read both ?
with small data i'm able to load the file in excell and save, but this is bigger for excell

 Accepted Answer

dpb
dpb on 21 Feb 2021
Edited: dpb on 21 Feb 2021
It's a fignewton of looking at the text in the editor or on the screen--some was tab-delimited, some space but there's only one delimiter between columns.
I shortened the first (LH) attachment to only a few lines that contained samples of both...then used detectImportOptions with some guidance and seems to work just fine--
>> opt=detectImportOptions('dataprova.txt',"ExpectedNumVariables",5,'ReadVariableNames',0,"NumHeaderLines",0);
>> prova=readtable('dataprova.txt',opt);
>> prova=table2timetable(prova(:,3:end),'RowTimes',prova.Var1+prova.Var2)
prova =
23×3 timetable
Time Var3 Var4 Var5
____________________ ________ ________ ____________________________
06-Sep-2009 04:19:31 {'M049'} {'ON' } {'R1_Bed_Toilet_Transition'}
06-Sep-2009 04:19:33 {'M049'} {'OFF' } {0×0 char }
08-Sep-2009 08:12:56 {'M025'} {'ON' } {0×0 char }
08-Sep-2009 08:12:58 {'M024'} {'ON' } {0×0 char }
08-Sep-2009 08:12:58 {'M026'} {'OFF' } {0×0 char }
08-Sep-2009 08:12:59 {'M021'} {'ON' } {0×0 char }
08-Sep-2009 08:13:00 {'M019'} {'ON' } {'R2_Meal_Preparation' }
08-Sep-2009 08:13:01 {'M021'} {'OFF' } {0×0 char }
08-Sep-2009 08:13:01 {'M025'} {'OFF' } {0×0 char }
08-Sep-2009 08:13:02 {'M018'} {'ON' } {0×0 char }
15-Sep-2009 13:57:58 {'P001'} {'1619'} {0×0 char }
15-Sep-2009 13:57:58 {'M017'} {'OFF' } {0×0 char }
15-Sep-2009 13:58:00 {'M018'} {'OFF' } {0×0 char }
15-Sep-2009 13:58:01 {'P001'} {'1633'} {0×0 char }
15-Sep-2009 13:58:11 {'M018'} {'ON' } {0×0 char }
15-Sep-2009 13:58:15 {'P001'} {'1623'} {0×0 char }
15-Sep-2009 13:58:16 {'M019'} {'OFF' } {0×0 char }
16-Sep-2009 07:10:18 {'M047'} {'OFF' } {0×0 char }
16-Sep-2009 07:10:20 {'M048'} {'ON' } {'R1_Work' }
16-Sep-2009 07:10:20 {'M046'} {'OFF' } {0×0 char }
04-Oct-2009 01:17:02 {'M029'} {'ON' } {0×0 char }
04-Oct-2009 01:17:04 {'M037'} {'ON' } {'R2_Personal_Hygiene' }
04-Oct-2009 01:17:04 {'M036'} {'OFF' } {0×0 char }
>>
I'd carry on from here by turning the Var3 (and possibly Var4 altho the mix of data there is confusing as to what will want as far as interpreting it) into categorical variable(s). Var5 might be more suited as a string than cellstr, your call. You need to provide meaningful variable names for them, of course.
You could try defining the import data format strings for the columns to read the date/time as a single datetime instead of as datetime/duration into two variables as above, an "exercise left for Student".
The file as it appeared when shortened is:
2009-09-06 04:19:31.032325 M049 ON R1_Bed_Toilet_Transition begin
2009-09-06 04:19:33.050316 M049 OFF
2009-09-08 08:12:56.088385 M025 ON
2009-09-08 08:12:58.088943 M024 ON
2009-09-08 08:12:58.091278 M026 OFF
2009-09-08 08:12:59.098453 M021 ON
2009-09-08 08:13:00.082885 M019 ON R2_Meal_Preparation begin
2009-09-08 08:13:01.091802 M021 OFF
2009-09-08 08:13:01.094138 M025 OFF
2009-09-08 08:13:02.035362 M018 ON
2009-09-15 13:57:58.081746 P001 1619
2009-09-15 13:57:58.090266 M017 OFF
2009-09-15 13:58:00.014018 M018 OFF
2009-09-15 13:58:01.082517 P001 1633
2009-09-15 13:58:11.050943 M018 ON
2009-09-15 13:58:15.082673 P001 1623
2009-09-15 13:58:16.066399 M019 OFF
2009-09-16 07:10:18.092304 M047 OFF
2009-09-16 07:10:20.048173 M048 ON R1_Work begin
2009-09-16 07:10:20.092411 M046 OFF
2009-10-04 01:17:02.070578 M029 ON
2009-10-04 01:17:04.040807 M037 ON R2_Personal_Hygiene begin
2009-10-04 01:17:04.043248 M036 OFF
>>

More Answers (2)

opt=detectImportOptions('Twor.txt',"ExpectedNumVariables",5,'ReadVariableNames',0,"NumHeaderLines",0)
opt.VariableNames={'Date','Sensor','Value','Activity'};
MTA=readtable('Twor.txt',opt);
MTA=MTA(contains(MTA.Activity,({'Sleep','Leave_Home'})),:); % save only the ''Sleep','Leave_Home' ' Activity for now
MTA.Sensor=categorical(MTA.Sensor); % turn sensor into categorical variable
MTA.Activity=categorical(MTA.Activity);
MTA.Value=str2double(MTA.Value); % and data into numeric
meanBySensor=rowfun(@mean,MTA,'InputVariables',{'Value'},'GroupingVariables',{'Sensor'},'OutputVariableNames','SensorMeans');
i try with your opt but don't change nothing

12 Comments

opt.VariableNames={'Date','Sensor','Value','Activity'};
is only four variable names for five input variables for starters.
Without knowing what is what, what are we supposed to do?
What is wrong with the preceding timetable that was created for a file, specifically? Need to resolve what, if anything, is wrong there first before try something new and different.
>> head(MTA)
ans =
8×3 timetable
Time Sensor Value Activity
____________________ ________ _________ _________________
08-Sep-2009 08:47:23 {'M025'} {'ON' } {'R2_Leave_Home'}
08-Sep-2009 08:47:31 {'D001'} {'CLOSE'} {'R2_Leave_Home'}
08-Sep-2009 08:57:50 {'M024'} {'ON' } {'R1_Leave_Home'}
08-Sep-2009 08:57:56 {'M024'} {'OFF' } {'R1_Leave_Home'}
08-Sep-2009 22:59:58 {'M046'} {'ON' } {'R1_Sleep' }
08-Sep-2009 23:05:46 {'M050'} {'ON' } {'R1_Sleep' }
08-Sep-2009 23:32:28 {'M034'} {'ON' } {'R2_Sleep' }
08-Sep-2009 23:38:13 {'M047'} {'ON' } {'R1_Sleep' }
>>
There's nothing numeric in the file for most of those records...there are some half-dozen or so that do have a number in the field, but only a handful in comparison.
That's in the file itself, though, not in reading the file.
>> Values=str2double(MTA.Value);
>> sum(isfinite(Values))
ans =
4
>> MTA(isfinite(Values),:)
ans =
4×3 timetable
Time Sensor Value Activity
____________________ ________ _______ ____________
10-Sep-2009 23:04:13 {'P001'} {'418'} {'R2_Sleep'}
26-Sep-2009 09:00:24 {'P001'} {'457'} {'R2_Sleep'}
27-Sep-2009 08:55:28 {'P001'} {'394'} {'R2_Sleep'}
27-Sep-2009 08:56:18 {'P001'} {'384'} {'R1_Sleep'}
>>
but witout setting variables name's i haven't the table reading correct
i have download a dataset about home occupacy, and i want to extract only the lines with the values ​​"Home" and "Sleep", but some of them are reading in Value coloumn, with the value of sensors,
As noted above, that appears to be inherent in the file the way it was built, not the fault of the reading.
Upload a SMALL section of the file that illustrates what you think is a problem, specifically; don't make us have to try to fight through some 20K records looking for what we don't know we're looking for...then explain clearly and succinctly what it is that you think should be versus what is.
You could go a long way towards solving such issues if you would create the files as comma or semicolon delimited and with quoted strings instead of blank/tab-delimited with unquoted strings.
That's the biggest part of your problem--a poorly constructed file format.
I uploaded a small portion of the file because the site does not let me upload large files.
what I need is to load the dataset and extract from this only some of the actions, like I said the actions with "Home" and "Sleep", but because of the dataset the information is read badly and I wanted to know if there was a way to force reading, such as setting a space as a division between columns. sorry if I'm unclear but I have started using matlab now. the file I uploaded previously is representative of the various forms in which the data is represented in the entire dataset
Well, it may be "small" in comparison to the real data set, but I don't consider 138,022 records small for we volunteers in the forum to have to wade through to try to help, sorry.
I went to the trouble to pare the file down to some 25 records or so that showed the disparity in spacing you referred to in the orginal question and showed how to read it consistently -- now you've apparently changed the ground rules of what the problem may be.
If so, provide a similar SMALL file that illustrates precisely what it is that you're after -- although from what I can gather from what you've said, the problem is inherent in the way the file was written and can't be solved trivially if you mean that the records which have a numeric value as shown above are supposed to have been written with a missing value where a sensor ID belongs and another column for whatever that numeric field represents.
The file simply wasn't written that way, from what I can tell easily with what have to work with without having to spend my time parsing out what it is that is intended to look at from the file.
"Help us help you!!!"
OK, here's the proof of the problem--
>> fid=fopen('dataprova.txt','r');
>> fmt=['#tabs=%d, Record: %s \n'];
>> for i=1:12315,fgetl(fid);end % skip over beginning records
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end % read next 15, analyze content
#tabs=2, Record: 2009-09-09 01:47:16.062345 M046 ON R1_Sleep begin
#tabs=2, Record: 2009-09-09 01:47:17.073655 M047 ON
#tabs=2, Record: 2009-09-09 01:47:19.086398 M047 OFF
#tabs=2, Record: 2009-09-09 01:47:23.007103 M046 OFF
#tabs=2, Record: 2009-09-09 01:48:14.008333 M047 ON
#tabs=2, Record: 2009-09-09 01:48:18.052422 M047 OFF
#tabs=2, Record: 2009-09-09 01:48:36.043365 M047 ON
#tabs=2, Record: 2009-09-09 01:48:37.073454 M047 OFF
#tabs=2, Record: 2009-09-09 01:48:41.088045 M047 ON
#tabs=2, Record: 2009-09-09 01:48:44.008428 M047 OFF
#tabs=2, Record: 2009-09-09 01:48:49.077298 P001 1001.6
#tabs=2, Record: 2009-09-09 01:48:49.082215 P001 370
#tabs=2, Record: 2009-09-09 01:48:54.053609 M047 ON
#tabs=2, Record: 2009-09-09 01:49:00.060148 M047 OFF
#tabs=2, Record: 2009-09-09 01:51:11.006981 T004 24.5
>> fid=fclose(fid);
>>
The above shows the records such as these have only two tabs regardless what the data type of the columns are; there's no way a priori on reading the file to put those two types of data into a different variable.
This isn't the fault of reading the data, it's that the data file is poorly constructed if it is intended/needed to separate those two out.
Of course, I showed you can easily-enough find those records--string2double() will return NaN for those that aren't valid numerals.
Expecting anything more is going to have to happen by either fixing the input file format or writing code to do whatever it is that is wanted to be done with the malformed files you have.
ok sorry, i don't think i changed the ground rules of what the problem may be, because to extract the value i need i way to read in the correct way the different coloumns, but if i use your code the dataset countinues continues to be read in a "wrong" way, in which the rows in which there are not enough spaces, the data are read in the same column even if they belong to another column, for example in the sensor value column also enters the action and for this reason then I cannot extract the information. now I understand that the problem is in the dataset. but first I was hoping you could find a way to make the various lines readable even if written in different ways within the same dataset. I hope I was clear
dpb
dpb on 21 Feb 2021
Edited: dpb on 21 Feb 2021
You're expecting the impossible...of creating information where it doesn't exist.
As the above shows, there SIMPLY IS NO DIFFERENCE in the two records of whether there is/is not a numeric value in the third variable position. PERIOD. END OF STORY. WISHING IT WERE OTHERWISE WON'T CHANGE FACTS.
If you want/need those values as some other variable, you'll have to create that as I showed how to find where they are and make a new variable of those values.
Or, more better, get whoever created these files to fix their output formatting.
However, I'll note that the routine I used AS I WROTE IT WITHOUT THE CHANGES YOU MADE will and does read the comment field separately from the action field reliably. See the prova timetable at the Answer above that shows Var3, Var4, Var5 are all separated correctly regardless of the content of Var4 column.
ADDUNDUM:
OBTW, it's even worse--I had seen the following section first:
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end
#tabs=3, Record: 2009-09-06 04:19:31.032325 M049 ON R1_Bed_Toilet_Transition begin,
#tabs=2, Record: 2009-09-06 04:19:33.050316 M049 OFF,
#tabs=2, Record: 2009-09-08 08:12:56.088385 M025 ON,
#tabs=2, Record: 2009-09-08 08:12:58.088943 M024 ON,
#tabs=2, Record: 2009-09-08 08:12:58.091278 M026 OFF,
#tabs=2, Record: 2009-09-08 08:12:59.098453 M021 ON,
#tabs=3, Record: 2009-09-08 08:13:00.082885 M019 ON R2_Meal_Preparation begin,
...
in which you'll note there are 3 tabs for the fields with remarks after the action(?) column. Those are parsed correctly as noted above.
However, when I did another section that has more numbers, then we see:
>> for i=1:12315,fgetl(fid);end
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end
#tabs=2, Record: 2009-09-09 01:47:16.062345 M046 ON R1_Sleep begin,
#tabs=2, Record: 2009-09-09 01:47:17.073655 M047 ON,
and the first output record there is missing the other delimiter as well. You'll have to also find records with longer string there and split it out.
(NB: The trailing comma above is a figment of the format string used--it had a spurious trailing comma in it; it's not real)
ok but how can i use this results?
I showed that
>> opt=detectImportOptions('dataprova.txt',"ExpectedNumVariables",5,'ReadVariableNames',0,"NumHeaderLines",0);
>> prova=readtable('dataprova.txt',opt);
>> prova=table2timetable(prova(:,3:end),'RowTimes',prova.Var1+prova.Var2)
prova =
23×3 timetable
Time Var3 Var4 Var5
____________________ ________ ________ ____________________________
06-Sep-2009 04:19:31 {'M049'} {'ON' } {'R1_Bed_Toilet_Transition'}
06-Sep-2009 04:19:33 {'M049'} {'OFF' } {0×0 char }
08-Sep-2009 08:12:56 {'M025'} {'ON' } {0×0 char }
08-Sep-2009 08:12:58 {'M024'} {'ON' } {0×0 char }
08-Sep-2009 08:12:58 {'M026'} {'OFF' } {0×0 char }
08-Sep-2009 08:12:59 {'M021'} {'ON' } {0×0 char }
08-Sep-2009 08:13:00 {'M019'} {'ON' } {'R2_Meal_Preparation' }
08-Sep-2009 08:13:01 {'M021'} {'OFF' } {0×0 char }
08-Sep-2009 08:13:01 {'M025'} {'OFF' } {0×0 char }
08-Sep-2009 08:13:02 {'M018'} {'ON' } {0×0 char }
15-Sep-2009 13:57:58 {'P001'} {'1619'} {0×0 char }
15-Sep-2009 13:57:58 {'M017'} {'OFF' } {0×0 char }
15-Sep-2009 13:58:00 {'M018'} {'OFF' } {0×0 char }
15-Sep-2009 13:58:01 {'P001'} {'1633'} {0×0 char }
15-Sep-2009 13:58:11 {'M018'} {'ON' } {0×0 char }
15-Sep-2009 13:58:15 {'P001'} {'1623'} {0×0 char }
15-Sep-2009 13:58:16 {'M019'} {'OFF' } {0×0 char }
16-Sep-2009 07:10:18 {'M047'} {'OFF' } {0×0 char }
16-Sep-2009 07:10:20 {'M048'} {'ON' } {'R1_Work' }
16-Sep-2009 07:10:20 {'M046'} {'OFF' } {0×0 char }
04-Oct-2009 01:17:02 {'M029'} {'ON' } {0×0 char }
04-Oct-2009 01:17:04 {'M037'} {'ON' } {'R2_Personal_Hygiene' }
04-Oct-2009 01:17:04 {'M036'} {'OFF' } {0×0 char }
>>
Continuing on from there is straightforward-enough...
prova.Properties.VariableNames={'Sensor','Value','Activity'}; % set variable names
prova.State=prova.Value; % create new "State" column
prova.Value=str2double(prova.Value); % convert Value to numeric
prova.State(isfinite(prova.Value))={''}; % set State numeric-->missing
prova.State=categorical(prova.State); % convert to categorical
prova.Sensor=categorical(prova.Sensor);
prova.Activity=categorical(prova.Activity);
prova=prova(:,[1 2 4 3]); % rearrange order for convenience
Above results in:
>> prova
prova =
23×4 timetable
Time Sensor Value State Activity
____________________ ______ _____ ___________ ________________________
06-Sep-2009 04:19:31 M049 NaN ON R1_Bed_Toilet_Transition
06-Sep-2009 04:19:33 M049 NaN OFF <undefined>
08-Sep-2009 08:12:56 M025 NaN ON <undefined>
08-Sep-2009 08:12:58 M024 NaN ON <undefined>
08-Sep-2009 08:12:58 M026 NaN OFF <undefined>
08-Sep-2009 08:12:59 M021 NaN ON <undefined>
08-Sep-2009 08:13:00 M019 NaN ON R2_Meal_Preparation
08-Sep-2009 08:13:01 M021 NaN OFF <undefined>
08-Sep-2009 08:13:01 M025 NaN OFF <undefined>
08-Sep-2009 08:13:02 M018 NaN ON <undefined>
15-Sep-2009 13:57:58 P001 1619 <undefined> <undefined>
15-Sep-2009 13:57:58 M017 NaN OFF <undefined>
15-Sep-2009 13:58:00 M018 NaN OFF <undefined>
15-Sep-2009 13:58:01 P001 1633 <undefined> <undefined>
15-Sep-2009 13:58:11 M018 NaN ON <undefined>
15-Sep-2009 13:58:15 P001 1623 <undefined> <undefined>
15-Sep-2009 13:58:16 M019 NaN OFF <undefined>
16-Sep-2009 07:10:18 M047 NaN OFF <undefined>
16-Sep-2009 07:10:20 M048 NaN ON R1_Work
16-Sep-2009 07:10:20 M046 NaN OFF <undefined>
04-Oct-2009 01:17:02 M029 NaN ON <undefined>
04-Oct-2009 01:17:04 M037 NaN ON R2_Personal_Hygiene
04-Oct-2009 01:17:04 M036 NaN OFF <undefined>
>>
As noted, you'll want to find the 'State' records that also contain the Activity comment and split those off as well; it's just more of similar kind of edit operations to find "who's who in the zoo!" of containing something other than what is supposed to be in the field.
It still would be far better to get the data written correctly to begin with; but if it can't be regenerated from some other format in which it has been save, then something like the above is the option available to you.

Sign in to comment.

Categories

Asked:

on 21 Feb 2021

Commented:

dpb
on 21 Feb 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!