READ TABLE WITH DIFFERENT SETTING INSIDE

Emanuele De Astis

21 Feb 2021

3 Answers

Answer Accepted

Updated 21 Feb 2021

2 Views (30 days)

Follow Question

Show older comments

0 votes

hello everyone, i have a problem with a dataset, in the same dataset, the data in the rows are writen in different way, in particular, have different space in the table (first and second screen) and matlab, read only the data written in the first way (with a lot of space from tcoluomns), someone knows the way to read both ?

with small data i'm able to load the file in excell and save, but this is bigger for excell

0 Comments
Show -2 older comments Hide -2 older comments

Follow Question

Accepted Answer

dpb on 21 Feb 2021

Edited: dpb on 21 Feb 2021

Open in MATLAB Online

0 votes

It's a fignewton of looking at the text in the editor or on the screen--some was tab-delimited, some space but there's only one delimiter between columns.

I shortened the first (LH) attachment to only a few lines that contained samples of both...then used detectImportOptions with some guidance and seems to work just fine--

>> opt=detectImportOptions('dataprova.txt',"ExpectedNumVariables",5,'ReadVariableNames',0,"NumHeaderLines",0);
>> prova=readtable('dataprova.txt',opt);
>> prova=table2timetable(prova(:,3:end),'RowTimes',prova.Var1+prova.Var2)
prova =
  23×3 timetable
            Time              Var3        Var4                  Var5            
    ____________________    ________    ________    ____________________________
    06-Sep-2009 04:19:31    {'M049'}    {'ON'  }    {'R1_Bed_Toilet_Transition'}
    06-Sep-2009 04:19:33    {'M049'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:12:56    {'M025'}    {'ON'  }    {0×0 char                  }
    08-Sep-2009 08:12:58    {'M024'}    {'ON'  }    {0×0 char                  }
    08-Sep-2009 08:12:58    {'M026'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:12:59    {'M021'}    {'ON'  }    {0×0 char                  }
    08-Sep-2009 08:13:00    {'M019'}    {'ON'  }    {'R2_Meal_Preparation'     }
    08-Sep-2009 08:13:01    {'M021'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:13:01    {'M025'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:13:02    {'M018'}    {'ON'  }    {0×0 char                  }
    15-Sep-2009 13:57:58    {'P001'}    {'1619'}    {0×0 char                  }
    15-Sep-2009 13:57:58    {'M017'}    {'OFF' }    {0×0 char                  }
    15-Sep-2009 13:58:00    {'M018'}    {'OFF' }    {0×0 char                  }
    15-Sep-2009 13:58:01    {'P001'}    {'1633'}    {0×0 char                  }
    15-Sep-2009 13:58:11    {'M018'}    {'ON'  }    {0×0 char                  }
    15-Sep-2009 13:58:15    {'P001'}    {'1623'}    {0×0 char                  }
    15-Sep-2009 13:58:16    {'M019'}    {'OFF' }    {0×0 char                  }
    16-Sep-2009 07:10:18    {'M047'}    {'OFF' }    {0×0 char                  }
    16-Sep-2009 07:10:20    {'M048'}    {'ON'  }    {'R1_Work'                 }
    16-Sep-2009 07:10:20    {'M046'}    {'OFF' }    {0×0 char                  }
    04-Oct-2009 01:17:02    {'M029'}    {'ON'  }    {0×0 char                  }
    04-Oct-2009 01:17:04    {'M037'}    {'ON'  }    {'R2_Personal_Hygiene'     }
    04-Oct-2009 01:17:04    {'M036'}    {'OFF' }    {0×0 char                  }
>> 

I'd carry on from here by turning the Var3 (and possibly Var4 altho the mix of data there is confusing as to what will want as far as interpreting it) into categorical variable(s). Var5 might be more suited as a string than cellstr, your call. You need to provide meaningful variable names for them, of course.

You could try defining the import data format strings for the columns to read the date/time as a single datetime instead of as datetime/duration into two variables as above, an "exercise left for Student".

The file as it appeared when shortened is:

2009-09-06 04:19:31.032325	M049	ON	R1_Bed_Toilet_Transition begin
2009-09-06 04:19:33.050316	M049	OFF
2009-09-08 08:12:56.088385	M025	ON
2009-09-08 08:12:58.088943	M024	ON
2009-09-08 08:12:58.091278	M026	OFF
2009-09-08 08:12:59.098453	M021	ON
2009-09-08 08:13:00.082885	M019	ON	R2_Meal_Preparation begin
2009-09-08 08:13:01.091802	M021	OFF
2009-09-08 08:13:01.094138	M025	OFF
2009-09-08 08:13:02.035362	M018	ON
2009-09-15 13:57:58.081746	P001	1619
2009-09-15 13:57:58.090266	M017	OFF
2009-09-15 13:58:00.014018	M018	OFF
2009-09-15 13:58:01.082517 P001 1633
2009-09-15 13:58:11.050943 M018 ON
2009-09-15 13:58:15.082673 P001 1623
2009-09-15 13:58:16.066399 M019 OFF
2009-09-16 07:10:18.092304 M047 OFF
2009-09-16 07:10:20.048173 M048 ON R1_Work begin
2009-09-16 07:10:20.092411 M046 OFF
2009-10-04 01:17:02.070578	M029	ON
2009-10-04 01:17:04.040807	M037	ON R2_Personal_Hygiene begin
2009-10-04 01:17:04.043248	M036	OFF
>> 

0 Comments
Show -2 older comments Hide -2 older comments

12 Comments
Show 10 older comments Hide 10 older comments

dpb on 21 Feb 2021

Edited: dpb on 21 Feb 2021

Open in MATLAB Online

opt.VariableNames={'Date','Sensor','Value','Activity'};

is only four variable names for five input variables for starters.

Without knowing what is what, what are we supposed to do?

What is wrong with the preceding timetable that was created for a file, specifically? Need to resolve what, if anything, is wrong there first before try something new and different.

>> head(MTA)
ans =
  8×3 timetable
            Time             Sensor       Value          Activity     
    ____________________    ________    _________    _________________
    08-Sep-2009 08:47:23    {'M025'}    {'ON'   }    {'R2_Leave_Home'}
    08-Sep-2009 08:47:31    {'D001'}    {'CLOSE'}    {'R2_Leave_Home'}
    08-Sep-2009 08:57:50    {'M024'}    {'ON'   }    {'R1_Leave_Home'}
    08-Sep-2009 08:57:56    {'M024'}    {'OFF'  }    {'R1_Leave_Home'}
    08-Sep-2009 22:59:58    {'M046'}    {'ON'   }    {'R1_Sleep'     }
    08-Sep-2009 23:05:46    {'M050'}    {'ON'   }    {'R1_Sleep'     }
    08-Sep-2009 23:32:28    {'M034'}    {'ON'   }    {'R2_Sleep'     }
    08-Sep-2009 23:38:13    {'M047'}    {'ON'   }    {'R1_Sleep'     }
>> 

There's nothing numeric in the file for most of those records...there are some half-dozen or so that do have a number in the field, but only a handful in comparison.

That's in the file itself, though, not in reading the file.

>> Values=str2double(MTA.Value);
>> sum(isfinite(Values))
ans =
     4
>> MTA(isfinite(Values),:)
ans =
  4×3 timetable
            Time             Sensor      Value       Activity  
    ____________________    ________    _______    ____________
    10-Sep-2009 23:04:13    {'P001'}    {'418'}    {'R2_Sleep'}
    26-Sep-2009 09:00:24    {'P001'}    {'457'}    {'R2_Sleep'}
    27-Sep-2009 08:55:28    {'P001'}    {'394'}    {'R2_Sleep'}
    27-Sep-2009 08:56:18    {'P001'}    {'384'}    {'R1_Sleep'}
>> 

dpb on 21 Feb 2021

Well, it may be "small" in comparison to the real data set, but I don't consider 138,022 records small for we volunteers in the forum to have to wade through to try to help, sorry.

I went to the trouble to pare the file down to some 25 records or so that showed the disparity in spacing you referred to in the orginal question and showed how to read it consistently -- now you've apparently changed the ground rules of what the problem may be.

If so, provide a similar SMALL file that illustrates precisely what it is that you're after -- although from what I can gather from what you've said, the problem is inherent in the way the file was written and can't be solved trivially if you mean that the records which have a numeric value as shown above are supposed to have been written with a missing value where a sensor ID belongs and another column for whatever that numeric field represents.

The file simply wasn't written that way, from what I can tell easily with what have to work with without having to spend my time parsing out what it is that is intended to look at from the file.

"Help us help you!!!"

dpb on 21 Feb 2021

Edited: dpb on 21 Feb 2021

Open in MATLAB Online

OK, here's the proof of the problem--

>> fid=fopen('dataprova.txt','r');
>> fmt=['#tabs=%d, Record: %s \n'];
>> for i=1:12315,fgetl(fid);end                             % skip over beginning records
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end     % read next 15, analyze content
#tabs=2, Record: 2009-09-09 01:47:16.062345	M046	ON R1_Sleep begin 
#tabs=2, Record: 2009-09-09 01:47:17.073655	M047	ON
#tabs=2, Record: 2009-09-09 01:47:19.086398	M047	OFF
#tabs=2, Record: 2009-09-09 01:47:23.007103	M046	OFF 
#tabs=2, Record: 2009-09-09 01:48:14.008333	M047	ON 
#tabs=2, Record: 2009-09-09 01:48:18.052422	M047	OFF
#tabs=2, Record: 2009-09-09 01:48:36.043365	M047	ON 
#tabs=2, Record: 2009-09-09 01:48:37.073454	M047	OFF 
#tabs=2, Record: 2009-09-09 01:48:41.088045	M047	ON 
#tabs=2, Record: 2009-09-09 01:48:44.008428	M047	OFF 
#tabs=2, Record: 2009-09-09 01:48:49.077298	P001	1001.6 
#tabs=2, Record: 2009-09-09 01:48:49.082215	P001	370 
#tabs=2, Record: 2009-09-09 01:48:54.053609	M047	ON 
#tabs=2, Record: 2009-09-09 01:49:00.060148	M047	OFF 
#tabs=2, Record: 2009-09-09 01:51:11.006981	T004	24.5 
>> fid=fclose(fid);
>> 

The above shows the records such as these have only two tabs regardless what the data type of the columns are; there's no way a priori on reading the file to put those two types of data into a different variable.

This isn't the fault of reading the data, it's that the data file is poorly constructed if it is intended/needed to separate those two out.

Of course, I showed you can easily-enough find those records--string2double() will return NaN for those that aren't valid numerals.

Expecting anything more is going to have to happen by either fixing the input file format or writing code to do whatever it is that is wanted to be done with the malformed files you have.

dpb on 21 Feb 2021

Edited: dpb on 21 Feb 2021

Open in MATLAB Online

ADDUNDUM:

OBTW, it's even worse--I had seen the following section first:

>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end
#tabs=3, Record: 2009-09-06 04:19:31.032325	M049	ON	R1_Bed_Toilet_Transition begin, 
#tabs=2, Record: 2009-09-06 04:19:33.050316	M049	OFF, 
#tabs=2, Record: 2009-09-08 08:12:56.088385	M025	ON, 
#tabs=2, Record: 2009-09-08 08:12:58.088943	M024	ON, 
#tabs=2, Record: 2009-09-08 08:12:58.091278	M026	OFF, 
#tabs=2, Record: 2009-09-08 08:12:59.098453	M021	ON, 
#tabs=3, Record: 2009-09-08 08:13:00.082885	M019	ON	R2_Meal_Preparation begin, 
...

in which you'll note there are 3 tabs for the fields with remarks after the action(?) column. Those are parsed correctly as noted above.

However, when I did another section that has more numbers, then we see:

>> for i=1:12315,fgetl(fid);end
>> for i=1:15,l=fgetl(fid);fprintf(fmt,sum(l==9),l);end
#tabs=2, Record: 2009-09-09 01:47:16.062345	M046	ON R1_Sleep begin, 
#tabs=2, Record: 2009-09-09 01:47:17.073655	M047	ON, 

and the first output record there is missing the other delimiter as well. You'll have to also find records with longer string there and split it out.

(NB: The trailing comma above is a figment of the format string used--it had a spurious trailing comma in it; it's not real)

Emanuele De Astis on 21 Feb 2021

ok but how can i use this results?

dpb on 21 Feb 2021

Open in MATLAB Online

I showed that

>> opt=detectImportOptions('dataprova.txt',"ExpectedNumVariables",5,'ReadVariableNames',0,"NumHeaderLines",0);
>> prova=readtable('dataprova.txt',opt);
>> prova=table2timetable(prova(:,3:end),'RowTimes',prova.Var1+prova.Var2)
prova =
  23×3 timetable
            Time              Var3        Var4                  Var5            
    ____________________    ________    ________    ____________________________
    06-Sep-2009 04:19:31    {'M049'}    {'ON'  }    {'R1_Bed_Toilet_Transition'}
    06-Sep-2009 04:19:33    {'M049'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:12:56    {'M025'}    {'ON'  }    {0×0 char                  }
    08-Sep-2009 08:12:58    {'M024'}    {'ON'  }    {0×0 char                  }
    08-Sep-2009 08:12:58    {'M026'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:12:59    {'M021'}    {'ON'  }    {0×0 char                  }
    08-Sep-2009 08:13:00    {'M019'}    {'ON'  }    {'R2_Meal_Preparation'     }
    08-Sep-2009 08:13:01    {'M021'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:13:01    {'M025'}    {'OFF' }    {0×0 char                  }
    08-Sep-2009 08:13:02    {'M018'}    {'ON'  }    {0×0 char                  }
    15-Sep-2009 13:57:58    {'P001'}    {'1619'}    {0×0 char                  }
    15-Sep-2009 13:57:58    {'M017'}    {'OFF' }    {0×0 char                  }
    15-Sep-2009 13:58:00    {'M018'}    {'OFF' }    {0×0 char                  }
    15-Sep-2009 13:58:01    {'P001'}    {'1633'}    {0×0 char                  }
    15-Sep-2009 13:58:11    {'M018'}    {'ON'  }    {0×0 char                  }
    15-Sep-2009 13:58:15    {'P001'}    {'1623'}    {0×0 char                  }
    15-Sep-2009 13:58:16    {'M019'}    {'OFF' }    {0×0 char                  }
    16-Sep-2009 07:10:18    {'M047'}    {'OFF' }    {0×0 char                  }
    16-Sep-2009 07:10:20    {'M048'}    {'ON'  }    {'R1_Work'                 }
    16-Sep-2009 07:10:20    {'M046'}    {'OFF' }    {0×0 char                  }
    04-Oct-2009 01:17:02    {'M029'}    {'ON'  }    {0×0 char                  }
    04-Oct-2009 01:17:04    {'M037'}    {'ON'  }    {'R2_Personal_Hygiene'     }
    04-Oct-2009 01:17:04    {'M036'}    {'OFF' }    {0×0 char                  }
>> 

Continuing on from there is straightforward-enough...

prova.Properties.VariableNames={'Sensor','Value','Activity'};   % set variable names
prova.State=prova.Value;                                        % create new "State" column
prova.Value=str2double(prova.Value);                            % convert Value to numeric
prova.State(isfinite(prova.Value))={''};                        % set State numeric-->missing
prova.State=categorical(prova.State);                           % convert to categorical
prova.Sensor=categorical(prova.Sensor);
prova.Activity=categorical(prova.Activity);
prova=prova(:,[1 2 4 3]);                        % rearrange order for convenience

Above results in:

>> prova
prova =
  23×4 timetable
            Time            Sensor    Value       State               Activity        
    ____________________    ______    _____    ___________    ________________________
    06-Sep-2009 04:19:31     M049      NaN     ON             R1_Bed_Toilet_Transition
    06-Sep-2009 04:19:33     M049      NaN     OFF            <undefined>             
    08-Sep-2009 08:12:56     M025      NaN     ON             <undefined>             
    08-Sep-2009 08:12:58     M024      NaN     ON             <undefined>             
    08-Sep-2009 08:12:58     M026      NaN     OFF            <undefined>             
    08-Sep-2009 08:12:59     M021      NaN     ON             <undefined>             
    08-Sep-2009 08:13:00     M019      NaN     ON             R2_Meal_Preparation     
    08-Sep-2009 08:13:01     M021      NaN     OFF            <undefined>             
    08-Sep-2009 08:13:01     M025      NaN     OFF            <undefined>             
    08-Sep-2009 08:13:02     M018      NaN     ON             <undefined>             
    15-Sep-2009 13:57:58     P001     1619     <undefined>    <undefined>             
    15-Sep-2009 13:57:58     M017      NaN     OFF            <undefined>             
    15-Sep-2009 13:58:00     M018      NaN     OFF            <undefined>             
    15-Sep-2009 13:58:01     P001     1633     <undefined>    <undefined>             
    15-Sep-2009 13:58:11     M018      NaN     ON             <undefined>             
    15-Sep-2009 13:58:15     P001     1623     <undefined>    <undefined>             
    15-Sep-2009 13:58:16     M019      NaN     OFF            <undefined>             
    16-Sep-2009 07:10:18     M047      NaN     OFF            <undefined>             
    16-Sep-2009 07:10:20     M048      NaN     ON             R1_Work                 
    16-Sep-2009 07:10:20     M046      NaN     OFF            <undefined>             
    04-Oct-2009 01:17:02     M029      NaN     ON             <undefined>             
    04-Oct-2009 01:17:04     M037      NaN     ON             R2_Personal_Hygiene     
    04-Oct-2009 01:17:04     M036      NaN     OFF            <undefined>             
>> 

As noted, you'll want to find the 'State' records that also contain the Activity comment and split those off as well; it's just more of similar kind of edit operations to find "who's who in the zoo!" of containing something other than what is supposed to be in the field.

It still would be far better to get the data written correctly to begin with; but if it can't be regenerated from some other format in which it has been save, then something like the above is the option available to you.

Emanuele De Astis on 21 Feb 2021

0 votes

0 Comments
Show -2 older comments Hide -2 older comments

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

READ TABLE WITH DIFFERENT SETTING INSIDE

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

0 Comments
Show -2 older comments Hide -2 older comments

More Answers (2)

12 Comments
Show 10 older comments Hide 10 older comments

0 Comments
Show -2 older comments Hide -2 older comments

Categories

Tags

Community Treasure Hunt

READ TABLE WITH DIFFERENT SETTING INSIDE

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

0 Comments Show -2 older comments Hide -2 older comments

More Answers (2)

12 Comments Show 10 older comments Hide 10 older comments

0 Comments Show -2 older comments Hide -2 older comments

Categories

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

0 Comments
Show -2 older comments Hide -2 older comments

12 Comments
Show 10 older comments Hide 10 older comments

0 Comments
Show -2 older comments Hide -2 older comments