Documentation

This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English verison of the page.

Note: This page has been translated by MathWorks. Please click here
To view all translated materals including this page, select Japan from the country navigator on the bottom of this page.

Grouped Statistics Calculations with Tall Arrays

This example shows how to use the findgroups and splitapply functions to calculate grouped statistics of a tall timetable containing power outage data. findgroups and splitapply enable you to break up tall variables into groups, use those groups to separate data, and then apply a function to each group of data. Alternatively, if you have Statistics and Machine Learning Toolbox™, then you also can use the grpstats function to calculate grouped statistics.

This example creates a tall timetable for the power outage data, even though the raw data only has about 1500 rows. However, you can use the techniques presented here on much larger data sets because no assumptions are made about the size of the data.

Create Datastore and Tall Timetable

The sample file, outages.csv, contains data representing electric utility outages in the United States. The file contains six columns: Region, OutageTime, Loss, Customers, RestorationTime, and Cause.

Create a datastore for the outages.csv file. Use the 'TextScanFormats' option to specify the kind of data each column contains: categorical ('%C'), floating-point numeric ('%f'), or datetime ('%D').

data_formats = {'%C','%D','%f','%f','%D','%C'};
ds = datastore('outages.csv','TextscanFormats',data_formats);

Create a tall table on top of the datastore, and convert the tall table into a tall timetable. The OutageTime variable is used for the row times since it is the first datetime or duration variable in the table.

T = tall(ds);
T = table2timetable(T)
T =

  Mx5 tall timetable

       OutageTime        Region       Loss     Customers     RestorationTime          Cause     
    ________________    _________    ______    __________    ________________    _______________

    2002-02-01 12:18    SouthWest    458.98    1.8202e+06    2002-02-07 16:50    winter storm   
    2003-01-23 00:49    SouthEast    530.14    2.1204e+05    NaT                 winter storm   
    2003-02-07 21:15    SouthEast     289.4    1.4294e+05    2003-02-17 08:14    winter storm   
    2004-04-06 05:44    West         434.81    3.4037e+05    2004-04-06 06:10    equipment fault
    2002-03-16 06:18    MidWest      186.44    2.1275e+05    2002-03-18 23:23    severe storm   
    2003-06-18 02:49    West              0             0    2003-06-18 10:54    attack         
    2004-06-20 14:39    West         231.29           NaN    2004-06-20 19:16    equipment fault
    2002-06-06 19:28    West         311.86           NaN    2002-06-07 00:51    equipment fault
    :                   :            :         :             :                   :
    :                   :            :         :             :                   :

Clean Missing Data

Some of the rows in the tall table have missing data represented by NaN and NaT values. Remove all of the rows that are missing at least one piece of data.

idx = ~any(ismissing(T),2);
T = T(idx,:)
T =

  Mx5 tall timetable

       OutageTime        Region       Loss     Customers     RestorationTime          Cause     
    ________________    _________    ______    __________    ________________    _______________

    2002-02-01 12:18    SouthWest    458.98    1.8202e+06    2002-02-07 16:50    winter storm   
    2003-02-07 21:15    SouthEast     289.4    1.4294e+05    2003-02-17 08:14    winter storm   
    2004-04-06 05:44    West         434.81    3.4037e+05    2004-04-06 06:10    equipment fault
    2002-03-16 06:18    MidWest      186.44    2.1275e+05    2002-03-18 23:23    severe storm   
    2003-06-18 02:49    West              0             0    2003-06-18 10:54    attack         
    2003-07-16 16:23    NorthEast    239.93         49434    2003-07-17 01:12    fire           
    2004-09-27 11:09    MidWest      286.72         66104    2004-09-27 16:37    equipment fault
    2004-09-05 17:48    SouthEast    73.387         36073    2004-09-05 20:46    equipment fault
    :                   :            :         :             :                   :
    :                   :            :         :             :                   :

Mean Power Outage Duration by Region

Determine the mean power outage duration in each region. The findgroups function groups the data by the categorical values in Region. The splitapply function applies the specified function to each group of data and concatenates the results together.

G = findgroups(T.Region);
times = gather(splitapply(@mean,T.RestorationTime-T.OutageTime,G))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0 sec
- Pass 2 of 2: Completed in 0 sec
Evaluation completed in 2 sec
times = 5x1 duration array
   1254:11:20
     44:29:29
     44:02:22
     48:30:36
     23:58:28

Change the display format of the duration results to be in days, and put the results in a table with the associated regions.

times.Format = 'd';
regions = gather(categories(T.Region));
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0 sec
Evaluation completed in 0 sec
varnames = {'Regions','MeanOutageDuration'};
maxOutageDurations = table(regions,times,'VariableNames',varnames)
maxOutageDurations=5x2 table
      Regions      MeanOutageDuration
    ___________    __________________

    'MidWest'       52.258 days      
    'NorthEast'     1.8538 days      
    'SouthEast'      1.835 days      
    'SouthWest'     2.0212 days      
    'West'         0.99895 days      

Most Common Power Outage Causes by Region

Determine how often each power outage cause occurs in each region. First, group the data by both cause and region. Then use splitapply to create a cell array containing the number of occurrences of each cause in each region.

G2 = findgroups(T.Cause,T.Region);
C = splitapply(@(r,c) {size(r,1),r(1),c(1)},T.Region,T.Cause,G2);
C = gather(C)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0 sec
- Pass 2 of 2: Completed in 1 sec
Evaluation completed in 4 sec
C = 43x3 cell array
    {[ 4]}    {[MidWest  ]}    {[attack          ]}
    {[75]}    {[NorthEast]}    {[attack          ]}
    {[ 6]}    {[SouthEast]}    {[attack          ]}
    {[44]}    {[West     ]}    {[attack          ]}
    {[ 1]}    {[NorthEast]}    {[earthquake      ]}
    {[ 1]}    {[West     ]}    {[earthquake      ]}
    {[11]}    {[MidWest  ]}    {[energy emergency]}
    {[11]}    {[NorthEast]}    {[energy emergency]}
    {[39]}    {[SouthEast]}    {[energy emergency]}
    {[ 5]}    {[SouthWest]}    {[energy emergency]}
    {[19]}    {[West     ]}    {[energy emergency]}
    {[ 6]}    {[MidWest  ]}    {[equipment fault ]}
    {[13]}    {[NorthEast]}    {[equipment fault ]}
    {[28]}    {[SouthEast]}    {[equipment fault ]}
    {[ 1]}    {[SouthWest]}    {[equipment fault ]}
    {[50]}    {[West     ]}    {[equipment fault ]}
    {[ 4]}    {[NorthEast]}    {[fire            ]}
    {[ 2]}    {[SouthEast]}    {[fire            ]}
    {[10]}    {[West     ]}    {[fire            ]}
    {[17]}    {[MidWest  ]}    {[severe storm    ]}
    {[54]}    {[NorthEast]}    {[severe storm    ]}
    {[86]}    {[SouthEast]}    {[severe storm    ]}
    {[ 4]}    {[SouthWest]}    {[severe storm    ]}
    {[13]}    {[West     ]}    {[severe storm    ]}
    {[22]}    {[MidWest  ]}    {[thunder storm   ]}
    {[37]}    {[NorthEast]}    {[thunder storm   ]}
    {[39]}    {[SouthEast]}    {[thunder storm   ]}
    {[ 6]}    {[SouthWest]}    {[thunder storm   ]}
    {[ 4]}    {[West     ]}    {[thunder storm   ]}
    {[ 4]}    {[MidWest  ]}    {[unknown         ]}
    {[ 4]}    {[NorthEast]}    {[unknown         ]}
    {[ 2]}    {[SouthEast]}    {[unknown         ]}
    {[ 1]}    {[West     ]}    {[unknown         ]}
    {[12]}    {[MidWest  ]}    {[wind            ]}
    {[19]}    {[NorthEast]}    {[wind            ]}
    {[11]}    {[SouthEast]}    {[wind            ]}
    {[ 3]}    {[SouthWest]}    {[wind            ]}
    {[15]}    {[West     ]}    {[wind            ]}
    {[ 9]}    {[MidWest  ]}    {[winter storm    ]}
    {[30]}    {[NorthEast]}    {[winter storm    ]}
    {[23]}    {[SouthEast]}    {[winter storm    ]}
    {[ 1]}    {[SouthWest]}    {[winter storm    ]}
    {[17]}    {[West     ]}    {[winter storm    ]}

Convert the cell array into a table and unstack the 'Count' and 'Region' variables. Use fillmissing on the in-memory table to replace NaN values with zeros.

tmp = cell2table(C, 'VariableNames', {'Count', 'Region', 'Cause'});
RegionCauses = unstack(tmp, 'Count', 'Region');
RegionCauses = fillmissing(RegionCauses,'constant',{'',0,0,0,0,0})
RegionCauses=10x6 table
         Cause          MidWest    NorthEast    SouthEast    SouthWest    West
    ________________    _______    _________    _________    _________    ____

    attack               4         75            6           0            44  
    earthquake           0          1            0           0             1  
    energy emergency    11         11           39           5            19  
    equipment fault      6         13           28           1            50  
    fire                 0          4            2           0            10  
    severe storm        17         54           86           4            13  
    thunder storm       22         37           39           6             4  
    unknown              4          4            2           0             1  
    wind                12         19           11           3            15  
    winter storm         9         30           23           1            17  

See Also

| |

Related Topics

Was this topic helpful?