Main Content

Perform Calculations by Group in Table

You can perform calculations on groups of data within table variables by using these functions:

  • groupsummary, groupcounts, groupfilter, and grouptransform

  • varfun and rowfun

  • findgroups and splitapply

In most cases, groupsummary is the recommended function for grouped calculations. It is simple to use and returns a table with labels that describe results. The other listed functions, however, also offer capabilities that can be useful in some situations.

Create Table from File

The sample spreadsheet outages.csv contains data values that represent electric utility power outages in the United States. To create a table from the file, use the readtable function. To read text data from the file into table variables that are string arrays, specify the TextType name-value argument as "string".

outages = readtable("outages.csv","TextType","string")
outages=1468×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"
    "SouthEast"    05-Sep-2004 17:48:00    73.387         36073    05-Sep-2004 20:46:00    "equipment fault"
    "West"         21-May-2004 21:45:00    159.99           NaN    22-May-2004 04:23:00    "equipment fault"
    "SouthEast"    01-Sep-2002 18:22:00    95.917         36759    01-Sep-2002 19:12:00    "severe storm"   
    "SouthEast"    27-Sep-2003 07:32:00       NaN    3.5517e+05    04-Oct-2003 07:02:00    "severe storm"   
    "West"         12-Nov-2003 06:12:00    254.09    9.2429e+05    17-Nov-2003 02:04:00    "winter storm"   
    "NorthEast"    18-Sep-2004 05:54:00         0             0                     NaT    "equipment fault"
      ⋮

Create categorical Variables for Grouped Calculations

Table variables can have any data type. But conceptually, you can also think of tables as having two general kinds of variables: data variables and grouping variables.

  • Data variables enable you to describe individual events or observations. For example, in outages you can think of the OutageTime, Loss, Customers, and RestorationTime variables as data variables.

  • Grouping variables enable you to group together events or observations that have something in common. For example, in outages you can think of the Region and Cause variables as grouping variables. You can group together and analyze the power outages that occur in the same region or share the same cause.

Often, grouping variables contain a discrete set of fixed values that specify categories. The categories specify groups that data values can belong to. The categorical data type can be a convenient type for working with categories.

To convert Region and Cause to categorical variables, use the convertvars function.

outages = convertvars(outages,["Region","Cause"],"categorical")
outages=1468×6 table
     Region           OutageTime          Loss     Customers       RestorationTime            Cause     
    _________    ____________________    ______    __________    ____________________    _______________

    SouthWest    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    winter storm   
    SouthEast    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    winter storm   
    SouthEast    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    winter storm   
    West         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    equipment fault
    MidWest      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    severe storm   
    West         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    attack         
    West         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    equipment fault
    West         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    equipment fault
    NorthEast    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    fire           
    MidWest      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    equipment fault
    SouthEast    05-Sep-2004 17:48:00    73.387         36073    05-Sep-2004 20:46:00    equipment fault
    West         21-May-2004 21:45:00    159.99           NaN    22-May-2004 04:23:00    equipment fault
    SouthEast    01-Sep-2002 18:22:00    95.917         36759    01-Sep-2002 19:12:00    severe storm   
    SouthEast    27-Sep-2003 07:32:00       NaN    3.5517e+05    04-Oct-2003 07:02:00    severe storm   
    West         12-Nov-2003 06:12:00    254.09    9.2429e+05    17-Nov-2003 02:04:00    winter storm   
    NorthEast    18-Sep-2004 05:54:00         0             0                     NaT    equipment fault
      ⋮

Calculate Statistics by Group in Table

You can calculate statistics by group in a table using functions such as groupsummary, varfun, and splitapply. These functions enable you to specify groups of data within a table and methods that perform calculations on each group. You can store the results in another table or in output arrays.

For example, determine the mean power loss and customers affected due to the outages in each region in the outages table. The recommended way to perform this calculation is to use the groupsummary function. Specify Region as the grouping variable, mean as the method to apply to each group, and Loss and Customers as the data variables. The output lists the regions (in the Region variable), the number of power outages per region (in the GroupCount variable), and the mean power loss and customers affected in each region (in the mean_Loss and mean_Customers variables, respectively).

meanLossByRegion = groupsummary(outages,"Region","mean",["Loss","Customers"])
meanLossByRegion=5×4 table
     Region      GroupCount    mean_Loss    mean_Customers
    _________    __________    _________    ______________

    MidWest         142         1137.7        2.4015e+05  
    NorthEast       557         551.65        1.4917e+05  
    SouthEast       389         495.35        1.6776e+05  
    SouthWest        26         493.88        2.6975e+05  
    West            354         433.37        1.5201e+05  

The groupsummary function is recommended for several reasons:

  • You can specify many common methods (such as max, min, and mean) by name, without using function handles.

  • You can specify multiple methods in one call.

  • NaNs, NaTs, and other missing values in the data variables are automatically omitted when calculating results.

The third point explains why the mean_Loss and mean_Customers variables do not have NaNs in the meanLossByRegion output table.

To specify multiple methods in one call to groupsummary, list them in an array. For example, calculate the maximum, mean, and minimum power loss by region.

lossStatsByRegion = groupsummary(outages,"Region",["max","mean","min"],"Loss")
lossStatsByRegion=5×5 table
     Region      GroupCount    max_Loss    mean_Loss    min_Loss
    _________    __________    ________    _________    ________

    MidWest         142          23141      1137.7         0    
    NorthEast       557          23418      551.65         0    
    SouthEast       389         8767.3      495.35         0    
    SouthWest        26           2796      493.88         0    
    West            354          16659      433.37         0    

The minimum loss in every region is zero. To analyze only those outages that resulted in losses greater than zero, exclude the rows in outages where the loss is zero. First create a vector of logical indices whose values are logical 1 (true) for rows where outages.Loss is greater than zero. Then index into outages to return a table that includes only those rows. Again, calculate the maximum, mean, and minimum power loss by region.

nonZeroLossIndices = outages.Loss > 0;
nonZeroLossOutages = outages(nonZeroLossIndices,:);
nonZeroLossStats = groupsummary(nonZeroLossOutages,"Region",["max","mean","min"],"Loss")
nonZeroLossStats=5×5 table
     Region      GroupCount    max_Loss    mean_Loss    min_Loss
    _________    __________    ________    _________    ________

    MidWest          81          23141      1264.1       8.9214 
    NorthEast       180          23418      827.47      0.74042 
    SouthEast       234         8767.3      546.16       2.3096 
    SouthWest        23           2796      515.35       27.882 
    West            175          16659      549.76      0.71847 

Use Alternative Functions for Grouped Calculations

There are alternative functions that perform grouped calculations in tables. While groupsummary is recommended, the alternative functions are also useful in some situations.

  • The varfun function performs calculations on variables. It is similar to groupsummary, but varfun can perform both grouped and ungrouped calculations.

  • The rowfun function performs calculations along rows. You can specify methods that take multiple inputs or that return multiple outputs.

  • The findgroups and splitapply functions can perform calculations on variables or along rows. You can specify methods that take multiple inputs or that return multiple outputs. The outputs of splitapply are arrays, not tables.

Call varfun on Variables

For example, calculate the maximum power loss by region using varfun. The output table has a similar format to the output of groupsummary.

maxLossByVarfun = varfun(@max, ...
                         outages, ...
                         "InputVariables","Loss", ...
                         "GroupingVariables","Region")
maxLossByVarfun=5×3 table
     Region      GroupCount    max_Loss
    _________    __________    ________

    MidWest         142          23141 
    NorthEast       557          23418 
    SouthEast       389         8767.3 
    SouthWest        26           2796 
    West            354          16659 

However, there are significant differences when you use varfun:

  • You must always specify the method by using a function handle.

  • You can specify only one method.

  • You can perform grouped or ungrouped calculations.

  • NaNs, NaTs, and other missing values in the data variables are automatically included when calculating results.

The last point is a significant difference in behavior between groupsummary and varfun. For example, the Loss variable has NaNs. If you use varfun to calculate the mean losses, then by default the results are NaNs, unlike the default groupsummary results.

meanLossByVarfun = varfun(@mean, ...
                          outages, ...
                          "InputVariables","Loss", ...
                          "GroupingVariables","Region")
meanLossByVarfun=5×3 table
     Region      GroupCount    mean_Loss
    _________    __________    _________

    MidWest         142           NaN   
    NorthEast       557           NaN   
    SouthEast       389           NaN   
    SouthWest        26           NaN   
    West            354           NaN   

To omit missing values when using varfun, wrap the method in an anonymous function so that you can specify the "omitnan" option.

omitnanMean = @(x)(mean(x,"omitnan"));

meanLossOmitNaNs = varfun(omitnanMean, ...
                          outages, ...
                          "InputVariables","Loss", ...
                          "GroupingVariables","Region")
meanLossOmitNaNs=5×3 table
     Region      GroupCount    Fun_Loss
    _________    __________    ________

    MidWest         142         1137.7 
    NorthEast       557         551.65 
    SouthEast       389         495.35 
    SouthWest        26         493.88 
    West            354         433.37 

Another point refers to a different but related use case, which is to perform ungrouped calculations on table variables. To apply a method to all table variables without grouping, use varfun. For example, calculate the maximum power loss and the maximum number of customers affected in the entire table.

maxValuesInOutages = varfun(@max, ...
                            outages, ...
                            "InputVariables",["Loss","Customers"])
maxValuesInOutages=1×2 table
    max_Loss    max_Customers
    ________    _____________

     23418       5.9689e+06  

Call rowfun on Rows

The rowfun function applies a method along the rows of a table. Where varfun applies a method to each specified variable, one by one, rowfun takes all specified table variables as input arguments to the method and applies the method once.

For example, calculate the median loss per customer in each region. To perform this calculation, first specify a function that takes two input arguments (loss and customers), divides the loss by the number of customers, and then returns the median.

medianLossCustFcn = @(loss,customers)(median(loss ./ customers,"omitnan"));

Then call rowfun. You can specify a meaningful output variable name by using the OutputVariablesNames name-value argument.

meanLossPerCustomer = rowfun(medianLossCustFcn, ...
                             outages, ...
                             "InputVariables",["Loss","Customers"], ...
                             "GroupingVariables","Region", ...
                             "OutputVariableNames","MedianLossPerCustomer")
meanLossPerCustomer=5×3 table
     Region      GroupCount    MedianLossPerCustomer
    _________    __________    _____________________

    MidWest         142              0.0042139      
    NorthEast       557              0.0028512      
    SouthEast       389              0.0032057      
    SouthWest        26              0.0026353      
    West            354               0.002527      

You can also use rowfun when the method returns multiple outputs. For example, use bounds to calculate the minimum and maximum loss per region in one call to rowfun. The bounds function returns two output arguments.

boundsLossPerRegion = rowfun(@bounds, ...
                             outages, ...
                             "InputVariables","Loss", ...
                             "GroupingVariables","Region", ...
                             "OutputVariableNames",["MinLoss","MaxLoss"])
boundsLossPerRegion=5×4 table
     Region      GroupCount    MinLoss    MaxLoss
    _________    __________    _______    _______

    MidWest         142           0        23141 
    NorthEast       557           0        23418 
    SouthEast       389           0       8767.3 
    SouthWest        26           0         2796 
    West            354           0        16659 

Call findgroups and splitapply on Variables or Rows

You can use the findgroups function to define groups and then use splitapply to apply a method to each group. The findgroups function returns a vector of group numbers that identifies which group a row of data is part of. The splitapply function returns a numeric array of the outputs of the method applied to the groups.

For example, calculate the maximum power loss by region using findgroups and splitapply.

G = findgroups(outages.Region)
G = 1468×1

     4
     3
     3
     5
     1
     5
     5
     5
     2
     1
      ⋮

maxLossArray = splitapply(@max,outages.Loss,G)
maxLossArray = 5×1
104 ×

    2.3141
    2.3418
    0.8767
    0.2796
    1.6659

Like rowfun, splitapply enables you to specify methods that return multiple outputs. Calculate both minima and maxima by using bounds.

[minLossArray,maxLossArray] = splitapply(@bounds,outages.Loss,G)
minLossArray = 5×1

     0
     0
     0
     0
     0

maxLossArray = 5×1
104 ×

    2.3141
    2.3418
    0.8767
    0.2796
    1.6659

You can also specify methods that take multiple inputs. For example, use the medianLossCustFcn function again to calculate the median loss per customer. But this time, return the median loss per customer in each region as an array.

medianLossCustFcn = @(loss,customers)(median(loss ./ customers,"omitnan"));

medianLossArray = splitapply(medianLossCustFcn,outages.Loss,outages.Customers,G)
medianLossArray = 5×1

    0.0042
    0.0029
    0.0032
    0.0026
    0.0025

The numeric outputs of findgroups and splitapply are not annotated like the output of groupsummary. However, returning numeric outputs can have other benefits:

  • You can use the output of findgroups in multiple calls to splitapply. You might want to use findgroups and splitapply for efficiency when you make many grouped calculations on a large table.

  • You can create a results table with a different format by building it from the outputs of findgroups and splitapply.

  • You can call methods that return multiple outputs.

  • You can append the outputs of splitapply to an existing table.

Append New Calculation to Existing Table

If you already have a table of results, you can append the results of another calculation to that table. For example, calculate the mean duration of power outages in each region in hours. Append the mean durations as a new variable to the lossStatsByRegion table.

First subtract the outage times from the restoration times to return the durations of the power outages. Convert these durations to hours by using the hours function.

D = outages.RestorationTime - outages.OutageTime;
H = hours(D)
H = 1468×1
105 ×

    0.0015
       NaN
    0.0023
    0.0000
    0.0007
    0.0001
    0.0000
    0.0001
    0.0001
    0.0001
      ⋮

Next use mean to calculate the mean durations. The outage durations have some NaN values because the outage and restoration times have some missing values. As before, wrap the method in an anonymous function to specify the "omitnan" option.

omitnanMean = @(x)(mean(x,"omitnan"));

Calculate the mean duration of power outages by region. Append it to lossStatsByRegion as a new table variable.

G = findgroups(outages.Region);
lossStatsByRegion.mean_Outage = splitapply(omitnanMean,H,G)
lossStatsByRegion=5×6 table
     Region      GroupCount    max_Loss    mean_Loss    min_Loss    mean_Outage
    _________    __________    ________    _________    ________    ___________

    MidWest         142          23141      1137.7         0          819.25   
    NorthEast       557          23418      551.65         0          581.04   
    SouthEast       389         8767.3      495.35         0           40.83   
    SouthWest        26           2796      493.88         0          59.519   
    West            354          16659      433.37         0          673.45   

Specify Groups as Bins

There is another way to specify groups. Instead of specifying categories as unique values in a grouping variable, you can bin values in a variable where values are distributed continuously. Then you can use those bins to specify groups.

For example, bin the power outages by year. To count the number of power outages per year, use the groupcounts function.

outagesByYear = groupcounts(outages,"OutageTime","year")
outagesByYear=13×3 table
    year_OutageTime    GroupCount    Percent
    _______________    __________    _______

         2002              36         2.4523
         2003              62         4.2234
         2004              79         5.3815
         2005              74         5.0409
         2006             108         7.3569
         2007              91         6.1989
         2008             115         7.8338
         2009             142          9.673
         2010             177         12.057
         2011             190         12.943
         2012             207         14.101
         2013             186          12.67
         2014               1        0.06812

Visualize the number of outages per year. The number per year increases over time in this data set.

bar(outagesByYear.year_OutageTime,outagesByYear.GroupCount)

Figure contains an axes object. The axes object contains an object of type bar.

You can use groupsummary with bins as groups. For example, calculate the median values for customers affected and power losses by year.

medianLossesByYear = groupsummary(outages,"OutageTime","year","median",["Customers","Loss"])
medianLossesByYear=13×4 table
    year_OutageTime    GroupCount    median_Customers    median_Loss
    _______________    __________    ________________    ___________

         2002              36           1.7101e+05         277.02   
         2003              62           1.0204e+05          295.6   
         2004              79           1.0108e+05         252.44   
         2005              74                91536         265.16   
         2006             108                86020         210.08   
         2007              91           1.0529e+05         232.12   
         2008             115                86356         205.77   
         2009             142                63119         83.491   
         2010             177                66212         155.76   
         2011             190                48200         75.286   
         2012             207                66994         78.289   
         2013             186                55669         69.596   
         2014               1                  NaN            NaN   

Visualize the median number of customers affected by outages per year. Although the number of outages increased over time, the median number of affected customers decreased.

plot(medianLossesByYear,"year_OutageTime","median_Customers")

Figure contains an axes object. The axes object with xlabel year _ OutageTime, ylabel median _ Customers contains an object of type line.

Return the rows of outages for years with more than 75 outages. To index into outages by those years, use the groupfilter function. To find the bins with more than 75 rows, specify an anonymous function that returns a logical 1 if the number of rows in a bin is greater than 75.

outages75 = groupfilter(outages,"OutageTime","year",@(x) numel(x) > 75)
outages75=1295×7 table
     Region           OutageTime          Loss     Customers       RestorationTime            Cause         year_OutageTime
    _________    ____________________    ______    __________    ____________________    _______________    _______________

    West         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    equipment fault         2004      
    West         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    equipment fault         2004      
    MidWest      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    equipment fault         2004      
    SouthEast    05-Sep-2004 17:48:00    73.387         36073    05-Sep-2004 20:46:00    equipment fault         2004      
    West         21-May-2004 21:45:00    159.99           NaN    22-May-2004 04:23:00    equipment fault         2004      
    NorthEast    18-Sep-2004 05:54:00         0             0                     NaT    equipment fault         2004      
    NorthEast    13-Nov-2004 10:42:00       NaN    1.4227e+05    19-Nov-2004 02:31:00    winter storm            2004      
    SouthEast    06-Dec-2004 23:18:00       NaN         37136    14-Dec-2004 03:21:00    winter storm            2004      
    West         21-Dec-2004 18:50:00    112.05     7.985e+05    29-Dec-2004 03:46:00    winter storm            2004      
    NorthEast    26-Dec-2004 22:18:00    255.45    1.0444e+05    27-Dec-2004 14:11:00    winter storm            2004      
    SouthWest    06-Jun-2004 05:27:00    559.41      2.19e+05    06-Jun-2004 05:55:00    equipment fault         2004      
    MidWest      02-Jul-2004 09:16:00     15128    2.0104e+05    06-Jul-2004 14:11:00    thunder storm           2004      
    SouthWest    18-Jul-2004 14:40:00    340.35    1.4963e+05    26-Jul-2004 23:34:00    severe storm            2004      
    NorthEast    16-Sep-2004 19:42:00      4718           NaN                     NaT    unknown                 2004      
    SouthEast    20-Sep-2004 12:37:00    8767.3    2.2249e+06    02-Oct-2004 06:00:00    severe storm            2004      
    MidWest      09-Nov-2004 18:44:00    470.83         67587    09-Nov-2004 21:24:00    wind                    2004      
      ⋮

Summary of Behavior and Recommendations

Use these tips and recommendations to decide which functions to use to perform group calculations.

  • Specify groups using either grouping variables or bins created from numeric, datetime, or duration variables.

  • To perform calculations by group on data in tables or timetables, use the recommended function groupsummary. The related functions groupcounts, groupfilter, and grouptransform also are useful.

  • Consider using varfun to automatically include missing values (such as NaNs and NaTs) when applying methods to groups of data. Also, varfun can perform both grouped and ungrouped calculations.

  • Consider using findgroups and splitapply for efficiency when you make many consecutive grouped calculations on a large table.

  • Consider using findgroups and splitapply to append new arrays to an existing table of results.

  • To perform calculations using a method that returns multiple outputs, such as bounds, use either rowfun or splitapply.

  • To perform calculations along rows using a method that requires multiple input arguments, use either rowfun or splitapply.

See Also

| | | | | | | | | | | | | |

Related Topics