grpstats

Summary statistics organized by group

Syntax

  • statarray = grpstats(tbl,groupvar)
  • statarray = grpstats(tbl,groupvar,whichstats)
  • statarray = grpstats(tbl,groupvar,whichstats,Name,Value)
  • means = grpstats(X,group)
  • [stats1,...,statsN] = grpstats(X,group,whichstats)
  • [stats1,...,statsN] = grpstats(X,group,whichstats,'Alpha',alpha)
  • grpstats(X,group,alpha)

Description

statarray = grpstats(tbl,groupvar) returns a table or dataset array with the means for the groups of data in tbl, where the groups are determined by the values of the grouping variable or variables specified in groupvar.

  • If there is a single grouping variable, then there is a row in statarray for each value of the grouping variable. grpstats sorts the groups by order of appearance (if the grouping variable is a character array), in ascending numeric order (if the grouping variable is numeric), or in order of the levels (if the grouping variable is categorical).

  • If groupvar is a cell array of strings containing multiple grouping variable names, or a vector of column numbers, then there is a row in statarray for each observed unique combination of values of the grouping variables. grpstats sorts the groups by the values of the first grouping variable, then the second grouping variable, and so on.

  • If any variables in tbl (other than those specified in groupvar) are not numeric or logical arrays, then you must specify the names or column numbers of the numeric and logical variables for which you want to calculate means using the name-value pair argument, DataVars.

statarray = grpstats(tbl,groupvar,whichstats) returns the group values of the summary statistic types specified in whichstats.

statarray = grpstats(tbl,groupvar,whichstats,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

means = grpstats(X,group) returns a column vector or matrix with the means of the groups of the data in the matrix or vector X determined by the values of the grouping variable or variables, group. The rows of means correspond to the grouping variable values.

  • If there is a single grouping variable, then there is a row in means for each value of the grouping variable. grpstats sorts the groups by order of appearance (if the grouping variable is a character array), in ascending numeric order (if the grouping variable is numeric), or in order of the levels (if the grouping variable is categorical).

  • If group is a cell array of grouping variables, then there is a row in means for each observed unique combination of values of the grouping variables. grpstats sorts the groups by the values of the first grouping variable, then the second grouping variable, and so on.

  • If X is a matrix, then means is a matrix with the same number of columns as X. Each column of means has the group means for the corresponding column of X.
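The ordering and column rules above can be checked on a small hand-made matrix. The following is a minimal sketch; the data values and the variable names X and group are made up for illustration.

```matlab
% Hypothetical data: four observations, two columns.
X = [1 10; 3 30; 5 50; 7 70];

% Numeric grouping variable: groups are sorted in ascending numeric order,
% so group 1 (rows 3 and 4) comes before group 2 (rows 1 and 2).
group = [2; 2; 1; 1];

means = grpstats(X,group);

% Computed by hand:
%   group 1 -> mean of rows 3 and 4 -> [6 60]
%   group 2 -> mean of rows 1 and 2 -> [2 20]
disp(means)
```

Each column of means holds the group means for the matching column of X, and the rows follow the sorted group values rather than the order in which the groups appear in the data.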

[stats1,...,statsN] = grpstats(X,group,whichstats) returns column vectors or arrays with group values for the summary statistic types specified in whichstats.

[stats1,...,statsN] = grpstats(X,group,whichstats,'Alpha',alpha) specifies the significance level for confidence and prediction intervals.

grpstats(X,group,alpha) plots the means of the groups of data in the vector or matrix X determined by the values of the grouping variable, group. The grouping variable values are on the horizontal plot axis. Each group mean has 100×(1 – alpha)% confidence intervals.

  • If X is a matrix, then grpstats plots the means and confidence intervals for each column of X.

  • If group is a cell array of grouping variables, then grpstats plots the means and confidence intervals for the groups of data in X determined by the unique combinations of values of the grouping variables. For example, if there are two grouping variables, each with two values, there are four possible combinations of grouping variable values. The plot includes only the combinations of values that exist in the input grouping variables (not all possible combinations).

Examples

Dataset Array Summary Statistics Organized by Group

Load the sample data.

load('hospital')

The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Sex, Age, Weight, and Smoker.

ds = hospital(:,{'Sex','Age','Weight','Smoker'});

Sex is a nominal array, with levels Male and Female. The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the mean for the numeric and logical arrays, Age, Weight, and Smoker, grouped by the levels in Sex.

statarray = grpstats(ds,'Sex')
statarray = 

              Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
    Female    Female    53            37.717      130.47         0.24528    
    Male      Male      47            38.915      180.53         0.44681    

statarray is a dataset array with two rows, corresponding to the levels in Sex. GroupCount is the number of observations in each group. The means of Age, Weight, and Smoker, grouped by Sex, are given in mean_Age, mean_Weight, and mean_Smoker.

Compute the mean for Age and Weight, grouped by the values in Smoker.

statarray = grpstats(ds,'Smoker','mean','DataVars',{'Age','Weight'})
statarray = 

         Smoker    GroupCount    mean_Age    mean_Weight
    0    false     66             37.97      149.91     
    1    true      34            38.882      161.94     

In this case, not all variables in ds (excluding the grouping variable, Smoker) are numeric or logical arrays; the variable Sex is a nominal array. When not all variables in the input dataset array are numeric or logical arrays, you must specify the variables for which you want to calculate summary statistics using DataVars.

Compute the minimum and maximum weight, grouped by the combinations of values in Sex and Smoker.

statarray = grpstats(ds,{'Sex','Smoker'},{'min','max'},...
                     'DataVars','Weight')
statarray = 

                Sex       Smoker    GroupCount    min_Weight    max_Weight
    Female_0    Female    false     40            111           147       
    Female_1    Female    true      13            115           146       
    Male_0      Male      false     26            158           194       
    Male_1      Male      true      21            164           202  

There are two unique values in Smoker and two levels in Sex, for a total of four possible combinations of values: Female Nonsmoker (Female_0), Female Smoker (Female_1), Male Nonsmoker (Male_0), and Male Smoker (Male_1).

Specify the names for the columns in the output.

statarray = grpstats(ds,{'Sex','Smoker'},{'min','max'},...
                     'DataVars','Weight','VarNames',{'Gender','Smoker',...
                     'GroupCount','LowestWeight','HighestWeight'})
statarray = 

                Gender    Smoker    GroupCount    LowestWeight    HighestWeight
    Female_0    Female    false     40            111             147          
    Female_1    Female    true      13            115             146          
    Male_0      Male      false     26            158             194          
    Male_1      Male      true      21            164             202      

Summary Statistics for a Dataset Array Without Grouping

Load the sample data.

load('hospital')

The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Age, Weight, and Smoker.

ds = hospital(:,{'Age','Weight','Smoker'});

The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the mean, minimum, and maximum for the numeric and logical arrays, Age, Weight, and Smoker, with no grouping.

statarray = grpstats(ds,[],{'mean','min','max'})
statarray = 

           GroupCount    mean_Age    min_Age    max_Age    mean_Weight
    All    100           38.28       25         50         154        


           min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
    All    111           202           0.34           false         true   

The observation name All indicates that all observations in ds were used to compute the summary statistics.

Group Means for a Matrix Using One or More Grouping Variables

Load the sample data.

load('carsmall')

All variables are measured for 100 cars. Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). Cylinders has three unique values, 4, 6, and 8, indicating the number of cylinders in each car.

Calculate the mean acceleration, grouped by country of origin.

means = grpstats(Acceleration,Origin)
means =

   14.4377
   18.0500
   15.8867
   16.3778
   16.6000
   15.5000

means is a 6-by-1 vector of mean accelerations, where each value corresponds to a country of origin.

Calculate the mean acceleration, grouped by both country of origin and number of cylinders.

means = grpstats(Acceleration,{Origin,Cylinders})
means =

   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000

There are 18 possible combinations of grouping variable values because Origin has 6 unique values and Cylinders has 3 unique values. Only 10 of the possible combinations appear in the data, so means is a 10-by-1 vector of group means corresponding to the observed combinations of values.

Return the group names along with the mean acceleration for each group.

[means,grps] = grpstats(Acceleration,{Origin,Cylinders},...
                        {'mean','gname'})
means =

   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000


grps = 

    'USA'        '4'
    'USA'        '6'
    'USA'        '8'
    'France'     '4'
    'Japan'      '4'
    'Japan'      '6'
    'Germany'    '4'
    'Germany'    '6'
    'Sweden'     '4'
    'Italy'      '4'

The output grps shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.

Multiple Summary Statistics for a Matrix Organized by Group

Load the sample data.

load('carsmall')

The variable Acceleration was measured for 100 cars. The variable Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).

Return the minimum, median, and maximum acceleration, grouped by country of origin.

[grpMin,grpMed,grpMax,grp] = grpstats(Acceleration,Origin,...
                               {'min','median','max','gname'})
grpMin =

    8.0000
   15.3000
   13.9000
   12.2000
   15.7000
   15.5000


grpMed =

   14.7000
   17.5000
   15.7000
   15.3000
   16.6000
   15.5000


grpMax =

   22.2000
   21.9000
   18.2000
   24.6000
   17.5000
   15.5000


grp = 

    'USA'
    'France'
    'Japan'
    'Germany'
    'Sweden'
    'Italy'

The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.

Plot Prediction Intervals for a New Observation in Each Group

Load the sample data.

load('carsmall')

The variable Weight was measured for 100 cars. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Calculate the mean weight and 90% prediction intervals for each model year.

[means,pred,grp] = grpstats(Weight,Model_Year,...
                      {'mean','predci','gname'},'Alpha',0.1);

Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.

ngrps = length(grp); % Number of groups

figure()
errorbar((1:ngrps)',means,pred(:,2)-means)
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')

Plot Group Means and Confidence Intervals

Load the sample data.

load('carsmall')

The variables Acceleration and Weight are the acceleration and weight values measured for 100 cars. The variable Cylinders is the number of cylinders in each car. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Plot mean acceleration, grouped by Cylinders, with 95% confidence intervals.

grpstats(Acceleration,Cylinders,0.05)

The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.

Plot mean acceleration and weight, grouped by Cylinders, and 95% confidence intervals. Scale the Weight values by 1000 so the means of Weight and Acceleration are the same order of magnitude.

grpstats([Acceleration,Weight/1000],Cylinders,0.05)

The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.

Plot mean acceleration, grouped by both Cylinders and Model_Year. Specify 95% confidence intervals.

grpstats(Acceleration,{Cylinders,Model_Year},0.05)

There are nine possible combinations of grouping variable values because there are three unique values in Cylinders and three unique values in Model_Year. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.

The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.

Input Arguments

tbl — Input data
table | dataset array

Input data, specified as a table or dataset array. tbl must include at least one variable that is a grouping variable.

Summary statistics can only be calculated for variables that have a numeric or logical data type. If any variables in tbl (other than the grouping variables) are not numeric or logical arrays, then use the name-value pair argument DataVars to specify the names or column numbers of the numeric and logical variables for which to calculate summary statistics.

groupvar — Identifiers for the grouping variables
cell array of strings | vector of positive integers | logical vector | []

Identifiers for the grouping variables in the input data, tbl, specified as one of the following:

  • String or cell array of strings: names of the grouping variables

  • Positive integer or vector of positive integers: variable numbers of the grouping variables

  • Vector of logical values with number of elements equal to the number of variables in tbl: logical indicator with value true for grouping variables and false otherwise

  • []: no groups (returns summary statistics for all data)

Any variable that is identified by groupvar as a grouping variable must have a valid grouping variable data type: categorical array, logical or numeric vector, or cell array of strings.

For example, consider an input table, tbl, with six variables. The fourth variable is named Gender. To be a valid grouping variable, the data type of Gender might be a cell array of strings or a nominal array, with the unique values Male and Female. To specify the variable Gender as the grouping variable, you can use any of these syntaxes:

  • statarray = grpstats(tbl,'Gender')

  • statarray = grpstats(tbl,4)

  • statarray = grpstats(tbl,logical([0 0 0 1 0 0]))

Data Types: double | logical | cell | char
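As a minimal sketch of the three equivalent ways to identify a grouping variable, the following code builds a small six-variable table whose fourth variable is Gender. The table contents and the variable names A, B, C, D, and E are hypothetical and exist only for this illustration.

```matlab
% Hypothetical six-variable table; Gender is the fourth variable.
A = [1; 2; 3; 4];
B = [10; 20; 30; 40];
C = [5; 6; 7; 8];
Gender = {'Male'; 'Female'; 'Male'; 'Female'};
D = [0.1; 0.2; 0.3; 0.4];
E = [9; 8; 7; 6];
tbl = table(A,B,C,Gender,D,E);

% Identify the grouping variable by name, by column number,
% and by logical indicator.
s1 = grpstats(tbl,'Gender');
s2 = grpstats(tbl,4);
s3 = grpstats(tbl,logical([0 0 0 1 0 0]));

% All three calls describe the same grouping, so the results should match.
tf = isequal(s1,s2,s3)
```

Identifying the variable by name is usually the most robust choice, because column numbers and logical indicators change meaning if variables are added to or removed from tbl.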

whichstats — Types of summary statistics
string | function handle

Types of summary statistics to compute, specified as a string or function handle, or a cell array of strings and function handles. Use a cell array to specify multiple types of summary statistics.

Possible string values are:

  'mean'      Mean
  'sem'       Standard error of the mean
  'numel'     Count, or number, of non-NaN elements
  'gname'     Group name
  'std'       Standard deviation
  'var'       Variance
  'min'       Minimum
  'max'       Maximum
  'range'     Range
  'meanci'    95% confidence interval for the mean
  'predci'    95% prediction interval for a new observation

Example: [stat1,stat2] = grpstats(X,group,{'mean','sem'})

You can specify different significance levels for the 'meanci' and 'predci' options using the name-value pair argument, Alpha.

To specify other types of summary statistics, you can use function handles. You can use the handle to any function that accepts a column or matrix of data and returns output of the same size each time grpstats calls it (even if the input for some groups is empty).

If the function accepts a column of data, then the function can return either a scalar value, or an nvals-by-1 column vector for descriptive statistics of length nvals (for example, confidence intervals have length two). If the function accepts a matrix, it must either return a 1-by-ncols row vector, or an nvals-by-ncols matrix, where ncols is the number of columns in the input data matrix.

Example: [stat1,stat2,stat3] = grpstats(X,group,{'mean','std',@skewness})

For functions that do not compute column-wise statistics, specify the computation direction while specifying the function.

Example: stat1 = grpstats(X,group,@(x)sum(x,1))

Data Types: char | function_handle
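The following sketch illustrates why the computation direction matters for a function handle. The data are hypothetical, and group 2 deliberately contains a single observation, the case where a direction-dependent function such as sum could otherwise operate along the row.

```matlab
% Hypothetical data: three observations, two columns.
% Group 2 has only one row.
X = [1 2; 3 4; 5 6];
group = [1; 1; 2];

% sum(x,1) always sums down the rows, so every group returns one value
% per column of X, even when the group contains a single observation.
colSums = grpstats(X,group,@(x)sum(x,1));

% Computed by hand:
%   group 1 -> [1+3, 2+4] = [4 6]
%   group 2 -> [5 6]
disp(colSums)
```

Fixing the dimension in the anonymous function keeps the output size the same for every group, which is the requirement stated above for function handles.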

alpha — Significance level
scalar value in the range (0,1)

Significance level, specified as a scalar value in the range (0,1).

  • When you specify 'meanci' or 'predci' in whichstats, you can use alpha to specify the significance level for the confidence or prediction intervals. If you specify alpha, then grpstats returns 100×(1 – alpha)% confidence or prediction intervals. If you do not specify alpha, then grpstats returns 95% intervals (alpha = 0.05).

  • Use alpha with the grpstats(X,group,alpha) syntax to plot group means and corresponding 100×(1 – alpha)% confidence intervals.

Data Types: double

X — Input data
vector | matrix

Input data, specified as a vector or a matrix. If X is a matrix, then grpstats returns summary statistics for each column of X.

Data Types: double | single

group — Grouping variable
categorical array | logical or numeric vector | cell array of strings | []

Grouping variable, specified as a categorical array, logical or numeric vector, or cell array of strings. Each unique value in a grouping variable defines a group. grpstats groups data for summary statistics using the grouping variable values.

There must be a grouping variable value for each row of the input data X. Observations (rows) with the same value of the grouping variable are in the same group. Use [] to compute summary statistics for all data, without using groups.

For example, if Gender is a cell array of strings with values 'Male' and 'Female', you can use Gender as a grouping variable to summarize your data by gender.

You can also use more than one grouping variable to group data for summary statistics. In this case, specify a cell array of grouping variables.

For example, if Smoker is a logical vector with values 0 for nonsmokers and 1 for smokers, then specifying the cell array {Gender,Smoker} divides observations into four groups: Male Smoker, Male Nonsmoker, Female Smoker, and Female Nonsmoker. grpstats returns summary statistics only for the combinations of values that exist in the input grouping variables (not all possible combinations).
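A minimal sketch of grouping by two variables follows. The values of Gender, Smoker, and X are made up for illustration, and the data intentionally contain no female smokers, so only three of the four possible combinations are observed.

```matlab
% Hypothetical data: five observations.
Gender = {'Male'; 'Male'; 'Female'; 'Female'; 'Male'};
Smoker = logical([1; 0; 0; 0; 1]);
X = [70; 80; 60; 62; 74];

% Group by both variables and also return the group names.
[m,names] = grpstats(X,{Gender,Smoker},{'mean','gname'});

% Only the observed combinations produce a row:
%   Male smokers      -> mean of 70 and 74 = 72
%   Male nonsmokers   -> 80
%   Female nonsmokers -> mean of 60 and 62 = 61
nGroups = size(m,1)   % 3, not 4
```

Requesting 'gname' alongside the statistics is a convenient way to see which row of the output belongs to which combination of grouping values.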

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'DataVars',[1,3,4],'Alpha',0.01 specifies that summary statistics be calculated for the 1st, 3rd, and 4th variables in a dataset array, with 99% confidence intervals.

'Alpha' — Significance level
0.05 (default) | scalar value in the range (0,1)

Significance level for confidence and prediction intervals, specified as the comma-separated pair consisting of 'Alpha' and a scalar value in the range (0,1).

When you include 'meanci' or 'predci' in whichstats, you can use Alpha to specify the significance level for confidence or prediction intervals. If you specify the value α, then grpstats returns 100×(1 – α)% confidence or prediction intervals.

If you do not specify a value for Alpha, then grpstats returns 95% intervals (α = 0.05).

Example: 'Alpha',0.1

Data Types: double
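As a rough illustration of how Alpha affects the intervals, the sketch below requests 90% and 99% confidence intervals for the same hypothetical data and compares their widths. It assumes that the lower and upper limits are returned in the first and second positions for each group, as in the prediction interval example earlier on this page.

```matlab
% Hypothetical data: two groups of four observations each.
X = [2; 4; 6; 8; 1; 3; 5; 9];
group = [1; 1; 1; 1; 2; 2; 2; 2];

% 90% intervals (Alpha = 0.1) and 99% intervals (Alpha = 0.01).
ci90 = grpstats(X,group,'meanci','Alpha',0.1);
ci99 = grpstats(X,group,'meanci','Alpha',0.01);

% Interval width per group: upper limit minus lower limit.
w90 = ci90(:,2) - ci90(:,1);
w99 = ci99(:,2) - ci99(:,1);

% A smaller Alpha means a higher confidence level and a wider interval.
isWider = all(w99 > w90)
```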

'DataVars' — Variable names or columns
cell array of strings | vector of positive integers | logical vector

Variable names or columns indicating which variables in the input data tbl you want to compute summary statistics for, specified as the comma-separated pair consisting of 'DataVars' and a cell array of strings, vector of positive integers, or a logical vector. Use a string to specify a variable name, a positive integer to specify a variable column number, or logical values to indicate which variables to include (true if you want to compute summary statistics, false otherwise).

You must specify DataVars if there are any variables in tbl (other than the grouping variables specified in groupvar) that are not numeric or logical arrays. Summary statistics can only be calculated for variables that have a numeric or logical data type.

Example: 'DataVars',{'Height','Weight'}

Data Types: double | logical | cell | char

'VarNames' — Variable names for output
cell array of strings

Variable names for the output statarray, specified as the comma-separated pair consisting of 'VarNames' and a cell array of strings. By default, grpstats constructs output variable names by adding a prefix to the variable names from the input data tbl. This prefix is the summary statistic name; for example, the mean of the variable Age is named mean_Age.

Example: 'VarNames',{'Gender','GroupCount','MaleMean','FemaleMean'}

Data Types: cell

Output Arguments

statarray — Group summary statistics
table | dataset array

Group summary statistics, returned as a table or a dataset array. If tbl is a table, grpstats returns statarray as a table. If tbl is a dataset array, grpstats returns statarray as a dataset array.

statarray contains summary statistic values for the groups of data in tbl determined by the levels of the grouping variables specified by groupvar. There is a row in statarray for each observed value or combination of values in the variables specified by groupvar. The output statarray contains:

  • All grouping variables specified by groupvar.

  • The variable GroupCount, containing the number of observations in each group.

  • Group summary statistic values for all variables in tbl (other than those specified by groupvar), or for only the variables specified using DataVars.

The total number of variables in statarray is ngroupvars + 1 + ndatavars×nstats, where ngroupvars is the number of variables in groupvar, ndatavars is the number of variables for which summary statistics are computed, and nstats is the number of summary statistic types specified in whichstats.

grpstats assigns default names to the variables in statarray, unless you specify variable names using the name-value pair argument VarNames.
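The variable count ngroupvars + 1 + ndatavars×nstats can be checked on a small table. This is a sketch with hypothetical data; the variable names G1, G2, A, and B are assumptions made for the illustration.

```matlab
% Hypothetical table: two grouping variables (G1, G2), two data variables (A, B).
tbl = table([1; 1; 2; 2],[true; false; true; false], ...
            [10; 20; 30; 40],[1; 2; 3; 4], ...
            'VariableNames',{'G1','G2','A','B'});

% Two summary statistic types for each data variable.
s = grpstats(tbl,{'G1','G2'},{'min','max'});

% ngroupvars = 2, ndatavars = 2, nstats = 2:
%   2 + 1 + 2*2 = 7 variables
%   (G1, G2, GroupCount, and a min and max for each of A and B)
nVars = width(s)
```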

means — Group means
column vector | array

Group means for the groups of data in the vector or matrix X determined by the levels of group, returned as an ngroups-by-ncols array. Here, ngroups is the number of unique values in the grouping variable (or the number of observed combinations of values if group is a cell array of grouping variables), and ncols is the number of columns in X. If X is a vector, then means is a column vector.

stats1,...,statsN — Group summary statistics
column vectors | arrays

Group summary statistics for the groups of data in the vector or matrix X determined by the levels of group, returned as ngroups-by-ncols arrays. Here, ngroups is the number of unique values in the grouping variable (or the number of observed combinations of values if group is a cell array of grouping variables), and ncols is the number of columns in X. You must specify an output argument for each type of summary statistic specified in whichstats.

If a summary statistic type in whichstats returns a value of length nvals (for example, a confidence interval is a descriptive statistic of length two), then the corresponding output argument is an ngroups-by-ncols-by-nvals array.

More About

Algorithms

  • grpstats treats NaNs as missing values, and removes them from the input data before calculating summary statistics.

  • grpstats ignores empty group names.
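
The NaN handling described above can be seen in a short sketch. The data are hypothetical; group 1 contains one missing value.

```matlab
% Hypothetical data: group 1 has three observations, one of them NaN.
X = [1; NaN; 3; 4];
group = [1; 1; 1; 2];

% Request the mean and the count of non-NaN elements for each group.
[m,n] = grpstats(X,group,{'mean','numel'});

% The NaN is removed before the statistics are computed:
%   group 1 -> mean of 1 and 3 = 2, count = 2
%   group 2 -> mean = 4, count = 1
disp([m n])
```

Because the missing value is dropped rather than propagated, the group 1 mean is 2 instead of NaN, and 'numel' reports 2 elements even though the group has three rows.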
