findgroups

Find groups and return group numbers

Syntax

G = findgroups(A)

G = findgroups(A1,...,AN)

[G,ID] = findgroups(A)

[G,ID1,...,IDN] = findgroups(A1,...,AN)

G = findgroups(T)

[G,TID] = findgroups(T)

Description

To split data into groups and apply a function to the groups, use the findgroups and splitapply functions together. For more information about calculations on groups of data, see Calculations on Groups of Data.

G = findgroups(A) returns G, a vector of group numbers created from the grouping variable A. The output argument G contains integer values from 1 to N, indicating N distinct groups for the N unique values in A. For example, if A is ["b","a","a","b"], then findgroups returns G as [2 1 1 2].

To use G to split groups of data out of other variables, pass it as an input argument to the splitapply function.
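
As a minimal sketch of this workflow, the following uses small hypothetical arrays (the names `x` and `totals` and all values are illustrative assumptions, not taken from the sample data). Column vectors are used so that the grouping variable and the data variable have matching sizes.

```matlab
% Hypothetical data, chosen only for illustration.
A = ["b";"a";"a";"b"];          % grouping variable
x = [10;20;30;45];              % data variable, one value per element of A

G = findgroups(A);              % G is [2;1;1;2]: "a" is group 1, "b" is group 2
totals = splitapply(@sum,x,G);  % totals is [50;55], one sum per group
```

The results in `totals` are ordered by group number, so the first element belongs to "a" and the second to "b".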

The findgroups function treats missing strings, empty character vectors, and NaN, NaT, and undefined categorical values in A as missing values, and returns NaN as the corresponding elements of G.
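
The sketch below illustrates the missing-value behavior on a hypothetical numeric vector. Filtering out the NaN group numbers before further processing is one conservative way to handle them, shown here as an assumption about what a caller might want rather than as a required step.

```matlab
% Hypothetical data with one missing value.
A = [3;NaN;1;3];
x = [10;20;30;40];

G = findgroups(A);        % G is [2;NaN;1;2]; the NaN in A has no group number

% Keep only the elements that were assigned to a group.
valid = ~isnan(G);
totals = splitapply(@sum,x(valid),G(valid));   % totals is [30;50]
```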

G = findgroups(A1,...,AN) creates group numbers from A1,...,AN. The findgroups function defines groups as the unique combinations of values across A1,...,AN. For example, if A1 is ["a","a","b","b"] and A2 is [0 1 0 0], then findgroups(A1,A2) returns G as [1 2 3 3], because the combination "b" 0 occurs twice.

[G,ID] = findgroups(A) also returns the unique values for each group in ID. For example, if A is ["b","a","a","b"], then findgroups returns G as [2 1 1 2] and ID as ["a","b"]. The arguments A and ID are the same data type, but need not be the same size.

[G,ID1,...,IDN] = findgroups(A1,...,AN) also returns the unique values for each group across ID1,...,IDN. The values across ID1,...,IDN define the groups. For example, if A1 is ["a","a","b","b"] and A2 is [0 1 0 0], then findgroups(A1,A2) returns G as [1 2 3 3], and ID1 and ID2 as ["a","a","b"] and [0 1 0].
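
The two-variable case described above can be sketched as follows, using the same values written as column vectors. The names `labels` and `counts` are illustrative assumptions; they show one way to turn the ID outputs into readable group labels and to count the members of each group.

```matlab
A1 = ["a";"a";"b";"b"];
A2 = [0;1;0;0];

[G,ID1,ID2] = findgroups(A1,A2);
% G   is [1;2;3;3]
% ID1 is ["a";"a";"b"]
% ID2 is [0;1;0]

% Row k of ID1 and ID2 together describes group k.
labels = ID1 + "/" + string(ID2);   % ["a/0";"a/1";"b/0"]
counts = splitapply(@numel,A2,G);   % [1;1;2]
```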

G = findgroups(T) returns G, a vector of group numbers created from the variables in table T. The findgroups function treats all the variables in T as grouping variables.

[G,TID] = findgroups(T) also returns TID, a table that contains the unique values for each group. TID contains the unique combinations of values across the variables of T. The variables in T and TID have the same names, but the tables need not have the same number of rows.
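
A small sketch of the table syntax, using a hypothetical four-row table (the variable names `Letter`, `Flag`, and `Count` are assumptions made for illustration):

```matlab
% Hypothetical table with two grouping variables.
Letter = ["a";"a";"b";"b"];
Flag   = [0;1;0;0];
T = table(Letter,Flag);

[G,TID] = findgroups(T);
% G is [1;2;3;3]. TID has 3 rows, one per unique (Letter,Flag) combination,
% and the same variable names as T.

% Attach a per-group result to the table of group identifiers.
TID.Count = splitapply(@numel,Flag,G);   % [1;1;2]
```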

Examples

Use Group Numbers to Split Data

Use group numbers to split patient weight measurements into groups of weights for smokers and nonsmokers. Then calculate the mean weight for each group of patients.

Load patient data from the sample file patients.mat.

load patients
whos Smoker Weight

  Name          Size            Bytes  Class      Attributes

  Smoker      100x1               100  logical              
  Weight      100x1               800  double

Specify groups with findgroups. Each element of G is a group number that specifies which group a patient is in. Group 1 contains nonsmokers and group 2 contains smokers.

G = findgroups(Smoker)

Display the weights of the patients.

Weight

Weight = 100×1

   176
   163
   131
   133
   119
   142
   142
   180
   183
   132
      ⋮

Split the Weight array into two groups of weights using G. Apply the mean function. The mean weight of the nonsmokers (about 149.9) is lower than the mean weight of the smokers (about 161.9).

meanWeights = splitapply(@mean,Weight,G)

meanWeights = 2×1

  149.9091
  161.9412

Use Two Grouping Variables to Split Data

Calculate mean weights for groups of patients. In this case, group patients by their statuses as smokers or nonsmokers, and by the hospitals where they were seen. There are three hospitals in the data set, so there are six groups of patients.

Load hospital locations, smoker status, and weights for patients from the sample file patients.mat.

load patients
whos Location Smoker Weight

  Name            Size            Bytes  Class      Attributes

  Location      100x1             14208  cell                 
  Smoker        100x1               100  logical              
  Weight        100x1               800  double

Display the Location and Smoker arrays.

Location

Location = 100x1 cell
    {'County General Hospital'  }
    {'VA Hospital'              }
    {'St. Mary's Medical Center'}
    {'VA Hospital'              }
    {'County General Hospital'  }
    {'St. Mary's Medical Center'}
    {'VA Hospital'              }
    {'VA Hospital'              }
    {'St. Mary's Medical Center'}
    {'County General Hospital'  }
    {'County General Hospital'  }
    {'St. Mary's Medical Center'}
    {'VA Hospital'              }
    {'VA Hospital'              }
    {'St. Mary's Medical Center'}
    {'VA Hospital'              }
    {'St. Mary's Medical Center'}
    {'VA Hospital'              }
    {'County General Hospital'  }
    {'County General Hospital'  }
    {'VA Hospital'              }
    {'VA Hospital'              }
    {'VA Hospital'              }
    {'County General Hospital'  }
    {'County General Hospital'  }
    {'VA Hospital'              }
    {'VA Hospital'              }
    {'County General Hospital'  }
    {'County General Hospital'  }
    {'County General Hospital'  }
      ⋮

Smoker

Smoker = 100x1 logical array

   1
   0
   0
   0
   0
   0
   1
   0
   0
   0
      ⋮

Specify groups using locations and smoker status. G contains integers from one to six because there are six possible combinations of values from Smoker and Location.

G = findgroups(Location,Smoker)

Calculate the mean weight for each group. The mean weight varies less by location than by smoker status.

meanWeights = splitapply(@mean,Weight,G)

meanWeights = 6×1

  150.1739
  159.8125
  146.8947
  158.4000
  152.0417
  165.9231

Use Group IDs from Second Output

Calculate the mean weights for groups of patients and display the results in a table. To associate the mean weights with group IDs, use the second output argument from findgroups.

Load patient weights and smoker statuses from the sample file patients.mat.

load patients
whos Smoker Weight

  Name          Size            Bytes  Class      Attributes

  Smoker      100x1               100  logical              
  Weight      100x1               800  double

Specify groups using findgroups. The values in the output argument ID are labels for the groups that findgroups finds in the grouping variable.

[G,ID] = findgroups(Smoker)

ID = 2x1 logical array

   0
   1

Calculate the mean weights. Create a table that contains the mean weights.

meanWeight = splitapply(@mean,Weight,G);
T = table(ID,meanWeight,'VariableNames',["Smokers","Mean Weights"])

T=2×2 table
    Smokers    Mean Weights
    _______    ____________

     false        149.91   
     true         161.94

Use Group IDs from Two Grouping Variables

Calculate mean weights for groups of patients and display the results in a table. In this case, group patients by their statuses as smokers or nonsmokers, and by the hospitals where they were seen.

Load hospital locations, smoker status, and weights for patients from the sample file patients.mat.

load patients
whos Location Smoker Weight

  Name            Size            Bytes  Class      Attributes

  Location      100x1             14208  cell                 
  Smoker        100x1               100  logical              
  Weight        100x1               800  double

Convert Location to a string array. Then specify groups using locations and smoker status. You can specify two group IDs as additional outputs because you specify two grouping variables as inputs. There are six possible combinations of locations and smoker status. Together ID1 and ID2 provide IDs for the six groups.

Location = string(Location);
[G,ID1,ID2] = findgroups(Location,Smoker)

ID1 = 6x1 string
    "County General Hospital"
    "County General Hospital"
    "St. Mary's Medical Center"
    "St. Mary's Medical Center"
    "VA Hospital"
    "VA Hospital"

ID2 = 6x1 logical array

   0
   1
   0
   1
   0
   1

Calculate the mean weight for each group.

meanWeights = splitapply(@mean,Weight,G)

meanWeights = 6×1

  150.1739
  159.8125
  146.8947
  158.4000
  152.0417
  165.9231

Create a table with the mean weight for each group of patients.

T = table(ID1,ID2,meanWeights,'VariableNames',["Hospital","Smoker","Mean Weight"])

T=6×3 table
             Hospital              Smoker    Mean Weight
    ___________________________    ______    ___________

    "County General Hospital"      false       150.17   
    "County General Hospital"      true        159.81   
    "St. Mary's Medical Center"    false       146.89   
    "St. Mary's Medical Center"    true         158.4   
    "VA Hospital"                  false       152.04   
    "VA Hospital"                  true        165.92

Group by Table Variables

Calculate mean weights for patients using grouping variables that are in a table.

Load hospital locations and smoking statuses for 100 patients into a table.

load patients
T = table(Location,Smoker)

T=100×2 table
              Location               Smoker
    _____________________________    ______

    {'County General Hospital'  }    true  
    {'VA Hospital'              }    false 
    {'St. Mary's Medical Center'}    false 
    {'VA Hospital'              }    false 
    {'County General Hospital'  }    false 
    {'St. Mary's Medical Center'}    false 
    {'VA Hospital'              }    true  
    {'VA Hospital'              }    false 
    {'St. Mary's Medical Center'}    false 
    {'County General Hospital'  }    false 
    {'County General Hospital'  }    false 
    {'St. Mary's Medical Center'}    false 
    {'VA Hospital'              }    false 
    {'VA Hospital'              }    true  
    {'St. Mary's Medical Center'}    false 
    {'VA Hospital'              }    true  
      ⋮

Specify groups of patients using the Location and Smoker variables in T.

G = findgroups(T)

Calculate mean weights from the data array Weight.

meanWeights = splitapply(@mean,Weight,G)

meanWeights = 6×1

  150.1739
  159.8125
  146.8947
  158.4000
  152.0417
  165.9231

Group from Table and Create Output Table

Create a table of mean weights for patients grouped by hospital location and status as a smoker or nonsmoker.

Load locations and smoking statuses for patients into a table. Convert Location to a string array.

load patients
Location = string(Location);
T = table(Location,Smoker)

T=100×2 table
             Location              Smoker
    ___________________________    ______

    "County General Hospital"      true  
    "VA Hospital"                  false 
    "St. Mary's Medical Center"    false 
    "VA Hospital"                  false 
    "County General Hospital"      false 
    "St. Mary's Medical Center"    false 
    "VA Hospital"                  true  
    "VA Hospital"                  false 
    "St. Mary's Medical Center"    false 
    "County General Hospital"      false 
    "County General Hospital"      false 
    "St. Mary's Medical Center"    false 
    "VA Hospital"                  false 
    "VA Hospital"                  true  
    "St. Mary's Medical Center"    false 
    "VA Hospital"                  true  
      ⋮

Specify groups of patients using the Location and Smoker variables in T. The output table TID identifies the groups.

[G,TID] = findgroups(T);
TID

TID=6×2 table
             Location              Smoker
    ___________________________    ______

    "County General Hospital"      false 
    "County General Hospital"      true  
    "St. Mary's Medical Center"    false 
    "St. Mary's Medical Center"    true  
    "VA Hospital"                  false 
    "VA Hospital"                  true

Calculate mean weights from the data array Weight. Append the mean weights to TID.

TID.meanWeight = splitapply(@mean,Weight,G)

TID=6×3 table
             Location              Smoker    meanWeight
    ___________________________    ______    __________

    "County General Hospital"      false       150.17  
    "County General Hospital"      true        159.81  
    "St. Mary's Medical Center"    false       146.89  
    "St. Mary's Medical Center"    true         158.4  
    "VA Hospital"                  false       152.04  
    "VA Hospital"                  true        165.92

Input Arguments

`A` — Grouping variable
vector

Grouping variable, specified as a vector. The unique values in A identify groups. You can specify grouping variables using the data types listed in the table.

| Values That Specify Groups | Data Type of Grouping Variable |
| --- | --- |
| Numbers | Numeric or logical vector |
| Text | String array or cell array of character vectors |
| Dates and times | `datetime`, `duration`, or `calendarDuration` vector |
| Categories | `categorical` vector |
| Bins | Vector of binned values, created by binning a continuous distribution of numeric, `datetime`, or `duration` values |
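
As an illustration of the last row of the table, the sketch below bins hypothetical numeric values with `discretize` and then groups by bin. The values, the bin edges, and the variable names are assumptions chosen for the example.

```matlab
% Hypothetical ages and bin edges.
ages  = [5;17;30;64;41];
edges = [0 18 65];                    % two bins: [0,18) and [18,65]

bins = discretize(ages,edges);        % bin index per element: [1;1;2;2;2]
[G,ID] = findgroups(bins);            % one group per occupied bin

meanAge = splitapply(@mean,ages,G);   % [11;45]
```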

`T` — Grouping variables
table

Grouping variables, specified as a table. findgroups treats each table variable as a separate grouping variable.

A table variable can be a numeric, logical, string, categorical, datetime, duration, or calendarDuration vector, or a cell array of character vectors.

Output Arguments

`G` — Group numbers
vector of positive integers

Group numbers, returned as a vector of positive integers. For N groups identified in the grouping variables, every integer between 1 and N specifies a group. G contains NaN where any grouping variable contains a missing string, an empty character vector, a NaN, NaT, or undefined categorical value.

If the grouping variables are vectors, then G and the grouping variables all are the same size.
If the grouping variables are in a table, the length of G is equal to the number of rows of the table.

`ID` — Values that identify each group
vector of sorted unique values

Values that identify each group, returned as a vector of sorted unique values from the input argument A. The data type of ID is the same as the data type of A.
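
A short sketch of how ID relates to G, using a hypothetical numeric vector. Because ID holds the sorted unique values and G indexes into them, `ID(G)` reproduces the grouping variable when it contains no missing values.

```matlab
A = [30;10;20;10];          % hypothetical grouping variable

[G,ID] = findgroups(A);
% ID is [10;20;30] (sorted unique values of A)
% G  is [3;1;2;1]  (position of each element of A within ID)

reconstructed = ID(G);      % equal to A
```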

`TID` — Table of unique values that identify each group
table

The unique values that identify each group, returned as a table. Each row of TID contains one unique combination of values across the variables of T, and the rows are sorted. TID and T have the same variable names, but need not have the same number of rows.

More About

Calculations on Groups of Data

In data analysis, you commonly perform calculations on groups of data. For such calculations, you split one or more data variables into groups of data, perform a calculation on each group, and combine the results into one or more output variables. You can specify the groups using one or more grouping variables. The unique values in the grouping variables define the groups that the corresponding values of the data variables belong to.

For example, consider a simple grouped calculation that splits a 6-by-1 numeric vector into two groups of data, calculates the mean of each group, and then combines the outputs into a 2-by-1 numeric vector. The 6-by-1 grouping variable has two unique values, AB and XYZ.

You can specify grouping variables that have numbers, text, dates and times, categories, or bins.
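
The split, apply, and combine steps can be sketched in code. The numeric values below are hypothetical (only the group names AB and XYZ and the 6-by-1 and 2-by-1 sizes come from the text). The explicit loop is included only to show what the grouped calculation amounts to; it is not a description of how splitapply is implemented.

```matlab
% Hypothetical 6-by-1 data variable and grouping variable.
data = [2;4;6;4;8;6];
key  = ["AB";"XYZ";"XYZ";"AB";"XYZ";"AB"];

[G,ID] = findgroups(key);             % ID is ["AB";"XYZ"]

% Split, apply, and combine in one call.
means = splitapply(@mean,data,G);     % 2-by-1 result: [4;6]

% The same calculation written out group by group.
manual = zeros(numel(ID),1);
for k = 1:numel(ID)
    manual(k) = mean(data(G == k));   % split with a logical mask, then apply
end
```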

Extended Capabilities

Tall Arrays
Calculate with arrays that have more rows than fit in memory.

This function supports tall arrays with these limitations:

Tall tables are not supported.
The order of the group numbers in G might be different compared to in-memory findgroups calculations.

For more information, see Tall Arrays for Out-of-Memory Data.

Thread-Based Environment
Run code in the background using MATLAB® `backgroundPool` or accelerate code with Parallel Computing Toolbox™ `ThreadPool`.

This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.

Version History

Introduced in R2015b
