Grouping variables are utility variables used to indicate which elements in a data set are to be considered together when computing statistics and creating visualizations. Typically, you use grouping variables for classification, where you either specify the group of an observation or try to infer it. Grouping variables may be:
Numeric vectors
String arrays (also called character arrays)
Cell arrays of strings
Categorical arrays
Logical vectors
Grouping variables have the same length as the variables (columns) in a data set. Observations (rows) i and j are considered to be in the same group if the values of the corresponding grouping variable are identical at those indices. Grouping variables with multiple columns are used to specify different groups within multiple variables.
For example, the following command loads the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species into the workspace:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica). To group the observations by species, the following are all acceptable (and equivalent) grouping variables:
group1 = species;          % Cell array of strings
group2 = grp2idx(species); % Numeric vector
group3 = char(species);    % Character array
group4 = nominal(species); % Categorical array
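As a minimal sketch of why these forms are interchangeable, the index form returned by grp2idx can be mapped back to the original labels. The variable names idx, names, and rebuilt are chosen here for illustration only and do not appear in the original example.

```matlab
load fisheriris                     % provides meas (150-by-4) and species (150-by-1 cell)

% idx holds one integer group index per observation; names holds the group labels
[idx, names] = grp2idx(species);

% Map each index back to its label to rebuild the cell array of strings
rebuilt = names(idx);

% Should be true if both forms encode the same grouping
sameGrouping = isequal(rebuilt, species);
```

If sameGrouping is true, no grouping information was lost in the conversion to a numeric vector.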
These grouping variables can be supplied as input arguments to any of the functions described in Functions for Grouped Data. Examples are given in Using Grouping Variables.
Your data can have missing entries. For grouping variables, represent missing data as shown in the following table.
| Data Type | Missing Entry |
|---|---|
| Numeric vector | NaN |
| Categorical vector | <undefined> |
| Character array | Row of spaces |
| Cell array of strings | '' |
| Logical vector | (Cannot represent) |
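The following sketch illustrates the table with toy data (hypothetical, not part of the fisheriris example). It assumes that grp2idx returns NaN for observations whose group is missing, so the missing rows can be located with isnan.

```matlab
% Hypothetical toy data: the second entry is missing in each representation
gCell = {'a'; ''; 'b'; 'a'};     % cell array of strings, '' marks the missing entry
gNum  = [1; NaN; 2; 1];          % numeric vector, NaN marks the missing entry

idxCell = grp2idx(gCell);        % group indices; missing entries are expected to be NaN
idxNum  = grp2idx(gNum);

missingCell = isnan(idxCell);    % logical mask of observations without a group
missingNum  = isnan(idxNum);
```

Both masks should flag only the second observation.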
Each level of a grouping variable defines a group. The levels and the order of levels are decided as follows:
For a numeric vector or a logical vector G, the set of group levels is the distinct values of G. The order is the sorted order of the unique values.
For a cell array of strings or a character array G, the set of group levels is the distinct strings of G. The order for strings is the order of their first appearance in G.
For a categorical vector G, the set of group levels and their order match the output of the getlabels(G) method.
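The ordering rules for numeric and string grouping variables can be checked with a small sketch. The data are hypothetical, and the second output of grp2idx is used here to inspect the levels.

```matlab
% Hypothetical grouping variables with levels deliberately out of order
gNum = [3; 1; 2; 1; 3];              % numeric vector
gStr = {'b'; 'a'; 'c'; 'a'};         % cell array of strings

[~, numLevels] = grp2idx(gNum);      % levels of the numeric vector, as strings
[~, strLevels] = grp2idx(gStr);      % levels of the cell array of strings

disp(numLevels')                     % numeric levels should appear in sorted order
disp(strLevels')                     % string levels should appear in order of first appearance
```

Under the rules above, the numeric levels should come out as 1, 2, 3, and the string levels as b, a, c.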
Some functions, such as grpstats, can take a cell array of several grouping variables (such as {G1 G2 G3}) to group the observations in the data set by each combination of the grouping variable levels. The order is decided first by the order of the first grouping variable, then by the order of the second grouping variable, and so on.
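A minimal sketch of grouping by two variables at once, using hypothetical toy data (x, g1, and g2 are illustrative names, not part of the original text):

```matlab
% Hypothetical data: six observations, two grouping variables
x  = [1; 2; 3; 4; 5; 6];
g1 = {'a'; 'a'; 'b'; 'b'; 'a'; 'b'};   % first grouping variable (strings)
g2 = [1; 2; 1; 2; 1; 2];               % second grouping variable (numeric)

% One group per observed combination of levels of g1 and g2
[means, names] = grpstats(x, {g1, g2}, {'mean', 'gname'});
```

Here names should contain one row per combination, with one column per grouping variable, ordered first by the levels of g1 and then by the levels of g2. The corresponding rows of means hold the group means of x.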
The following table lists general Statistics Toolbox functions that accept a grouping variable group as an input argument. The grouping variable may be in the form of a numeric or logical vector, string array, cell array of strings, or categorical array, as described in What Are Grouping Variables?.
For a full description of the syntax of any particular function, and examples of its use, consult its reference page, linked from the table. Using Grouping Variables also includes examples.
| Function | Basic Syntax for Grouped Data |
|---|---|
| gplotmatrix | gplotmatrix(x,y,group) |
| grp2idx | [G,GN] = grp2idx(group) |
| grpstats | means = grpstats(X,group) |
| gscatter | gscatter(x,y,group) |
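As an illustration of the basic syntax in the table, the following sketch calls gplotmatrix on the Fisher iris data. The choice of columns is arbitrary and made here only for demonstration.

```matlab
load fisheriris                      % provides meas and species

x = meas(:, 1:2);                    % sepal length and sepal width
y = meas(:, 3:4);                    % petal length and petal width

% Matrix of scatter plots of the columns of x against the columns of y,
% with markers distinguished by the grouping variable species
gplotmatrix(x, y, species)
```

The grouping variable must have one entry per row of x and y.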
This section provides an example demonstrating the use of grouping variables and associated functions. Grouping variables are introduced in What Are Grouping Variables?. A list of general functions accepting grouping variables appears in Functions for Grouped Data.
Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica).
Compute some basic statistics for the data (median and interquartile range), by group, using the grpstats function:
[order,number,group_median,group_iqr] = ...
grpstats(meas,species,{'gname','numel',@median,@iqr})
order =
'setosa'
'versicolor'
'virginica'
number =
50 50 50 50
50 50 50 50
50 50 50 50
group_median =
5.0000 3.4000 1.5000 0.2000
5.9000 2.8000 4.3500 1.3000
6.5000 3.0000 5.5500 2.0000
group_iqr =
0.4000 0.5000 0.2000 0.1000
0.7000 0.5000 0.6000 0.3000
0.7000 0.4000 0.8000 0.5000
The statistics appear in 3-by-4 arrays, corresponding to the 3 groups (categories) and 4 variables in the data. The rows of each array follow the order of the group names returned in order.
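As a sanity check, one row of the grouped output can be reproduced by selecting a single group by hand. This sketch uses strcmp and logical indexing. The names isSetosa, setosaMedian, and setosaCount are illustrative.

```matlab
load fisheriris

% Logical mask selecting the observations that belong to one group
isSetosa = strcmp(species, 'setosa');

% Column-wise median over that group only; this should match the
% first row of group_median
setosaMedian = median(meas(isSetosa, :));

% Number of observations in the group; this should match the first row of number
setosaCount = sum(isSetosa);
```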
You can use grouping variables to create visualizations based on categories of observations. The following scatter plot, created with the gscatter function, shows the correlation between sepal length and sepal width in two species of iris. Use ismember to subset the two species from species:
subset = ismember(species,{'setosa','versicolor'});
scattergroup = species(subset);
gscatter(meas(subset,1),...
meas(subset,2),...
scattergroup)
xlabel('Sepal Length')
ylabel('Sepal Width')
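To put a number on what the plot shows, the correlation coefficient can be computed separately for each group. This step is not part of the original example. It is a sketch that uses the core MATLAB function corrcoef, and the loop variable names are illustrative.

```matlab
load fisheriris

groups = {'setosa', 'versicolor'};
r = zeros(1, numel(groups));

for k = 1:numel(groups)
    rows = strcmp(species, groups{k});             % observations in this group
    c = corrcoef(meas(rows, 1), meas(rows, 2));    % 2-by-2 correlation matrix
    r(k) = c(1, 2);                                % sepal length vs. sepal width
    fprintf('%s: r = %.2f\n', groups{k}, r(k));
end
```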

© 1984-2012 The MathWorks, Inc.