
Grouped Data

What Are Grouping Variables?

Grouping variables are utility variables that indicate which elements in a data set are to be considered together when computing statistics and creating visualizations. Typically, you use grouping variables for classification, where you either specify the group of each observation or try to infer it. Grouping variables may be numeric vectors, logical vectors, categorical vectors, character arrays, or cell arrays of strings.

Grouping variables have the same length as the variables (columns) in a data set. Observations (rows) i and j are considered to be in the same group if the values of the corresponding grouping variable are identical at those indices. Grouping variables with multiple columns are used to specify different groups within multiple variables.

For example, the following command loads the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species into the workspace:

 load fisheriris % Fisher's iris data (1936)

The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica). To group the observations by species, the following are all acceptable (and equivalent) grouping variables:

group1 = species;          % Cell array of strings
group2 = grp2idx(species); % Numeric vector
group3 = char(species);    % Character array
group4 = nominal(species); % Categorical array

These grouping variables can be supplied as input arguments to any of the functions described in Functions for Grouped Data. Examples are given in Using Grouping Variables.
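
As a quick check that these four representations define the same groups, each one can be passed through grp2idx and the resulting index vectors compared. This is a minimal sketch that assumes the Statistics Toolbox is available; the names idx1 through idx4 and sameGroups are illustrative only and do not appear elsewhere in this documentation.

```matlab
load fisheriris                 % provides meas and species

group1 = species;               % cell array of strings
group2 = grp2idx(species);      % numeric vector
group3 = char(species);         % character array (rows padded with spaces)
group4 = nominal(species);      % categorical array

% Convert each grouping variable to integer group indices
idx1 = grp2idx(group1);
idx2 = grp2idx(group2);
idx3 = grp2idx(group3);
idx4 = grp2idx(group4);

% True if all four representations give every observation the same index
sameGroups = isequal(idx1,idx2,idx3,idx4)
```

For this data set the group numbering is expected to agree because the species already appear in alphabetical order. With other data, the numbering can differ between representations even when the grouping itself is the same.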

Representing Missing Data

Your data can have missing entries. For grouping variables, represent this missing data as in the table.

Data Type               Missing Entry
Numeric vector          NaN
Categorical vector      <undefined>
Character array         Row of spaces
Cell array of strings   ''
Logical vector          (Cannot represent)
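
The following sketch illustrates how missing entries in a grouping variable might be detected and how they affect a grouped statistic. The vectors gnum, gstr, and x are made-up illustrative data, and the sketch assumes that grp2idx returns NaN for a missing group and that grpstats leaves out observations whose group is missing.

```matlab
% Numeric grouping variable with one missing entry (NaN)
gnum = [1; 1; NaN; 2];
% Cell array of strings with one missing entry ('')
gstr = {'a'; 'a'; ''; 'b'};

idxNum = grp2idx(gnum);   % group index for each observation
idxStr = grp2idx(gstr);

% Logical masks marking the observations that belong to no group
missingNum = isnan(idxNum);
missingStr = isnan(idxStr);

% Group means of some illustrative data; the third value has no group
x = [10; 20; 30; 40];
means = grpstats(x,gnum)
```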

Group Definition

Each level of a grouping variable defines a group. The levels and their order depend on the data type of the grouping variable.
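
One way to inspect the levels and their order for a given grouping variable is the second output of grp2idx. The sketch below uses small made-up vectors. It assumes the behavior described on the grp2idx reference page, where numeric levels are listed in sorted order and string levels in order of first appearance; check that page for the release you are using.

```matlab
% Numeric grouping variable: levels are expected in sorted order
gnum = [3; 1; 2; 1];
[idxNum,namesNum] = grp2idx(gnum);   % namesNum holds the levels as strings

% Cell array of strings: levels are expected in order of first appearance
gstr = {'b'; 'a'; 'b'; 'c'};
[idxStr,namesStr] = grp2idx(gstr);

namesNum
namesStr
```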

Some functions, such as grpstats, can take a cell array of several grouping variables (such as {G1 G2 G3}) to group the observations in the data set by each combination of the grouping variable levels. The order is decided first by the order of the first grouping variable, then by the order of the second grouping variable, and so on.
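
Grouping by a combination of variables might look like the following sketch. It uses the Fisher iris data together with a hypothetical second grouping variable, longSepal, which is invented here for illustration and is not part of the data set.

```matlab
load fisheriris                 % provides meas and species

% Hypothetical second grouping variable: is the sepal length above 5.8?
longSepal = meas(:,1) > 5.8;

% Group names, counts, and mean petal length for each combination
% of species and longSepal
[names,counts,means] = grpstats(meas(:,3),{species,longSepal}, ...
                                {'gname','numel','mean'})
```

Here names has one column per grouping variable, so each row identifies one combination of levels.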

Functions for Grouped Data

The following table lists general Statistics Toolbox functions that accept a grouping variable group as an input argument. As described in What Are Grouping Variables?, the grouping variable may be in the form of a numeric or logical vector, character array, cell array of strings, or categorical array.

For a full description of the syntax of any particular function, and examples of its use, consult its reference page, linked from the table. Using Grouping Variables also includes examples.

Function      Basic Syntax for Grouped Data
gplotmatrix   gplotmatrix(x,y,group)
grp2idx       [G,GN] = grp2idx(group)
grpstats      means = grpstats(X,group)
gscatter      gscatter(x,y,group)
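
As a brief illustration of the basic syntaxes in the table, the calls below apply each function to the Fisher iris data with species as the grouping variable. The choice of columns passed to the plotting functions is arbitrary and made only for this sketch.

```matlab
load fisheriris                 % provides meas and species

[G,GN] = grp2idx(species);      % G: group index per row, GN: group names
means  = grpstats(meas,species) % one row of column means per group

% The plotting functions take the same grouping variable
gscatter(meas(:,3),meas(:,4),species)
figure
gplotmatrix(meas(:,1:2),meas(:,3:4),species)
```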

Using Grouping Variables

This section provides an example demonstrating the use of grouping variables and associated functions. Grouping variables are introduced in What Are Grouping Variables?. A list of general functions accepting grouping variables appears in Functions for Grouped Data.

  1. Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species:

     load fisheriris % Fisher's iris data (1936)

    The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica).

  2. Compute some basic statistics for the data (median and interquartile range), by group, using the grpstats function:

    [order,number,group_median,group_iqr] = ...
       grpstats(meas,species,{'gname','numel',@median,@iqr})
    
    order = 
        'setosa'
        'versicolor'
        'virginica'
    
    number =
        50    50    50    50
        50    50    50    50
        50    50    50    50
    
    group_median =
        5.0000    3.4000    1.5000    0.2000
        5.9000    2.8000    4.3500    1.3000
        6.5000    3.0000    5.5500    2.0000
    
    group_iqr =
        0.4000    0.5000    0.2000    0.1000
        0.7000    0.5000    0.6000    0.3000
        0.7000    0.4000    0.8000    0.5000

    The statistics appear in 3-by-4 arrays, corresponding to the 3 groups (categories) and 4 variables in the data. The rows follow the order of the group names returned in order.

  3. You can use grouping variables to create visualizations based on categories of observations. The following scatter plot, created with the gscatter function, shows the correlation between sepal length and sepal width in two species of iris. Use ismember to subset the two species from species:

    subset = ismember(species,{'setosa','versicolor'});
    scattergroup = species(subset);
    gscatter(meas(subset,1),...
             meas(subset,2),...
             scattergroup)
    xlabel('Sepal Length')
    ylabel('Sepal Width')
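
To put a number on the relationship that the scatter plot shows, the correlation coefficient can be computed separately for each group. This is a minimal sketch, not part of the original example: it uses grp2idx to index the groups and the base MATLAB function corrcoef, and the variable names idx, names, and r are illustrative.

```matlab
load fisheriris                 % provides meas and species

subset = ismember(species,{'setosa','versicolor'});
[idx,names] = grp2idx(species(subset));  % group index and names for the subset
x = meas(subset,1);             % sepal length
y = meas(subset,2);             % sepal width

% Correlation between sepal length and sepal width within each group
r = zeros(numel(names),1);
for k = 1:numel(names)
    c = corrcoef(x(idx == k),y(idx == k));
    r(k) = c(1,2);              % off-diagonal entry is the correlation
end
r
```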

  


 © 1984-2012 The MathWorks, Inc.