Statistical Arrays

Introduction

As discussed in MATLAB® Arrays, MATLAB® data types include arrays for numerical, logical, and character data, as well as cell and structure arrays for heterogeneous collections of data.

Statistics Toolbox™ software offers two additional types of arrays specifically designed for statistical data:

Categorical arrays store data with values in a discrete set of levels. Each level is meant to capture a single, defining characteristic of an observation. If no ordering is encoded in the levels, the data and the array are nominal. If an ordering is encoded, the data and the array are ordinal.

Categorical arrays also store labels for the levels. Nominal labels typically suggest the type of an observation, while ordinal labels suggest the position or rank.

Dataset arrays collect heterogeneous statistical data and metadata, including categorical data, into a single container variable. Like the numerical matrices discussed in Numerical Data, dataset arrays can be viewed as tables of values, with rows representing different observations and columns representing different measured variables. Like the cell and structure arrays discussed in Heterogeneous Data, dataset arrays can accommodate variables of different types, sizes, units, etc.

Dataset arrays combine the organizational advantages of these basic MATLAB data types while addressing their shortcomings with respect to storing complex statistical data.

Both categorical and dataset arrays have associated methods for assembling, accessing, manipulating, and processing the collected data. Basic array operations parallel those for numerical, cell, and structure arrays.

Categorical Arrays

Categorical Data

Categorical data take on values from only a finite, discrete set of categories or levels. Levels may be determined before the data are collected, based on the application, or they may be determined by the distinct values in the data when converting them to categorical form. Predetermined levels, such as a set of states or numerical intervals, are independent of the data they contain. A given level may be attained by any number of values in the data, or by none at all. Categorical data show which measured values share common levels, and which do not.

Levels may have associated labels. Labels typically express a defining characteristic of an observation, captured by its level.

If no ordering is encoded in the levels, the data are nominal. Nominal labels typically indicate the type of an observation. Examples of nominal labels are {false, true}, {male, female}, and {Afghanistan, ... , Zimbabwe}. For nominal data, the numeric or lexicographic order of the labels is irrelevant—Afghanistan is not considered to be less than, equal to, or greater than Zimbabwe.

If an ordering is encoded in the levels—for example, if levels labeled "red", "green", and "blue" represent wavelengths—the data are ordinal. Labels for ordinal levels typically indicate the position or rank of an observation. Examples of ordinal labels are {0, 1}, {mm, cm, m, km}, and {poor, satisfactory, outstanding}. The ordering of the levels may or may not correspond to the numeric or lexicographic order of the labels.

Categorical Arrays

Categorical data can be represented using MATLAB integer arrays, but this method has a number of drawbacks. First, it removes all of the useful metadata that might be captured in labels for the levels. Labels must be stored separately, in character arrays or cell arrays of strings. Secondly, this method suggests that values stored in the integer array have their usual numeric meaning, which, for categorical data, they may not. Finally, integer types have a fixed set of levels (for example, -128:127 for all int8 arrays), which cannot be changed.
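
The drawbacks above can be seen in a minimal sketch that uses only core MATLAB. The variable names codes, labels, obsLabels, and meaningless are hypothetical, chosen for illustration:

```matlab
% Integer codes for five observations; the labels live in a separate variable.
codes  = int8([1 2 2 3 1]);
labels = {'setosa','versicolor','virginica'};

% Recovering labels requires a manual lookup into the separate cell array.
obsLabels = labels(codes)

% Arithmetic on the codes is permitted, although the result has no
% categorical meaning.
meaningless = codes(2) + codes(4)
```

Nothing ties codes to labels, so the two variables must be kept in sync by hand.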

Categorical arrays, available in Statistics Toolbox software, are specifically designed for storing, manipulating, and processing categorical data and metadata. Unlike integer arrays, each categorical array has its own set of levels, which can be changed. Categorical arrays also accommodate labels for levels in a natural way. Like numerical arrays, categorical arrays take on different shapes and sizes, from scalars to N-D arrays.
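
As a brief sketch of levels that belong to the array and can be changed, the following builds a small nominal array from hypothetical data and adds a level that no observation attains. It assumes the addlevels and levelcounts methods of the categorical classes; the names c and counts are introduced here for illustration:

```matlab
% Hypothetical data: three observations over two distinct values.
c = nominal({'a';'b';'a'});
getlabels(c)          % labels taken from the distinct values in the data

% Add a level that is not present in the data.
c = addlevels(c,{'d'});
getlabels(c)          % the new level is now part of the array

% Count the observations at each level; the added level has none.
counts = levelcounts(c)
```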

Organizing data in a categorical array can be an end in itself. Often, however, categorical arrays are used for further statistical processing. They can be used to index into other variables, creating subsets of data based on the category of observation, or they can be used with statistical functions that accept categorical inputs. For examples, see Grouped Data.

Categorical arrays come in two types, depending on whether the collected data are understood to be nominal or ordinal. Nominal arrays are constructed with nominal; ordinal arrays are constructed with ordinal. For example,

load fisheriris
ndata = nominal(species,{'A','B','C'});

creates a nominal array with levels A, B, and C from the species data in fisheriris.mat, while

odata = ordinal(ndata,{},{'C','A','B'});

encodes an ordering of the levels with C < A < B. See Using Categorical Arrays, and the reference pages for nominal and ordinal, for further examples.
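
One way to confirm the encoded ordering, sketched here under the assumption that ndata and odata were created as above, is to list the labels in level order and compare individual elements with relational operators:

```matlab
% Labels are returned in the order of the levels.
getlabels(odata)

% In fisheriris.mat, observation 1 is setosa (level A), observation 51 is
% versicolor (level B), and observation 101 is virginica (level C).
odata(101) < odata(1)    % tests C < A
odata(1) < odata(51)     % tests A < B
```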

Categorical arrays are implemented as objects of the @categorical class. The class is abstract, defining properties and methods common to both the @nominal and @ordinal classes. Use the corresponding constructors, nominal or ordinal, to create categorical arrays. Methods of the classes are used to display, summarize, convert, concatenate, and access the collected data. Many of these methods are invoked using operations analogous to those for numerical arrays, and do not need to be called directly (for example, [] invokes horzcat). Other methods, such as reorderlevels, must be called directly.

Using Categorical Arrays

This section provides an extended tutorial example demonstrating the use of categorical arrays with methods of the @nominal and @ordinal classes.

Constructing Categorical Arrays.   Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species:

 load fisheriris % Fisher's iris data (1936)

The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica).

Use nominal to create a nominal array from species:

 n1 = nominal(species);

Open species and n1 side by side in the Variable Editor (see Viewing and Editing Workspace Variables with the Variable Editor). Note that the string information in species has been converted to categorical form, leaving only information on which data share the same values, indicated by the labels for the levels.
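
The conversion can also be inspected from the command line. The following sketch assumes the cellstr and double conversion methods of the categorical classes; the names labelsBack and codes are introduced here for illustration:

```matlab
% Convert the nominal array back to a cell array of strings.
labelsBack = cellstr(n1);
isequal(labelsBack,species)   % checks that the labels reproduce the strings

% Convert to the underlying integer level codes.
codes = double(n1);
unique(codes)'                % one code for each level present in the data
```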

By default, levels are labeled with the distinct values in the data (in this case, the strings in species). Alternate labels are given with additional input arguments to the nominal constructor:

 n2 = nominal(species,{'species1','species2','species3'});

Open n2 in the Variable Editor, and compare it with species and n1. The levels have been relabeled.

Suppose that the data are considered to be ordinal. A characteristic of the data that is not reflected in the labels is the diploid chromosome count, which orders the levels corresponding to the three species as follows:

species1 < species3 < species2

Use ordinal to cast n2 as an ordinal array:

o1 = ordinal(n2,{},{'species1','species3','species2'});

The second input argument to ordinal is the same as for nominal—a list of labels for the levels in the data. If it is unspecified, as above, the labels are inherited from the data, in this case n2. The third input argument of ordinal indicates the ordering of the levels, in ascending order.
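
For comparison, here is a sketch of building the same ordinal array directly from species, without the intermediate nominal array. It assumes that the third argument may list the values in the data in ascending order, with the second argument giving the label for each of those values. The name o1direct is hypothetical:

```matlab
% Values of species in ascending order of diploid chromosome count, with
% the label to attach to each value.
o1direct = ordinal(species,...
                   {'species1','species3','species2'},...
                   {'setosa','virginica','versicolor'});

getlabels(o1direct)      % labels in level order
all(o1direct == o1)      % elementwise comparison with o1
```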

When displayed side by side in the Variable Editor, o1 does not appear any different from n2. This is because the data in o1 have not been sorted. It is important to recognize the difference between the ordering of the levels in an ordinal array and sorting the actual data according to that ordering. Use sort to sort ordinal data in ascending order:

o2 = sort(o1);

When displayed in the Variable Editor, o2 shows the data sorted by diploid chromosome count.

To find which elements moved up in the sort, use the < operator for ordinal arrays:

moved_up = (o1 < o2);

The operation returns a logical array moved_up, indicating which elements have moved up (the data for species3).
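
A quick check of this result, sketched with core MATLAB functions and the cellstr conversion method (assumed here) of the categorical classes:

```matlab
nnz(moved_up)                    % number of elements that moved up
unique(cellstr(o1(moved_up)))    % their labels in the unsorted array o1
```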

Use getlabels to display the labels for the levels in ascending order:

labels2 = getlabels(o2)
labels2 = 
    'species1'    'species3'    'species2'

The sort function reorders the display of the data, but not the order of the levels. To reorder the levels, use reorderlevels:

o3 = reorderlevels(o2,labels2([1 3 2]));
labels3 = getlabels(o3)
labels3 = 
    'species1'    'species2'    'species3'
o4 = sort(o3);

These operations return the levels in the data to their original ordering, by species number, and then sort the data for display purposes.

Accessing Categorical Arrays.   Categorical arrays are accessed using parenthesis indexing, with syntax that parallels similar operations for numerical arrays (see Numerical Data).

Parenthesis indexing on the right-hand side of an assignment is used to extract the lowest 50 elements from the ordinal array o4:

low50 = o4(1:50);

Suppose you want to categorize the data in o4 with only two levels: low (the data in low50) and high (the rest of the data). One way to do this is to use an assignment with parenthesis indexing on the left-hand side:

o5 = o4; % Copy o4
o5(1:50) = 'low';
Warning: Categorical level 'low' being added.
o5(51:end) = 'high';
Warning: Categorical level 'high' being added.

Note the warnings: the assignments move data to new levels. The old levels, though empty, remain:

getlabels(o5)
ans = 
    'species1' 'species2' 'species3' 'low' 'high'

The old levels are removed using droplevels:

o5 = droplevels(o5,{'species1','species2','species3'});
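
When the levels to remove are simply those that are no longer used, it may be convenient to omit the list. The sketch below assumes that droplevels, called with a single argument, removes all unused levels; o6 is a hypothetical copy, so o5 is left unchanged:

```matlab
o6 = o4;                 % Copy o4
o6(1:50) = 'low';        % warns that level 'low' is being added
o6(51:end) = 'high';     % warns that level 'high' is being added

o6 = droplevels(o6);     % remove every level that has no data
getlabels(o6)
```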

Another approach to creating two categories in o5 from the three categories in o4 is to merge levels, using mergelevels:

o5 = mergelevels(o4,{'species1'},'low');
o5 = mergelevels(o5,{'species2','species3'},'high');

getlabels(o5)
ans = 
    'low'    'high'

The merged levels are removed and replaced with the new levels.

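Combining Categorical Arrays.   Categorical arrays are concatenated using square brackets. Again, the syntax parallels similar operations for numerical arrays (see Numerical Data). There are, however, restrictions, which the examples below illustrate: ordinal arrays can be concatenated only if they have the same levels, in the same order, whereas nominal arrays with different levels can still be combined.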

First use ordinal to create ordinal arrays from the variables for sepal length and sepal width in meas. Categorize the data as short or long depending on whether they are below or above the median of the variable, respectively:

sl = meas(:,1); % Sepal length data
sw = meas(:,2); % Sepal width data
SL1 = ordinal(sl,{'short','long'},[],...
              [min(sl),median(sl),max(sl)]);
SW1 = ordinal(sw,{'short','long'},[],...
              [min(sw),median(sw),max(sw)]);

Because SL1 and SW1 are ordinal arrays with the same levels, in the same order, they can be concatenated:

S1 = [SL1,SW1];
S1(1:10,:)
ans = 
     short      long  
     short      long  
     short      long  
     short      long  
     short      long  
     short      long  
     short      long  
     short      long  
     short      short 
     short      long

The result is an ordinal array S1 with two columns.
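
By contrast, the following sketch attempts to concatenate ordinal arrays whose levels differ. The variable SW3 and the error handling are additions for illustration, and the exact error message depends on the release:

```matlab
% Same binning as SW1, but with different labels for the levels.
SW3 = ordinal(sw,{'skinny','wide'},[],...
              [min(sw),median(sw),max(sw)]);
try
    S3 = [SL1,SW3];      % levels differ, so this is expected to fail
catch err
    disp(err.message)    % report the error instead of stopping
end
```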

If, on the other hand, the measurements are cast as nominal, different levels can be used for the different variables, and the two nominal arrays can still be combined:

SL2 = nominal(sl,{'short','long'},[],...
              [min(sl),median(sl),max(sl)]);
SW2 = nominal(sw,{'skinny','wide'},[],...
              [min(sw),median(sw),max(sw)]);
S2 = [SL2,SW2];
getlabels(S2)
ans = 
    'short' 'long' 'skinny' 'wide'
S2(1:10,:)
ans = 
     short      wide   
     short      wide   
     short      wide   
     short      wide   
     short      wide   
     short      wide   
     short      wide   
     short      wide   
     short      skinny 
     short      wide

Computing with Categorical Arrays.   Categorical arrays are used to index into other variables, creating subsets of data based on the category of observation, and they are used with statistical functions that accept categorical inputs, such as those described in Grouped Data.

Use ismember to create logical variables based on the category of observation. For example, the following creates a logical index the same size as species that is true for observations of iris setosa and false elsewhere. Recall that n1 = nominal(species):

SetosaObs = ismember(n1,'setosa');

Since the code above compares elements of n1 to a single value, the same operation is carried out by the equality operator:

SetosaObs = (n1 == 'setosa');

The SetosaObs variable is used to index into meas to extract only the setosa data:

SetosaData = meas(SetosaObs,:);
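
The same pattern extends to several levels at once. As a sketch, assuming ismember accepts a cell array of level labels, the following selects two species and summarizes them; the names TwoSpecies and TwoSpeciesMeans are hypothetical:

```matlab
% True for observations of either species, false elsewhere.
TwoSpecies = ismember(n1,{'setosa','virginica'});

% Column means of the four measurements over the selected observations.
TwoSpeciesMeans = mean(meas(TwoSpecies,:))
```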

Categorical arrays are also used as grouping variables. The following plot summarizes the sepal length data in meas by category:

boxplot(sl,n1)
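
A numerical counterpart to the plot can be sketched with grpstats from Statistics Toolbox, which accepts a categorical grouping variable. The output names slMeans and groups are hypothetical:

```matlab
% Mean sepal length for each level of n1, together with the group names.
[slMeans,groups] = grpstats(sl,n1,{'mean','gname'})
```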

Dataset Arrays

Statistical Data

MATLAB data containers (variables) are suitable for completely homogeneous data (numeric, character, and logical arrays) and for completely heterogeneous data (cell and structure arrays). Statistical data, however, are often a mixture of homogeneous variables of heterogeneous types and sizes. Dataset arrays are suitable containers for this kind of data.

Dataset arrays can be viewed as tables of values, with rows representing different observations or cases and columns representing different measured variables. In this sense, dataset arrays are analogous to the numerical arrays for statistical data discussed in Numerical Data. Basic methods for creating and manipulating dataset arrays parallel the syntax of corresponding methods for numerical arrays.

While each column of a dataset array must be a variable of a single type, each row may contain an observation consisting of measurements of different types. In this sense, dataset arrays lie somewhere between variables that enforce complete homogeneity on the data and those that enforce nothing. Because of the potentially heterogeneous nature of the data, dataset arrays have indexing methods with syntax that parallels corresponding methods for cell and structure arrays (see Heterogeneous Data).

Dataset Arrays

Dataset arrays are variables created with dataset. For example, the following creates a dataset array from observations that are a combination of categorical and numerical measurements:

load fisheriris
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%d'));
iris = dataset({nominal(species),'species'},...
               {meas,'SL','SW','PL','PW'},...
               'ObsNames',NameObs);
iris(1:5,:)
ans = 
            species    SL     SW     PL     PW 
    Obs1    setosa     5.1    3.5    1.4    0.2
    Obs2    setosa     4.9      3    1.4    0.2
    Obs3    setosa     4.7    3.2    1.3    0.2
    Obs4    setosa     4.6    3.1    1.5    0.2
    Obs5    setosa       5    3.6    1.4    0.2
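
To see the mix of types in this array, one option is to ask for the class of each variable. This is a sketch using datasetfun, which applies a function to each variable of a dataset array; the name varClasses is hypothetical:

```matlab
% One class name per variable, returned in a cell array.
varClasses = datasetfun(@class,iris,'UniformOutput',false)
% species is of class nominal; SL, SW, PL, and PW are of class double
```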

When creating a dataset array, variable names and observation names can be assigned together with the data. Other metadata associated with the array can be assigned with set and accessed with get:

iris = set(iris,'Description','Fisher''s Iris Data');
get(iris)
    Description: 'Fisher's Iris Data'
          Units: {}
       DimNames: {'Observations'  'Variables'}
       UserData: []
       ObsNames: {150x1 cell}
       VarNames: {'species'  'SL'  'SW'  'PL'  'PW'}

Dataset arrays are implemented as objects of the @dataset class. Methods of the class are used to display, summarize, convert, concatenate, and access the collected data. Many of these methods are invoked using operations analogous to those for numerical arrays, and do not need to be called directly (for example, [] invokes horzcat). Other methods, such as sortrows, must be called directly.
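
For example, a sketch of calling sortrows directly to order the observations by sepal length (the name sortedIris is hypothetical):

```matlab
sortedIris = sortrows(iris,'SL');   % ascending sepal length
sortedIris(1:3,:)                   % observations with the shortest sepals
```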

Using Dataset Arrays

This section provides an extended tutorial example demonstrating the use of dataset arrays with methods of the @dataset class.

Constructing Dataset Arrays.   Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species:

load fisheriris % Fisher's iris data (1936)

The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica).

Use dataset to create a dataset array iris from the data, assigning variable names species, SL, SW, PL, and PW and observation names Obs1, Obs2, Obs3, etc.:

NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%d'));
iris = dataset({nominal(species),'species'},...
               {meas,'SL','SW','PL','PW'},...
               'ObsNames',NameObs);
iris(1:5,:)
ans = 
            species    SL     SW     PL     PW 
    Obs1    setosa     5.1    3.5    1.4    0.2
    Obs2    setosa     4.9      3    1.4    0.2
    Obs3    setosa     4.7    3.2    1.3    0.2
    Obs4    setosa     4.6    3.1    1.5    0.2
    Obs5    setosa       5    3.6    1.4    0.2

The cell array of strings species is first converted to a categorical array of type nominal before inclusion in the dataset array. For further information on categorical arrays, see Categorical Arrays.

Use set to set properties of the array:

desc = 'Fisher''s iris data (1936)';
units = [{''} repmat({'cm'},1,4)];
info = 'http://en.wikipedia.org/wiki/R.A._Fisher';

iris = set(iris,'Description',desc,...
                'Units',units,...
                'UserData',info);

Use get to view properties of the array:

get(iris)
    Description: 'Fisher's iris data (1936)'
          Units: {''  'cm'  'cm'  'cm'  'cm'}
       DimNames: {'Observations'  'Variables'}
       UserData: 'http://en.wikipedia.org/wiki/R.A._Fisher'
       ObsNames: {150x1 cell}
       VarNames: {'species'  'SL'  'SW'  'PL'  'PW'}

get(iris(1:5,:),'ObsNames')
ans = 
    'Obs1'
    'Obs2'
    'Obs3'
    'Obs4'
    'Obs5'

For a table of accessible properties of dataset arrays, with descriptions, see @dataset.

Accessing Dataset Arrays.   Dataset arrays support multiple types of indexing. Like the numerical matrices described in Numerical Data, parenthesis () indexing is used to access data subsets. Like the cell and structure arrays described in Heterogeneous Data, dot . indexing is used to access data variables and curly brace {} indexing is used to access data elements.

Use parenthesis indexing to assign a subset of the data in iris to a new dataset array iris1:

iris1 = iris(1:5,2:3)
iris1 = 
            SL     SW 
    Obs1    5.1    3.5
    Obs2    4.9      3
    Obs3    4.7    3.2
    Obs4    4.6    3.1
    Obs5      5    3.6  

Similarly, use parenthesis indexing to assign new data to the first variable in iris1:

iris1(:,1) = dataset([5.2;4.9;4.6;4.6;5])
iris1 = 
            SL     SW 
    Obs1    5.2    3.5
    Obs2    4.9      3
    Obs3    4.6    3.2
    Obs4    4.6    3.1
    Obs5      5    3.6

Variable and observation names can also be used to access data:

SepalObs = iris1({'Obs1','Obs3','Obs5'},'SL')
SepalObs = 
            SL 
    Obs1    5.2
    Obs3    4.6
    Obs5      5

Dot indexing is used to access variables in a dataset array, and can be combined with other indexing methods. For example, apply zscore to the data in SepalObs as follows:

ScaledSepalObs = zscore(iris1.SL([1 3 5]))
ScaledSepalObs =
    0.8729
   -1.0911
    0.2182

The following code extracts the sepal lengths in iris1 corresponding to sepal widths greater than 3:

BigSWLengths = iris1.SL(iris1.SW > 3)
BigSWLengths =
    5.2000
    4.6000
    4.6000
    5.0000

Dot indexing also allows entire variables to be deleted from a dataset array:

iris1.SL = []
iris1 = 
            SW 
    Obs1    3.5
    Obs2      3
    Obs3    3.2
    Obs4    3.1
    Obs5    3.6
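
Assigning to a name that does not yet exist works in the other direction and adds a variable. Here is a minimal sketch on a separate subset, so that iris1 is not affected; the names irisArea and SepalArea are hypothetical, and the product is only a rough length-by-width figure:

```matlab
irisArea = iris(1:5,{'SL','SW'});

% Dot assignment to a new name appends a third variable.
irisArea.SepalArea = irisArea.SL .* irisArea.SW
```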

Dynamic variable naming works for dataset arrays just as it does for structure arrays. For example, the units of the SW variable are changed in iris1 as follows:

varname = 'SW';
iris1.(varname) = iris1.(varname)*10
iris1 = 
            SW
    Obs1    35
    Obs2    30
    Obs3    32
    Obs4    31
    Obs5    36
iris1 = set(iris1,'Units',{'mm'});

Curly brace indexing is used to access individual data elements. The following are equivalent:

iris1{1,1}
ans =
    35

iris1{'Obs1','SW'}
ans =
    35

Combining Dataset Arrays.   Combine two dataset arrays into a single dataset array using square brackets:

SepalData = iris(:,{'SL','SW'});
PetalData = iris(:,{'PL','PW'});
newiris = [SepalData,PetalData];
size(newiris)
ans =
   150   4

For horizontal concatenation, as in the preceding example, the number of observations in the two dataset arrays must agree. Observations are matched up by name (if given), regardless of their order in the two data sets.

The following concatenates variables within a dataset array and then deletes the component variables:

newiris.SepalData = [newiris.SL,newiris.SW];
newiris.PetalData = [newiris.PL,newiris.PW];
newiris(:,{'SL','SW','PL','PW'}) = [];
size(newiris)
ans =
   150   2
size(newiris.SepalData)
ans =
   150   2

newiris is now a 150-by-2 dataset array containing two 150-by-2 numerical arrays as variables.

Vertical concatenation is also handled in a manner analogous to numerical arrays:

newobs = dataset({[5.3 4.2; 5.0 4.1],'PetalData'},...
                 {[5.5 2; 4.8 2.1],'SepalData'});
newiris = [newiris;newobs];
size(newiris)
ans =
   152     2

For vertical concatenation, as in the preceding example, the names of the variables in the two dataset arrays must agree. Variables are matched up by name, regardless of their order in the two data sets.

A dataset array can also be extended by assigning directly to new rows:

newiris(153,:) = dataset({[5.1 4.0],'PetalData'},...
                         {[5.1 4.2],'SepalData'});

A different type of concatenation is performed by join, which takes the data in one dataset array and assigns it to the rows of another dataset array, based on matching values in a common key variable. For example, the following creates a dataset array with diploid chromosome counts for each species of iris:

snames = nominal({'setosa';'versicolor';'virginica'});
CC = dataset({snames,'species'},{[38;108;70],'cc'})
CC = 
    species       cc 
    setosa         38
    versicolor    108
    virginica      70

These data are broadcast to the rows of iris using join:

iris2 = join(iris,CC);
iris2([1 2 51 52 101 102],:)
ans = 
           species       SL     SW     PL     PW     cc 
 Obs1      setosa        5.1    3.5    1.4    0.2     38
 Obs2      setosa        4.9      3    1.4    0.2     38
 Obs51     versicolor      7    3.2    4.7    1.4    108
 Obs52     versicolor    6.4    3.2    4.5    1.5    108
 Obs101    virginica     6.3    3.3      6    2.5     70
 Obs102    virginica     5.8    2.7    5.1    1.9     70
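
The key can also be named explicitly, and the joined variable can then be used like any other. This sketch assumes the three-argument form of join; the names iris3 and HighCC are hypothetical:

```matlab
iris3 = join(iris,CC,'species');     % use species as the key variable

% Keep the observations whose species has a chromosome count above 50.
HighCC = iris3(iris3.cc > 50,:);
size(HighCC)
```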

Computing with Dataset Arrays.   Use summary to provide summary statistics for the component variables of a dataset array:

summary(newiris)
Fisher's iris data (1936)
SepalData: [153x2 double]
     min         4.3000           2 
     1st Q       5.1000      2.8000 
     median      5.8000           3 
     3rd Q       6.4000      3.3250 
     max         7.9000      4.4000 
PetalData: [153x2 double]
     min              1      0.1000 
     1st Q       1.6000      0.3000 
     median      4.4000      1.3000 
     3rd Q       5.1000      1.8000 
     max         6.9000      4.2000

To apply other statistical functions, use dot indexing to access relevant variables:

SepalMeans = mean(newiris.SepalData)
SepalMeans =
    5.8294    3.0503

The same result is obtained with datasetfun, which applies functions to dataset array variables:

means = datasetfun(@mean,newiris,'UniformOutput',false)
means = 
    [1x2 double]    [1x2 double]
SepalMeans = means{1}
SepalMeans =
    5.8294    3.0503
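
When only some variables are numeric, as in iris, the function can be restricted to selected variables. The following sketch assumes the 'DataVars' parameter of datasetfun; irisMeans is a hypothetical name:

```matlab
% Skip the nominal variable species and average the four measurements.
irisMeans = datasetfun(@mean,iris,'DataVars',{'SL','SW','PL','PW'})
```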

An alternative approach is to cast data in a dataset array as double and apply statistical functions directly. Compare the following two methods for computing the covariance of the length and width of the SepalData in newiris:

covs = datasetfun(@cov,newiris,'UniformOutput',false)
covs = 
    [2x2 double]    [2x2 double]
SepalCovs = covs{1}
SepalCovs =
    0.6835   -0.0373
   -0.0373    0.2054

SepalCovs = cov(double(newiris(:,1)))
SepalCovs =
    0.6835   -0.0373
   -0.0373    0.2054
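
The cast to double also works across several single-column variables. The following sketch uses the original iris array; the names X and SepalPetalCorr are hypothetical:

```matlab
% Concatenate the four measurement variables into one numerical array.
X = double(iris(:,{'SL','SW','PL','PW'}));
isequal(X,meas)                  % checks that X matches the original meas

% Statistical functions for numerical arrays then apply directly.
SepalPetalCorr = corrcoef(X)
```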
  


© 1984-2008 The MathWorks, Inc.