When performing regression analysis, it is common to include both continuous and categorical (quantitative and qualitative) predictor variables. When including a categorical independent variable, it is important not to input the variable as a numeric array. Numeric arrays have both order and magnitude. A categorical variable might have order (for example, an ordinal variable), but it does not have magnitude. Using a numeric array implies a known "distance" between the categories.
The appropriate way to include categorical predictors is as dummy indicator variables. An indicator variable has values 0 and 1. A categorical variable with c categories can be represented by c – 1 indicator variables.
For example, suppose you have a categorical variable with levels {Small,Medium,Large}. You can represent this variable using two dummy variables, as shown in this figure.
In this example, X_{1} is a dummy variable that has value 1 for the Medium group, and 0 otherwise. X_{2} is a dummy variable that has value 1 for the Large group, and 0 otherwise. Together, these two variables represent the three categories. Observations in the Small group have 0s for both dummy variables.
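The coding described above can also be built by hand. The following is a minimal sketch using only base MATLAB functions; the variable names sizes, X1, X2, and D, and the five observations, are hypothetical and chosen only for illustration.

```matlab
% Hypothetical observations of the categorical variable (illustration only)
sizes = {'Small';'Medium';'Large';'Medium';'Small'};

% X1 is 1 for the Medium group and 0 otherwise
X1 = double(strcmp(sizes,'Medium'));

% X2 is 1 for the Large group and 0 otherwise
X2 = double(strcmp(sizes,'Large'));

% One row per observation; rows for Small observations are all zeros
D = [X1 X2];
```

In this sketch, each Small observation produces the row [0 0], each Medium observation the row [1 0], and each Large observation the row [0 1].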
The category represented by all 0s is the reference group. When you include the dummy variables in a regression model, the coefficients of the dummy variables are interpreted with respect to the reference group.
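To make this interpretation concrete, consider a linear model that contains only the two dummy variables from the Small, Medium, Large example. The response y, the coefficients \beta_0, \beta_1, \beta_2, and the zero-mean error term \varepsilon are notation assumed here for illustration; they do not appear in the original text.

```latex
\[
y = \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \varepsilon
\]
\[
\begin{aligned}
E[y \mid \text{Small}]  &= \beta_0 \\
E[y \mid \text{Medium}] &= \beta_0 + \beta_1 \\
E[y \mid \text{Large}]  &= \beta_0 + \beta_2
\end{aligned}
\]
```

Under these assumptions, \beta_1 is the expected difference in the response between the Medium group and the Small reference group, and \beta_2 is the expected difference between the Large group and the Small reference group.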
The regression fitting functions, fitlm, fitglm, and fitnlm, recognize categorical array inputs as categorical predictors. That is, if you input your categorical predictor as a nominal or ordinal array, the fitting function automatically creates the required dummy variables. The first level returned by getlevels is the reference group. To use a different reference group, use reorderlevels to change the level order.
If there are c unique levels in the categorical array, then the fitting function estimates c – 1 regression coefficients for the categorical predictor.
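As a rough sketch of this workflow, the following code fits a linear model with a three-level nominal predictor after moving Small to the front of the level order. The data values and the variable names labels, y, sz, tbl, and mdl are hypothetical. The sketch assumes Statistics and Machine Learning Toolbox in a release that supports tables and still provides nominal arrays.

```matlab
% Hypothetical data (illustration only): one size label and one response per observation
labels = {'Small';'Medium';'Large';'Medium';'Small';'Large';'Small';'Medium'};
y = [10; 15; 21; 14; 11; 20; 9; 16];

% Nominal arrays order their levels alphabetically by default (Large, Medium, Small)
sz = nominal(labels);
defaultLevels = getlevels(sz);

% Move Small to the first position so that it becomes the reference group
sz = reorderlevels(sz, {'Small','Medium','Large'});

% fitlm creates the dummy variables for the nominal predictor automatically
tbl = table(sz, y);
mdl = fitlm(tbl, 'y ~ sz');

% With c = 3 levels, expect an intercept plus c - 1 = 2 dummy variable coefficients
numCoef = mdl.NumCoefficients;
```

Without the call to reorderlevels, Large would be the first level and therefore the reference group in this sketch.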
Note: The fitting functions use every level of the categorical array returned by getlevels.
If you prefer to create your own dummy variable design matrix, use dummyvar. This function accepts a numeric or categorical column vector, and returns a matrix of indicator variables. The dummy variable design matrix has a column for every group, and a row for every observation.
For example,
    gender = nominal({'Male';'Female';'Female';'Male';'Female'});
    dv = dummyvar(gender)
    dv =
         0     1
         1     0
         1     0
         0     1
         1     0
The matrix dv has five rows, one for each observation in gender, and two columns for the unique groups, Female and Male. Column order corresponds to the order of the levels in gender.
For nominal arrays, the default order is ascending alphabetical.
To use these dummy variables in a regression model, you must either delete a column (to create a reference group), or fit a regression model with no intercept term. For the gender example, only one dummy variable is needed to represent two genders. Notice what happens if you add an intercept term to the complete design matrix, dv.
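    X = [ones(5,1) dv]
    X =
         1     0     1
         1     1     0
         1     1     0
         1     0     1
         1     1     0
    rank(X)
    ans =
         2
The design matrix X has three columns but rank 2, so it does not have full column rank: the intercept column equals the sum of the two dummy variable columns.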
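The two remedies named above (deleting a column, or omitting the intercept) can be sketched as follows. The variable names Xref and Xnoint are hypothetical; the rest reuses the gender example and assumes that dummyvar and nominal are available.

```matlab
gender = nominal({'Male';'Female';'Female';'Male';'Female'});
dv = dummyvar(gender);              % columns: Female, Male

% Option 1: keep the intercept and drop the first dummy column,
% which makes Female the reference group
Xref = [ones(5,1) dv(:,2:end)];
rankRef = rank(Xref);               % equals the number of columns of Xref

% Option 2: keep every dummy column and omit the intercept
Xnoint = dv;
rankNoint = rank(Xnoint);           % equals the number of columns of Xnoint
```

In both cases the design matrix has two columns and rank 2, so the linear dependence seen with the full matrix plus an intercept no longer occurs.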
See also: dummyvar | fitglm | fitlm | fitnlm | nominal | ordinal