When performing regression analysis, it is common to include both continuous and categorical (quantitative and qualitative) predictor variables. When including a categorical independent variable, it is important not to input the variable as a numeric array. Numeric arrays have both order and magnitude. A categorical variable might have order (for example, an ordinal variable), but it does not have magnitude. Using a numeric array implies a known "distance" between the categories.
The appropriate way to include categorical predictors is as dummy indicator variables. An indicator variable has values 0 and 1. A categorical variable with c categories can be represented by c – 1 indicator variables.
For example, suppose you have a categorical variable with levels {Small,Medium,Large}. You can represent this variable using two dummy variables, X_{1} and X_{2}, coded as follows.
Level     X_{1}   X_{2}
Small     0       0
Medium    1       0
Large     0       1
In this example, X_{1} is a dummy variable that has value 1 for the Medium group, and 0 otherwise. X_{2} is a dummy variable that has value 1 for the Large group, and 0 otherwise. Together, these two variables represent the three categories. Observations in the Small group have 0s for both dummy variables.
The category represented by all 0s is the reference group. When you include the dummy variables in a regression model, the coefficients of the dummy variables are interpreted with respect to the reference group.
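As a minimal sketch of this coding scheme, the two indicator columns can be built by hand in base MATLAB with strcmp. The observation vector sizes below is hypothetical, made up only to illustrate the layout.

```matlab
% Hypothetical observations (invented for illustration).
sizes = {'Small'; 'Medium'; 'Large'; 'Medium'; 'Small'};

% X1 flags the Medium group, X2 flags the Large group.
X1 = double(strcmp(sizes, 'Medium'));
X2 = double(strcmp(sizes, 'Large'));

% One row per observation, one column per dummy variable.
D = [X1 X2];

% Rows that are all zeros belong to the reference group, Small.
isReference = ~any(D, 2);
```

Here the first and last rows of D are all zeros, so those observations fall in the reference group.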
The regression fitting functions, fitlm, fitglm, and fitnlm, recognize categorical array inputs as categorical predictors. That is, if you input your categorical predictor as a nominal or ordinal array, the fitting function automatically creates the required dummy variables. The first level returned by getlevels is the reference group. To use a different reference group, use reorderlevels to change the level order.
If there are c unique levels in the categorical array, then the fitting function estimates c – 1 regression coefficients for the categorical predictor.
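The following sketch illustrates this behavior under stated assumptions: the response values are invented toy data, the variable names Size and y are hypothetical, and the data is passed as a table, which requires a release that supports tables. It reorders the levels so that Small is the reference group, then fits a linear model with fitlm.

```matlab
% Toy data, invented for illustration only.
sz = nominal({'Small'; 'Medium'; 'Large'; 'Small'; ...
              'Medium'; 'Large'; 'Small'; 'Large'});
y  = [1.0; 2.1; 3.2; 0.9; 1.8; 3.1; 1.1; 2.9];

% The default level order of a nominal array is alphabetical,
% so Large would be the reference group unless the levels are reordered.
sz = reorderlevels(sz, {'Small', 'Medium', 'Large'});

% Assumption: data supplied as a table with hypothetical variable names.
tbl = table(sz, y, 'VariableNames', {'Size', 'y'});
mdl = fitlm(tbl, 'y ~ Size');

% With c = 3 levels, the model has an intercept plus c - 1 = 2
% coefficients for the categorical predictor.
nCoef = mdl.NumCoefficients;
coefNames = mdl.CoefficientNames;
```

With Small as the reference group, the intercept estimates the mean response of the Small observations, and the two remaining coefficients estimate the differences of the Medium and Large group means from it.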
Note: The fitting functions use every level of the categorical array returned by getlevels, even if there are levels with no observations. To remove levels from the categorical array, use droplevels.
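A short sketch of how an empty level can arise and be removed, using a hypothetical nominal array: subsetting keeps the original level list, so the unused level stays until droplevels is called.

```matlab
% Hypothetical data (invented for illustration).
sz = nominal({'Small'; 'Medium'; 'Large'; 'Small'});

% Keep only the observations that are not Large.
keep = ~(sz == 'Large');
sub  = sz(keep);

% Large is still listed as a level even though no observations remain.
nBefore = numel(getlabels(sub));

% Remove the levels that have no observations.
sub    = droplevels(sub);
nAfter = numel(getlabels(sub));
```

Dropping the empty level before fitting avoids estimating a coefficient for a group that has no data.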
If you prefer to create your own dummy variable design matrix, use dummyvar. This function accepts a numeric or categorical column vector, and returns a matrix of indicator variables. The dummy variable design matrix has a column for every group, and a row for every observation.
For example,
gender = nominal({'Male';'Female';'Female';'Male';'Female'});
dv = dummyvar(gender)
dv =
     0     1
     1     0
     1     0
     0     1
     1     0
There are five rows corresponding to the number of rows in gender, and two columns for the unique groups, Female and Male. Column order corresponds to the order of the levels in gender. For nominal arrays, the default order is ascending alphabetical.
To use these dummy variables in a regression model, you must either delete a column (to create a reference group), or fit a regression model with no intercept term. For the gender example, only one dummy variable is needed to represent two genders. Notice what happens if you add an intercept term to the complete design matrix, dv.
X = [ones(5,1) dv]
X =
     1     0     1
     1     1     0
     1     1     0
     1     0     1
     1     1     0
rank(X)
ans = 2
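The design matrix with an intercept term is not of full column rank, because the intercept column equals the sum of the two dummy variable columns. As a result, X'X is not invertible and the least-squares coefficients are not uniquely determined. Because of this linear dependence, use only c – 1 indicator variables to represent a categorical variable with c categories in a regression model with an intercept term.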
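As a sketch of the two remedies for the gender example, the code below compares the rank of the complete design matrix with the rank after dropping one dummy column, and with the rank of the dummy columns alone (the no-intercept case). The names Xfull, Xref, and Xnoint are introduced here for illustration.

```matlab
gender = nominal({'Male'; 'Female'; 'Female'; 'Male'; 'Female'});
dv = dummyvar(gender);          % columns: Female, Male

% Intercept plus both dummy columns: rank deficient.
Xfull = [ones(5,1) dv];

% Remedy 1: drop the Female column, making Female the reference group.
Xref = [ones(5,1) dv(:,2)];

% Remedy 2: keep both dummy columns and omit the intercept.
Xnoint = dv;

rankFull  = rank(Xfull);        % fewer than the 3 columns of Xfull
rankRef   = rank(Xref);         % equals the 2 columns of Xref
rankNoInt = rank(Xnoint);       % equals the 2 columns of Xnoint
```

Both remedies give a design matrix with full column rank. They differ only in how the coefficients are read: with a reference group, the dummy coefficient is a difference from that group, while without an intercept each coefficient corresponds to its own group.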
See Also: dummyvar, fitglm, fitlm, fitnlm, nominal, ordinal