MATLAB Examples

Categorize Numeric Data

This example shows how to categorize numeric data into a categorical ordinal array using ordinal. This is useful for discretizing continuous data.


Load sample data.

The dataset array, hospital, contains variables measured on a sample of patients. Compute the minimum, median, and maximum of the variable Age.

load hospital
ans =

    25    39    50

The patient ages range from 25 to 50.
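
The command that produced the summary above is not shown. A minimal sketch that reproduces it, assuming the hospital sample data set from Statistics and Machine Learning Toolbox is on the path, could look like this. The use of quantile here is an assumption; separate calls to min, median, and max would give the same three values.

```matlab
% Assumes the hospital sample data set is available.
load hospital

% Quantiles at p = 0, 0.5, and 1 are the minimum, median, and maximum.
% (quantile is an assumed choice; the original call is not shown.)
ageSummary = quantile(hospital.Age,[0 0.5 1])
```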

Convert a numeric array to an ordinal array.

Group patients into three age categories: Under 30, 30-39, and Over 40.

hospital.AgeCat = ordinal(hospital.Age,{'Under 30','30-39','Over 40'},...
                       [],[25,30,40,50]);
ans = 

  1x3 ordinal array

     Under 30      30-39      Over 40 

The last input argument to ordinal specifies the endpoints of the categories. The first category begins at age 25, the second at age 30, and the third at age 40. Despite its label, the last category contains ages 40 and above: it begins at 40 and ends at 50, the maximum age in the data set. Specifying three categories therefore requires four endpoints, because the last endpoint is the upper bound of the last category.
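
The endpoint rule described above can be checked on a handful of values. The sketch below is illustrative only: the vector ages is hypothetical test input chosen to sit on and next to the endpoints, and the empty third argument assumes the ordinal(X,labels,levels,edges) calling form. getlevels lists the levels in order, which matches the form of the level listing shown earlier.

```matlab
edges  = [25 30 40 50];                     % four endpoints for three categories
labels = {'Under 30','30-39','Over 40'};
ages   = [25 29 30 39 40 50];               % hypothetical values on and near the endpoints

% The empty third argument leaves the level list unspecified, so the
% categories are defined by binning with the endpoints in edges.
ageCat = ordinal(ages,labels,[],edges);

% A value equal to a left endpoint falls in the category that starts there;
% the last category also includes the final endpoint.
assigned = cellstr(ageCat)
getlevels(ageCat)
```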

Explore categories.

Display the age and age category for the second patient.

ans = 

    Age    AgeCategory
    43     Over 40    

When you discretize a numeric array into categories, the categorical array loses all information about the actual numeric values. In this example, AgeCat is not numeric, and you cannot recover the raw data values from it.
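
The display for the second patient appears without its command. One way to produce a display of that shape, and to see the loss of numeric information directly, is sketched below. It assumes the hospital data set and the AgeCat variable defined as in this example; the variable names secondPatient and levelCode are introduced here for illustration.

```matlab
load hospital
hospital.AgeCat = ordinal(hospital.Age,{'Under 30','30-39','Over 40'},...
                          [],[25,30,40,50]);

% One-row dataset array; the names Age and AgeCategory match the display above.
secondPatient = dataset({hospital.Age(2),'Age'},...
                        {hospital.AgeCat(2),'AgeCategory'})

% Converting the category to double gives the position of its level in the
% ordering, not the original age.
levelCode = double(hospital.AgeCat(2))
```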

Categorize a numeric array into quartiles.

The variable Weight has weight measurements for the sample patients. Categorize the patient weights into four categories, by quartile.

p = 0:.25:1;
breaks = quantile(hospital.Weight,p);
hospital.WeightQ = ordinal(hospital.Weight,{'Q1','Q2','Q3','Q4'},...
                   [],breaks);
ans = 

  1x4 ordinal array

     Q1      Q2      Q3      Q4 
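
As a quick check on the quartile binning, the following sketch lists the levels and counts the patients in each one. It assumes the hospital data set is available; levelcounts is used here as one convenient way to tally ordinal levels, and the name counts is illustrative. The group counts in the summary table later in this example add up to 25 patients per quartile.

```matlab
load hospital

% Quartile boundaries of Weight, used as the category endpoints.
breaks = quantile(hospital.Weight,0:.25:1);
hospital.WeightQ = ordinal(hospital.Weight,{'Q1','Q2','Q3','Q4'},[],breaks);

getlevels(hospital.WeightQ)               % levels in order, Q1 through Q4
counts = levelcounts(hospital.WeightQ)    % number of patients per quartile
```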

Explore categories.

Display the weight and weight quartile for the second patient.

ans = 

    Weight    WeightQuartile
    163       Q3            
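
The command behind this display is also not shown. A sketch following the same pattern as for the age category, assuming WeightQ was created from the quartile endpoints as described, might be:

```matlab
load hospital
breaks = quantile(hospital.Weight,0:.25:1);
hospital.WeightQ = ordinal(hospital.Weight,{'Q1','Q2','Q3','Q4'},[],breaks);

% One-row dataset array; the names Weight and WeightQuartile match the display above.
secondWeight = dataset({hospital.Weight(2),'Weight'},...
                       {hospital.WeightQ(2),'WeightQuartile'})
```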

Compute summary statistics grouped by category levels.

Compute the mean systolic and diastolic blood pressure for each age and weight category.

ans = 

                   AgeCat      WeightQ    GroupCount    mean_BloodPressure
    Under 30_Q1    Under 30    Q1          6            123.17      79.667
    Under 30_Q2    Under 30    Q2          3            120.33      79.667
    Under 30_Q3    Under 30    Q3          2             127.5        86.5
    Under 30_Q4    Under 30    Q4          4               122          78
    30-39_Q1       30-39       Q1         12            121.75       81.75
    30-39_Q2       30-39       Q2          9            119.56      82.556
    30-39_Q3       30-39       Q3          9               121      83.222
    30-39_Q4       30-39       Q4         11            125.55      87.273
    Over 40_Q1     Over 40     Q1          7            122.14      84.714
    Over 40_Q2     Over 40     Q2         13            123.38      79.385
    Over 40_Q3     Over 40     Q3         14            123.07      84.643
    Over 40_Q4     Over 40     Q4         10             124.6        85.1

The variable BloodPressure is a matrix with two columns: the first column is systolic blood pressure, and the second is diastolic blood pressure. The group with the highest mean diastolic blood pressure, 87.273, is 30-39_Q4: patients aged 30-39 in the highest weight quartile.
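
The grouped table is shown without the call that produced it. A sketch that yields a table of this form, and that locates the group with the highest mean diastolic pressure programmatically, is given below. It assumes the hospital data set and the AgeCat and WeightQ variables as defined in this example; using grpstats with the 'DataVars' option is an assumed choice consistent with the GroupCount and mean_BloodPressure columns, and the names stats, maxDiastolic, and row are illustrative.

```matlab
load hospital
hospital.AgeCat = ordinal(hospital.Age,{'Under 30','30-39','Over 40'},...
                          [],[25,30,40,50]);
breaks = quantile(hospital.Weight,0:.25:1);
hospital.WeightQ = ordinal(hospital.Weight,{'Q1','Q2','Q3','Q4'},[],breaks);

% Mean of BloodPressure for every combination of age and weight category.
stats = grpstats(hospital,{'AgeCat','WeightQ'},'mean','DataVars','BloodPressure')

% Column 2 of mean_BloodPressure holds the mean diastolic pressure.
[maxDiastolic,row] = max(stats.mean_BloodPressure(:,2));
stats(row,{'AgeCat','WeightQ','GroupCount'})
```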