# discretize

Group data into bins or categories

## Syntax

``Y = discretize(X,edges)``
``[Y,E] = discretize(X,N)``
``[Y,E] = discretize(X,dur)``
``[___] = discretize(___,values)``
``[___] = discretize(___,'categorical')``
``[___] = discretize(___,'categorical',displayFormat)``
``[___] = discretize(___,'categorical',categoryNames)``
``[___] = discretize(___,'IncludedEdge',side)``

## Description

`Y = discretize(X,edges)` returns the indices of the bins that contain the elements of `X`. The `j`th bin contains element `X(i)` if `edges(j) <= X(i) < edges(j+1)` for `1 <= j < N`, where `N` is the number of bins and `length(edges) = N+1`. The last bin contains both edges such that `edges(N) <= X(i) <= edges(N+1)`.
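The binning rule above can be written out by hand as a check. The following is a minimal sketch, not the function's actual implementation; the variable names (`Ymanual`, `inBin`, `sameResult`) are illustrative assumptions.

```matlab
% Manual version of the default (left-inclusive) binning rule,
% for illustration only; all variable names here are hypothetical.
X = [1 2 3.5 6 7];
edges = [2 4 6];
N = numel(edges) - 1;            % number of bins

Ymanual = nan(size(X));          % out-of-range elements stay NaN
for j = 1:N
    if j < N
        inBin = X >= edges(j) & X < edges(j+1);    % [edges(j), edges(j+1))
    else
        inBin = X >= edges(j) & X <= edges(j+1);   % last bin includes both edges
    end
    Ymanual(inBin) = j;
end

Y = discretize(X,edges);         % built-in result: NaN 1 1 2 NaN
sameResult = isequaln(Y,Ymanual) % isequaln treats NaN as equal to NaN
```

Here `1` and `7` lie outside `[2,6]`, and `6` lands in the last bin because that bin includes its right edge.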

`[Y,E] = discretize(X,N)` divides the range of `X` into `N` uniform bins, and also returns the bin edges `E`.
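As a rough illustration of this syntax, the sketch below inspects the returned edges rather than assuming particular values, since `discretize` chooses the edges itself. The sample data and the names `numEdges`, `coversData`, and `binWidths` are assumptions made for this example.

```matlab
% Ask for N uniform bins and inspect the edges that discretize picks.
X = [0.2 1.7 3.1 4.4 9.8];
N = 4;
[Y,E] = discretize(X,N);

numEdges   = numel(E);                            % N bins have N+1 edges
coversData = E(1) <= min(X) && E(end) >= max(X);  % edges span the data
binWidths  = diff(E);                             % equal up to round-off
```

Because the edges span the data, every element of `X` receives a bin index between 1 and `N`.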

`[Y,E] = discretize(X,dur)`, where `X` is a datetime or duration array, divides `X` into uniform bins of `dur` length of time. `dur` can be a scalar `duration` or `calendarDuration`, or a unit of time. For example, `[Y,E] = discretize(X,'hour')` divides `X` into bins with a uniform duration of 1 hour.
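A small deterministic sketch of the `dur` syntax with duration input follows. The sample values are assumptions chosen so that no element sits on an hour boundary, and the expected indices assume the bins start at 0 hr, as in the duration example later on this page.

```matlab
% Bin duration values into 1-hour bins and return the computed edges.
X = hours([0.5 1.5 1.7 3.2]);
[Y,E] = discretize(X,'hour');   % Y is expected to be 1 2 2 4

numBins = numel(E) - 1;         % E is a duration vector of bin edges
```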

`[___] = discretize(___,values)` returns the corresponding element in `values` rather than the bin number, using any of the previous input or output argument combinations. For example, if `X(1)` is in bin 5, then `Y(1)` is `values(5)` rather than `5`. `values` must be a vector with length equal to the number of bins.

`[___] = discretize(___,'categorical')` creates a categorical array where each bin is a category. In most cases, the default category names are of the form “`[A,B)`” (or “`[A,B]`” for the last bin), where `A` and `B` are consecutive bin edges. If you specify `dur` as a character vector, then the default category names might have special formats. See `Y` for a listing of the display formats.

`[___] = discretize(___,'categorical',displayFormat)`, for datetime or duration array inputs, uses the specified datetime or duration display format in the category names of the output.

`[___] = discretize(___,'categorical',categoryNames)` also names the categories in `Y` using the cell array of character vectors, `categoryNames`. The length of `categoryNames` must be equal to the number of bins.
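To make the categorical syntaxes concrete, here is a brief sketch that uses custom category names. The labels `'low'`, `'mid'`, and `'high'` and the variable names are assumptions for illustration only.

```matlab
% Three bins, so three category names are required.
X = [1 3 5];
edges = [0 2 4 6];

Ydefault = discretize(X,edges,'categorical');   % default interval-style names
Ynamed   = discretize(X,edges,'categorical',{'low','mid','high'});

names = cellstr(Ynamed);    % {'low','mid','high'}
cats  = categories(Ynamed); % category list, one entry per bin
```

Calling `categories(Ydefault)` in the same session shows the default names generated from the bin edges.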

`[___] = discretize(___,'IncludedEdge',side)`, where `side` is `'left'` or `'right'`, specifies whether each bin includes its right or left bin edge. For example, if `side` is `'right'`, then each bin includes the right bin edge, except for the first bin which includes both edges. In this case, the `j`th bin contains an element `X(i)` if `edges(j) < X(i) <= edges(j+1)`, where `1 < j <= N` and `N` is the number of bins. The first bin includes the left edge such that it contains `edges(1) <= X(i) <= edges(2)`. The default for `side` is `'left'`.

## Examples

Use `discretize` to group numeric values into discrete bins. `edges` defines five bin edges, so there are four bins.

`data = [1 1 2 3 6 5 8 10 4 4]`
```data = 1 1 2 3 6 5 8 10 4 4 ```
`edges = 2:2:10`
```edges = 2 4 6 8 10 ```
`Y = discretize(data,edges)`
```Y = NaN NaN 1 1 3 2 4 4 2 2 ```

`Y` indicates which bin each element of `data` belongs to. Because the value `1` falls outside the range of the bins, `Y` contains `NaN` for those elements.

Group random data into three bins. Specify a second output to return the bin edges calculated by `discretize`.

```X = randn(15,1); [Y,E] = discretize(X,3)```
```Y = 15x1 2 2 1 2 2 1 1 2 3 2 ⋮ ```
```E = -3 0 3 6 ```

Create a 10-by-1 datetime vector with random dates in the year 2016. Then, group the datetime values by month and return the result as a categorical array.

`X = datetime(2016,1,randi(365,10,1))`
```X = 10x1 datetime array 24-Oct-2016 26-Nov-2016 16-Feb-2016 29-Nov-2016 18-Aug-2016 05-Feb-2016 11-Apr-2016 18-Jul-2016 15-Dec-2016 18-Dec-2016 ```
`Y = discretize(X,'month','categorical')`
```Y = 10x1 categorical array Oct-2016 Nov-2016 Feb-2016 Nov-2016 Aug-2016 Feb-2016 Apr-2016 Jul-2016 Dec-2016 Dec-2016 ```

Group duration values by hour and return the result in a variety of display formats.

Group some random duration values by hour and return the results as a categorical array.

`X = hours(abs(randn(1,10)))'`
```X = 10x1 duration array 0.53767 hr 1.8339 hr 2.2588 hr 0.86217 hr 0.31877 hr 1.3077 hr 0.43359 hr 0.34262 hr 3.5784 hr 2.7694 hr ```
`Y = discretize(X,'hour','categorical')`
```Y = 10x1 categorical array [0 hr, 1 hr) [1 hr, 2 hr) [2 hr, 3 hr) [0 hr, 1 hr) [0 hr, 1 hr) [1 hr, 2 hr) [0 hr, 1 hr) [0 hr, 1 hr) [3 hr, 4 hr] [2 hr, 3 hr) ```

Change the display of the results to be a number of minutes.

`Y = discretize(X,'hour','categorical','m')`
```Y = 10x1 categorical array [0 min, 60 min) [60 min, 120 min) [120 min, 180 min) [0 min, 60 min) [0 min, 60 min) [60 min, 120 min) [0 min, 60 min) [0 min, 60 min) [180 min, 240 min] [120 min, 180 min) ```

Change the format again to display as a number of hours, minutes and seconds.

`Y = discretize(X,'hour','categorical','hh:mm:ss')`
```Y = 10x1 categorical array [00:00:00, 01:00:00) [01:00:00, 02:00:00) [02:00:00, 03:00:00) [00:00:00, 01:00:00) [00:00:00, 01:00:00) [01:00:00, 02:00:00) [00:00:00, 01:00:00) [00:00:00, 01:00:00) [03:00:00, 04:00:00] [02:00:00, 03:00:00) ```

Use the right edge of each bin as the `values` input. Each element is then less than or equal to the value returned for its bin. Equality occurs only for an element that lies on the right edge of the last bin.

```X = randi(100,1,10); edges = 0:25:100; values = edges(2:end); Y = discretize(X,edges,values)```
```Y = 100 100 25 100 75 25 50 75 100 100 ```

Use the `'IncludedEdge'` input to specify that each bin includes its right bin edge. The first bin includes both edges. Compare the result to the default inclusion of left bin edges.

```X = 1:2:11; edges = [1 3 4 7 10 11]; Y = discretize(X,edges,'IncludedEdge','right')```
```Y = 1 1 3 3 4 5 ```
`Z = discretize(X,edges)`
```Z = 1 2 3 4 4 5 ```

Group numeric data into a categorical array. Use the result to confirm the amount of data that falls within 1 standard deviation of the mean value.

Group normally distributed data into bins according to the distance from the mean, measured in standard deviations.

```X = randn(1000,1); edges = std(X)*(-3:3); Y = discretize(X,edges,'categorical',{'-3sigma','-2sigma','-sigma','sigma','2sigma','3sigma'});```

`Y` contains undefined categorical values for the elements in `X` that are farther than 3 standard deviations from the mean.

Preview the values in `Y`.

`Y(1:15)`
```ans = 15x1 categorical array sigma 2sigma -3sigma sigma sigma -2sigma -sigma sigma <undefined> 3sigma -2sigma <undefined> sigma -sigma sigma ```

Confirm that approximately 68% of the data falls within one standard deviation of the mean.

`nnz(Y=='-sigma' | Y=='sigma')/numel(Y)`
```ans = 0.6910 ```

## Input Arguments

`X`: Input array, specified as a vector, matrix, or multidimensional array. `X` contains the data that you want to distribute into bins.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `logical` | `datetime` | `duration`

`edges`: Bin edges, specified as a monotonically increasing numeric vector. Consecutive elements in `edges` form discrete bins, which `discretize` uses to partition the data in `X`. By default, each bin includes the left bin edge, except for the last bin, which includes both bin edges.

`edges` must have at least two elements, since `edges(1)` is the left edge of the first bin and `edges(end)` is the right edge of the last bin.

Example: `Y = discretize([1 3 5],[0 2 4 6])` distributes the values `1`, `3`, and `5` into three bins, which have edges `[0,2)`, `[2,4)`, and `[4,6]`.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `logical` | `datetime` | `duration`

`N`: Number of bins, specified as a scalar integer.

Example: `[Y,E] = discretize(X,5)` divides `X` into 5 bins with a uniform width.

`dur`: Uniform bin duration, specified as a scalar `duration` or `calendarDuration`, or as one of the values in the table.

If you specify `dur`, then `discretize` can use a maximum of 65,536 bins (or `2^16`). If the specified bin duration requires more bins, then `discretize` uses a larger bin width corresponding to the maximum number of bins.

| Value | Works with... | Description |
| --- | --- | --- |
| `'second'` | Datetime or duration values | Each bin is 1 second. |
| `'minute'` | Datetime or duration values | Each bin is 1 minute. |
| `'hour'` | Datetime or duration values | Each bin is 1 hour. |
| `'day'` | Datetime or duration values | For datetime inputs, each bin is 1 calendar day. This value accounts for Daylight Saving Time shifts. For duration inputs, each bin is 1 fixed-length day (24 hours). |
| `'week'` | Datetime values | Each bin is 1 calendar week. |
| `'month'` | Datetime values | Each bin is 1 calendar month. |
| `'quarter'` | Datetime values | Each bin is 1 calendar quarter. |
| `'year'` | Datetime or duration values | For datetime inputs, each bin is 1 calendar year. This value accounts for leap days. For duration inputs, each bin is 1 fixed-length year (365.2425 days). |
| `'decade'` | Datetime values | Each bin is 1 decade (10 calendar years). |
| `'century'` | Datetime values | Each bin is 1 century (100 calendar years). |

Example: `[Y,E] = discretize(X,'hour')` divides `X` into bins with a uniform duration of 1 hour.

Data Types: `char` | `duration` | `calendarDuration`

`values`: Bin values, specified as a vector of any data type. `values` must have the same length as the number of bins, `length(edges)-1`. The elements in `values` replace the normal bin index in the output. That is, if `X(1)` falls into bin `2`, then `discretize` returns `Y(1)` as `values(2)` rather than `2`.

If `values` is a cell array, then all the input data must belong to a bin.

Example: ```Y = discretize(randi(5,10,1),[1 1.5 3 5],diff([1 1.5 3 5]))``` returns the widths of the bins, rather than indices ranging from 1 to 3.
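The sketch below contrasts numeric and cell-array `values` on the same data. The labels and variable names are assumptions, and every element is kept inside the edges because cell-array `values` require all data to belong to a bin.

```matlab
% Map each element to a per-bin value instead of a bin index.
X = [1 3 5 2.5];
edges = [0 2 4 6];

Ynum  = discretize(X,edges,[10 20 30]);            % 10 20 30 20
Ycell = discretize(X,edges,{'low','mid','high'});  % cell array of labels
```

`Ycell` should hold `'low'`, `'mid'`, `'high'`, and `'mid'`, mirroring the bin indices `1 2 3 2`.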

`displayFormat`: Datetime and duration display format, specified as a character vector. The `displayFormat` value does not change the values in `Y`, only their display. You can specify `displayFormat` using any valid display format for datetime and duration arrays. For more information about the available options, see Set Date and Time Display Format.

Example: `discretize(X,'day','categorical','h')` specifies a display format for a duration array.

Example: `discretize(X,'day','categorical','yyyy-MM-dd')` specifies a display format for a datetime array.

Data Types: `char`

`categoryNames`: Categorical array category names, specified as a cell array of character vectors. `categoryNames` must have length equal to the number of bins.

Example: ```Y = discretize(randi(5,10,1),[1 1.5 3 5],'categorical',{'A' 'B' 'C'})``` distributes the data into three categories, `A`, `B`, and `C`.

Data Types: `cell`

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside single quotes (`' '`). You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `Y = discretize(X,edges,'IncludedEdge','right')`

Edges to include in each bin, specified as the comma-separated pair consisting of `'IncludedEdge'` and one of these values:

• `'left'` — All bins include the left bin edge, except for the last bin, which includes both edges. This is the default.

• `'right'` — All bins include the right bin edge, except for the first bin, which includes both edges.

Example: `Y = discretize(randi(11,10,1),1:2:11,'IncludedEdge','right')` includes the right bin edge in each bin.

## Output Arguments


`Y`: Bins, returned as a numeric vector, matrix, multidimensional array, or ordinal categorical array. `Y` is the same size as `X`, and each element describes the bin placement for the corresponding element in `X`. If `values` is specified, then the data type of `Y` is the same as `values`. Out-of-range elements are expressed differently depending on the data type of the output:

• For numeric outputs, `Y` contains `NaN` values for out-of-range elements in `X` (where ```X(i) < edges(1)``` or `X(i) > edges(end)`), or where `X` contains a `NaN`.

• If `Y` is a categorical array, then it contains undefined elements for out-of-range or `NaN` inputs.

• If `values` is a vector of an integer data type, then `Y` contains `0` for out-of-range or `NaN` inputs.
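The three cases in the list above can be compared side by side. This is a minimal sketch with assumed sample data: two out-of-range values and one `NaN`.

```matlab
% Out-of-range and NaN inputs for each output type.
X = [-1 1 7 NaN];
edges = [0 2 4 6];

Ynum  = discretize(X,edges);                    % NaN 1 NaN NaN
Ycat  = discretize(X,edges,'categorical');
undef = isundefined(Ycat);                      % true false true true
Yint  = discretize(X,edges,int8([10 20 30]));   % int8: 0 10 0 0
```

Because integer types cannot store `NaN`, `0` marks the unbinned elements in `Yint`. Keep this in mind if `0` is also a legitimate entry in `values`.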

The default category name formats in `Y` for the syntax `discretize(X,dur,'categorical')` are:

| Value of `dur` | Default Category Name Format | Format Example |
| --- | --- | --- |
| `'second'`, `'minute'`, `'hour'` | global default format | `28-Jan-2016 10:32:06` |
| `'day'` | global default date format | `28-Jan-2016` |
| `'week'` | `[global_default_date_format, global_default_date_format)` | `[24-Jan-2016, 30-Jan-2016)` |
| `'month'` | `'MMM-uuuu'` | `Jun-2016` |
| `'quarter'` | `'QQQ uuuu'` | `Q4 2015` |
| `'year'` | `'uuuu'` | `2016` |
| `'decade'`, `'century'` | `'[uuuu, uuuu)'` | `[2010, 2020)` |

`E`: Bin edges, returned as a vector. Specify this output to see the bin edges that `discretize` calculates in cases where you do not explicitly pass in the bin edges.

## Tips

• The behavior of `discretize` is similar to that of the `histcounts` function. Use `histcounts` to find the number of elements in each bin. On the other hand, use `discretize` to find which bin each element belongs to (without counting).
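To illustrate the relationship described in this tip, the sketch below bins the data from the first example with both functions and then rebuilds the counts from the `discretize` output. The helper variable names (`inRange`, `countsFromY`) are assumptions for this example.

```matlab
% Compare per-bin counts (histcounts) with per-element bin indices (discretize).
data = [1 1 2 3 6 5 8 10 4 4];
edges = 2:2:10;

counts = histcounts(data,edges);   % 2 3 1 2
Y = discretize(data,edges);        % NaN NaN 1 1 3 2 4 4 2 2

% Rebuild the counts from the bin indices, skipping out-of-range elements.
inRange = ~isnan(Y);
countsFromY = accumarray(Y(inRange).',1,[numel(edges)-1, 1]).';
```

Both `counts` and `countsFromY` evaluate to `2 3 1 2`. The two `1` values in `data` are counted in no bin.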