discretize

Group data into bins or categories

Syntax

Y = discretize(X,edges)

[Y,E] = discretize(X,N)

[Y,E] = discretize(X,dur)

[___] = discretize(___,values)

[___] = discretize(___,'categorical')

[___] = discretize(___,'categorical',displayFormat)

[___] = discretize(___,'categorical',categoryNames)

[___] = discretize(___,'IncludedEdge',side)

Description

Y = discretize(X,edges) returns the indices of the bins that contain the elements of X. The jth bin contains element X(i) if edges(j) <= X(i) < edges(j+1) for 1 <= j < N, where N is the number of bins and length(edges) = N+1. The last bin contains both edges such that edges(N) <= X(i) <= edges(N+1).
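
The binning rule above can be sketched as a manual loop and compared against discretize. This is an illustrative check only, not a description of how discretize is implemented; the sample data and the variable Ymanual are assumptions made for the example.

```matlab
% Illustrative data (assumed for this sketch); 7 lies outside the edges.
X = [0 2 3.9 4 6 7];
edges = [0 2 4 6];
N = numel(edges) - 1;             % number of bins

Y = discretize(X,edges);

% Re-create the rule by hand: each bin includes its left edge,
% and the last bin includes both of its edges.
Ymanual = nan(size(X));
for j = 1:N
    if j < N
        inBin = X >= edges(j) & X < edges(j+1);
    else
        inBin = X >= edges(j) & X <= edges(j+1);
    end
    Ymanual(inBin) = j;
end

sameResult = isequaln(Y,Ymanual); % isequaln treats NaN values as equal
```

Here the value 6 equals edges(end) and still lands in the last bin, while 7 has no bin and maps to NaN.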

[Y,E] = discretize(X,N) divides the data in X into N bins of uniform width, and also returns the bin edges E.

[Y,E] = discretize(X,dur), where X is a datetime or duration array, divides X into uniform bins that each span the length of time dur. dur can be a scalar duration or calendarDuration, or a unit of time. For example, [Y,E] = discretize(X,'hour') divides X into bins with a uniform duration of 1 hour.
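
As a small sketch of this syntax, the following bins three assumed duration values by hour and inspects the returned edges. The sample values and the variable binWidths are illustrative choices, not part of the function.

```matlab
X = hours([0.5 1.8 3.6]);         % assumed sample durations
[Y,E] = discretize(X,'hour');     % Y holds bin indices, E the bin edges

binWidths = hours(diff(E));       % width of each bin, expressed in hours
```

Each entry of binWidths should be 1, since every bin spans one hour.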

[___] = discretize(___,values) returns the corresponding element in values rather than the bin number, using any of the previous input or output argument combinations. For example, if X(1) is in bin 5, then Y(1) is values(5) rather than 5. values must be a vector with length equal to the number of bins.

[___] = discretize(___,'categorical') creates a categorical array where each bin is a category. In most cases, the default category names are of the form “[A,B)” (or “[A,B]” for the last bin), where A and B are consecutive bin edges. If you specify dur as a character vector, then the default category names might have special formats. See Y for a listing of the display formats.

[___] = discretize(___,'categorical',displayFormat), for datetime or duration array inputs, uses the specified datetime or duration display format in the category names of the output.

[___] = discretize(___,'categorical',categoryNames) also names the categories in Y using the cell array of character vectors, categoryNames. The length of categoryNames must be equal to the number of bins.

[___] = discretize(___,'IncludedEdge',side), where side is 'left' or 'right', specifies whether each bin includes its right or left bin edge. For example, if side is 'right', then each bin includes the right bin edge, except for the first bin which includes both edges. In this case, the jth bin contains an element X(i) if edges(j) < X(i) <= edges(j+1), where 1 < j <= N and N is the number of bins. The first bin includes the left edge such that it contains edges(1) <= X(i) <= edges(2). The default for side is 'left'.

Examples

Group Data into Bins

Use discretize to group numeric values into discrete bins. edges defines five bin edges, so there are four bins.

data = [1 1 2 3 6 5 8 10 4 4]

data = 1×10

     1     1     2     3     6     5     8    10     4     4

edges = 2:2:10

edges = 1×5

     2     4     6     8    10

Y = discretize(data,edges)

Y = 1×10

   NaN   NaN     1     1     3     2     4     4     2     2

Y indicates which bin each element of data belongs to. Since the value 1 falls outside the range of the bins, Y contains NaN values for those elements.
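
One way to act on this, sketched under the assumption that out-of-range elements should simply be dropped, is to build a logical mask from the NaN entries with isnan. The variable names outOfRange, inRangeData, and inRangeBins are hypothetical.

```matlab
data = [1 1 2 3 6 5 8 10 4 4];
edges = 2:2:10;
Y = discretize(data,edges);

outOfRange = isnan(Y);            % true where an element has no bin
inRangeData = data(~outOfRange);  % keep only the elements that were binned
inRangeBins = Y(~outOfRange);     % bin index for each kept element
```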

Group Data into Specified Number of Bins

Group random data into three bins. Specify a second output to return the bin edges calculated by discretize.

X = randn(10,1);
[Y,E] = discretize(X,3)

E = 1×4

    -3     0     3     6
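
The properties of the computed edges can be checked with a short sketch. The seed and the helper variables (numEdges, isUniform, coversData) are assumptions for illustration; the exact edge values depend on the data.

```matlab
rng(0)                            % fix the seed so the sketch is repeatable
X = randn(10,1);
N = 3;
[Y,E] = discretize(X,N);

numEdges = numel(E);              % N bins are delimited by N+1 edges
widths = diff(E);
isUniform = all(abs(widths - widths(1)) < 1e-12);   % all bins equally wide
coversData = E(1) <= min(X) && E(end) >= max(X);    % edges span the data
```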

Group Datetime Data by Month

Create a 10-by-1 datetime vector with random dates in the year 2016. Then, group the datetime values by month and return the result as a categorical array.

X = datetime(2016,1,randi(365,10,1))

X = 10x1 datetime
   24-Oct-2016
   26-Nov-2016
   16-Feb-2016
   29-Nov-2016
   18-Aug-2016
   05-Feb-2016
   11-Apr-2016
   18-Jul-2016
   15-Dec-2016
   18-Dec-2016

Y = discretize(X,'month','categorical')

Y = 10x1 categorical
     Oct-2016 
     Nov-2016 
     Feb-2016 
     Nov-2016 
     Aug-2016 
     Feb-2016 
     Apr-2016 
     Jul-2016 
     Dec-2016 
     Dec-2016

Change Display Format of Duration Values

Group duration values by hour and return the result in a variety of display formats.

Group some random duration values by hour and return the results as a categorical array.

X = hours(abs(randn(1,10)))'

X = 10x1 duration
   0.53767 hr
    1.8339 hr
    2.2588 hr
   0.86217 hr
   0.31877 hr
    1.3077 hr
   0.43359 hr
   0.34262 hr
    3.5784 hr
    2.7694 hr

Y = discretize(X,'hour','categorical')

Y = 10x1 categorical
     [0 hr, 1 hr) 
     [1 hr, 2 hr) 
     [2 hr, 3 hr) 
     [0 hr, 1 hr) 
     [0 hr, 1 hr) 
     [1 hr, 2 hr) 
     [0 hr, 1 hr) 
     [0 hr, 1 hr) 
     [3 hr, 4 hr] 
     [2 hr, 3 hr)

Change the display of the results to be a number of minutes.

Y = discretize(X,'hour','categorical','m')

Y = 10x1 categorical
     [0 min, 60 min) 
     [60 min, 120 min) 
     [120 min, 180 min) 
     [0 min, 60 min) 
     [0 min, 60 min) 
     [60 min, 120 min) 
     [0 min, 60 min) 
     [0 min, 60 min) 
     [180 min, 240 min] 
     [120 min, 180 min)

Change the format again to display as a number of hours, minutes and seconds.

Y = discretize(X,'hour','categorical','hh:mm:ss')

Y = 10x1 categorical
     [00:00:00, 01:00:00) 
     [01:00:00, 02:00:00) 
     [02:00:00, 03:00:00) 
     [00:00:00, 01:00:00) 
     [00:00:00, 01:00:00) 
     [01:00:00, 02:00:00) 
     [00:00:00, 01:00:00) 
     [00:00:00, 01:00:00) 
     [03:00:00, 04:00:00] 
     [02:00:00, 03:00:00)

Assign Bin Values

Use the right edge of each bin as the values input. The elements in each bin are then strictly less than the bin value, with one exception: an element equal to edges(end) falls in the last bin and equals its bin value.

X = randi(100,1,10);
edges = 0:25:100;
values = edges(2:end);
Y = discretize(X,edges,values)

Y = 1×10

   100   100    25   100    75    25    50    75   100   100
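
The same pattern works with other choices of values. As a sketch, the following maps each element to the center of its bin instead of the right edge; the sample data and the variable centers are assumptions for illustration.

```matlab
edges = 0:25:100;
centers = edges(1:end-1) + diff(edges)/2;   % midpoint of each bin

X = [3 30 55 99 100];                       % assumed sample data
Y = discretize(X,edges,centers);            % bin center for each element
```

The value 100 equals the last edge, so it is assigned to the last bin and receives that bin's center.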

Include Right Edge of Each Bin

Use the 'IncludedEdge' input to specify that each bin includes its right bin edge. The first bin includes both edges. Compare the result to the default inclusion of left bin edges.

X = 1:2:11;
edges = [1 3 4 7 10 11];
Y = discretize(X,edges,'IncludedEdge','right')

Y = 1×6

     1     1     3     3     4     5

Z = discretize(X,edges)

Z = 1×6

     1     2     3     4     4     5
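
The two results differ only for elements that lie exactly on an interior bin edge. A brief sketch of that check follows; the variable names onInteriorEdge and differs are hypothetical.

```matlab
X = 1:2:11;
edges = [1 3 4 7 10 11];
Y = discretize(X,edges,'IncludedEdge','right');
Z = discretize(X,edges);

onInteriorEdge = ismember(X,edges(2:end-1));  % elements equal to an interior edge
differs = Y ~= Z;                             % elements whose bin changes with side
sameMask = isequal(differs,onInteriorEdge);
```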

Group Data into Categorical Array

Group numeric data into a categorical array. Use the result to confirm the amount of data that falls within 1 standard deviation of the mean value.

Group normally distributed data into bins according to the distance from the mean, measured in standard deviations.

X = randn(1000,1);
edges = std(X)*(-3:3);
Y = discretize(X,edges, 'categorical', ...
    {'-3sigma', '-2sigma', '-sigma', 'sigma', '2sigma', '3sigma'});

Y contains undefined categorical values for the elements in X that are farther than 3 standard deviations from the mean.

Preview the values in Y.

Y(1:15)

ans = 15x1 categorical
     sigma 
     2sigma 
     -3sigma 
     sigma 
     sigma 
     -2sigma 
     -sigma 
     sigma 
     <undefined> 
     3sigma 
     -2sigma 
     <undefined> 
     sigma 
     -sigma 
     sigma

Confirm that approximately 68% of the data falls within one standard deviation of the mean.

nnz(Y=='-sigma' | Y=='sigma')/numel(Y)

ans = 0.6910
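
To see the share of data in every bin at once, one option is countcats, which counts the elements in each category. This is a sketch with an assumed seed, so the numbers it produces can differ from the output shown above; names, fractions, and numUndefined are illustrative variables.

```matlab
rng(0)                                % assumed seed, for repeatability only
X = randn(1000,1);
edges = std(X)*(-3:3);
names = {'-3sigma','-2sigma','-sigma','sigma','2sigma','3sigma'};
Y = discretize(X,edges,'categorical',names);

counts = countcats(Y);                % elements per category, in bin order
fractions = counts/numel(Y);          % share of the data in each bin
numUndefined = nnz(isundefined(Y));   % elements beyond 3 standard deviations
```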

Input Arguments

`X` — Input array
vector | matrix | multidimensional array

Input array, specified as a vector, matrix, or multidimensional array. X contains the data that you want to distribute into bins.

`edges` — Bin edges
numeric vector

Bin edges, specified as a numeric vector whose values never decrease; consecutive elements can repeat. Consecutive elements in edges form discrete bins, which discretize uses to partition the data in X. By default, each bin includes the left bin edge, except for the last bin, which includes both bin edges.

edges must have at least two elements, since edges(1) is the left edge of the first bin and edges(end) is the right edge of the last bin.

Example: Y = discretize([1 3 5],[0 2 4 6]) distributes the values 1, 3, and 5 into three bins, which have edges [0,2), [2,4), and [4,6].

`N` — Number of bins
scalar integer

Number of bins, specified as a scalar integer.

discretize divides the data into N bins of uniform width, choosing the bin edges to be "nice" numbers that overlap the range of the data. The largest and smallest elements in X do not typically fall right on the bin edges. If the data is unevenly distributed, then some of the intermediate bins can be empty. However, the first and last bins each always contain at least one data point.

Example: [Y,E] = discretize(X,5) distributes the data in X into 5 bins with a uniform width.

`dur` — Uniform bin duration
scalar `duration` | scalar `calendarDuration` | `'second'` | `'minute'` | `'hour'` | `'day'` | `'week'` | `'month'` | `'quarter'` | `'year'` | `'decade'` | `'century'`

Uniform bin duration, specified as a scalar duration or calendarDuration, or as one of the values in the table.

If you specify dur, then discretize can use a maximum of 65,536 bins (or 2¹⁶). If the specified bin duration requires more bins, then discretize uses a larger bin width corresponding to the maximum number of bins.

Value	Works with...	Description
`'second'`	Datetime or duration values	Each bin is 1 second.
`'minute'`	Datetime or duration values	Each bin is 1 minute.
`'hour'`	Datetime or duration values	Each bin is 1 hour.
`'day'`	Datetime or duration values	For datetime inputs, each bin is 1 calendar day. This value accounts for Daylight Saving Time shifts. For duration inputs, each bin is 1 fixed-length day (24 hours).
`'week'`	Datetime values	Each bin is 1 calendar week.
`'month'`	Datetime values	Each bin is 1 calendar month.
`'quarter'`	Datetime values	Each bin is 1 calendar quarter.
`'year'`	Datetime or duration values	For datetime inputs, each bin is 1 calendar year. This value accounts for leap days. For duration inputs, each bin is 1 fixed-length year (365.2425 days).
`'decade'`	Datetime values	Each bin is 1 decade (10 calendar years).
`'century'`	Datetime values	Each bin is 1 century (100 calendar years).

Example: [Y,E] = discretize(X,'hour') divides X into bins with a uniform duration of 1 hour.

Data Types: char | duration | calendarDuration

`values` — Bin values
vector

Bin values, specified as a vector of any data type. values must have the same length as the number of bins, length(edges)-1. The elements in values replace the normal bin index in the output. That is, if X(1) falls into bin 2, then discretize returns Y(1) as values(2) rather than 2.

If values is a cell array, then all the input data must belong to a bin.

Example: Y = discretize(randi(5,10,1),[1 1.5 3 5],diff([1 1.5 3 5])) returns the widths of the bins, rather than indices ranging from 1 to 3.

`displayFormat` — Datetime and duration display format
character vector

Datetime and duration display format, specified as a character vector. The displayFormat value does not change the values in Y, only their display. You can specify displayFormat using any valid display format for datetime and duration arrays. For more information about the available options, see Set Date and Time Display Format.

Example: discretize(X,'day','categorical','h') specifies a display format for a duration array.

Example: discretize(X,'day','categorical','yyyy-MM-dd') specifies a display format for a datetime array.

Data Types: char

`categoryNames` — Categorical array category names
cell array of character vectors

Categorical array category names, specified as a cell array of character vectors. categoryNames must have length equal to the number of bins.

Example: Y = discretize(randi(5,10,1),[1 1.5 3 5],'categorical',{'A' 'B' 'C'}) distributes the data into three categories, A, B, and C.

Data Types: cell
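
Because the categorical output is ordinal, the named categories can be compared relationally. The following sketch assumes small sample data and shows one such comparison; atLeastB is a hypothetical variable name.

```matlab
X = [1 2 3 4 5];                      % assumed sample data
edges = [1 1.5 3 5];
Y = discretize(X,edges,'categorical',{'A' 'B' 'C'});

ordinalOutput = isordinal(Y);         % categories follow the bin order
atLeastB = Y >= 'B';                  % true for elements in category B or C
```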

`side` — Edge to include in each bin
`'left'` (default) | `'right'`

Edge to include in each bin, specified as one of these values:

'left' — All bins include the left bin edge, except for the last bin, which includes both edges. This is the default.
'right' — All bins include the right bin edge, except for the first bin, which includes both edges.

Example: Y = discretize(randi(11,10,1),1:2:11,'IncludedEdge','right') includes the right bin edge in each bin.

Output Arguments

`Y` — Bins
vector | matrix | multidimensional array | ordinal categorical array

Bins, returned as a numeric vector, matrix, multidimensional array, or ordinal categorical array. Y is the same size as X, and each element describes the bin placement for the corresponding element in X. If values is specified, then the data type of Y is the same as values. Out-of-range elements are expressed differently depending on the data type of the output:

For numeric outputs, Y contains NaN values for out-of-range elements in X (where X(i) < edges(1) or X(i) > edges(end)), or where X contains a NaN.
If Y is a categorical array, then it contains undefined elements for out-of-range or NaN inputs.
If values is a vector of an integer data type, then Y contains 0 for out-of-range or NaN inputs.
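
The three cases above can be compared side by side. This is a minimal sketch with assumed inputs, where 7 is out of range and the last element is NaN; the variable names are illustrative.

```matlab
X = [1 3 7 NaN];
edges = [0 2 4 6];

Ydouble = discretize(X,edges);                 % numeric bin indices
Yint = discretize(X,edges,int8([10 20 30]));   % integer bin values
Ycat = discretize(X,edges,'categorical');      % categorical bins

noBinDouble = isnan(Ydouble);                  % NaN marks elements with no bin
noBinInt = Yint == 0;                          % 0 marks elements with no bin
noBinCat = isundefined(Ycat);                  % <undefined> marks elements with no bin
```

Note that with integer values, a 0 in the output is ambiguous if 0 is also one of the bin values.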

The default category name formats in Y for the syntax discretize(X,dur,'categorical') are:

Value of `dur`	Default Category Name Format	Format Example
`'second'`	global default format	`28-Jan-2016 10:32:06`
`'minute'`	global default format	`28-Jan-2016 10:32:06`
`'hour'`	global default format	`28-Jan-2016 10:32:06`
`'day'`	global default date format	`28-Jan-2016`
`'week'`	`[global_default_date_format, global_default_date_format)`	`[24-Jan-2016, 30-Jan-2016)`
`'month'`	`'MMM-uuuu'`	`Jun-2016`
`'quarter'`	`'QQQ uuuu'`	`Q4 2015`
`'year'`	`'uuuu'`	`2016`
`'decade'`	`'[uuuu, uuuu)'`	`[2010, 2020)`
`'century'`	`'[uuuu, uuuu)'`	`[2000, 2100)`

`E` — Bin edges
vector

Bin edges, returned as a vector. Specify this output to see the bin edges that discretize calculates in cases where you do not explicitly pass in the bin edges.

E is returned as a row vector whenever discretize calculates the bin edges. If you pass in bin edges, then E retains the orientation of the edges input.

Tips

The behavior of discretize is similar to that of the histcounts function. Use histcounts to find the number of elements in each bin. On the other hand, use discretize to find which bin each element belongs to (without counting).
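
The relationship between the two functions can be sketched by rebuilding the histcounts result from the discretize output. The use of accumarray here is one possible approach, chosen for illustration, and countsFromY is a hypothetical variable name.

```matlab
X = [1 1 2 3 6 5 8 10 4 4];
edges = 2:2:10;

Y = discretize(X,edges);              % bin index for each element
counts = histcounts(X,edges);         % number of elements in each bin

% Tally the bin indices, skipping elements that fall outside the edges.
inRange = ~isnan(Y);
countsFromY = accumarray(Y(inRange)',1,[numel(edges)-1 1])';

sameCounts = isequal(counts,countsFromY);
```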

Extended Capabilities

Tall Arrays
Calculate with arrays that have more rows than fit in memory.

This function fully supports tall arrays. For more information, see Tall Arrays.

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Thread-Based Environment
Run code in the background using MATLAB® `backgroundPool` or accelerate code with Parallel Computing Toolbox™ `ThreadPool`.

This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Usage notes and limitations:

The categorical option is not supported.

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced in R2015a
