autobinning

Perform automatic binning of given predictors

Syntax

sc = autobinning(sc)
sc = autobinning(sc,PredictorNames)
sc = autobinning(___,Name,Value)

Description

sc = autobinning(sc) performs automatic binning of all predictors.

Automatic binning finds binning maps or rules to bin numeric data and to group categories of categorical data. The binning rules are stored in the creditscorecard object. To apply the binning rules to the creditscorecard object data, or to a new dataset, use bindata.

sc = autobinning(sc,PredictorNames) performs automatic binning of the predictors given in PredictorNames.

sc = autobinning(___,Name,Value) performs automatic binning of the predictors given in PredictorNames using optional name-value pair arguments. See the name-value argument Algorithm for a description of the supported binning algorithms.

Examples

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData
sc = creditscorecard(data,'IDVar','CustID');

Perform automatic binning using the default options. By default, autobinning bins all predictors and uses the Monotone algorithm.

sc = autobinning(sc);

Use bininfo to display the binned data for the predictor CustIncome.

bi = bininfo(sc, 'CustIncome')
bi=8x6 table
          Bin          Good    Bad     Odds         WOE       InfoValue 
    _______________    ____    ___    _______    _________    __________

    '[-Inf,29000)'      53      58    0.91379     -0.79457       0.06364
    '[29000,33000)'     74      49     1.5102     -0.29217     0.0091366
    '[33000,35000)'     68      36     1.8889     -0.06843    0.00041042
    '[35000,40000)'    193      98     1.9694    -0.026696    0.00017359
    '[40000,42000)'     68      34          2    -0.011271    1.0819e-05
    '[42000,47000)'    164      66     2.4848      0.20579     0.0078175
    '[47000,Inf]'      183      56     3.2679      0.47972      0.041657
    'Totals'           803     397     2.0227          NaN       0.12285
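
The Odds, WOE, and InfoValue columns follow from the Good and Bad counts alone. The sketch below recomputes them for the CustIncome bins shown above. It assumes the standard Weight-of-Evidence and information value definitions; this page does not state the formulas, but they reproduce the displayed values to the printed precision. The variable names are illustrative and are not part of the creditscorecard interface.

```matlab
% Good and Bad counts copied from the bininfo output for CustIncome.
Good = [53 74 68 193 68 164 183];
Bad  = [58 49 36  98 34  66  56];

% Odds: ratio of Good to Bad observations in each bin.
Odds = Good ./ Bad;

% Share of all Good (Bad) observations that falls in each bin.
pGood = Good / sum(Good);
pBad  = Bad  / sum(Bad);

% Assumed standard definitions of WOE and information value.
WOE       = log(pGood ./ pBad);
InfoValue = (pGood - pBad) .* WOE;

% Values for the 'Totals' row.
TotalOdds = sum(Good) / sum(Bad);
TotalIV   = sum(InfoValue);

% One line per bin: Odds, WOE, InfoValue.
fprintf('%8.5f  %9.5f  %10.7f\n', [Odds; WOE; InfoValue]);
fprintf('Totals: odds = %.4f, InfoValue = %.5f\n', TotalOdds, TotalIV);
```

The WOE of the 'Totals' row is reported as NaN because WOE is a per-bin quantity.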

Use plotbins to display the histogram and WOE curve for the predictor CustIncome.

plotbins(sc,'CustIncome')

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData 
sc = creditscorecard(data);

Perform automatic binning for the predictor CustIncome using the default options. By default, autobinning uses the Monotone algorithm.

sc = autobinning(sc,'CustIncome');

Use bininfo to display the binned data.

bi = bininfo(sc, 'CustIncome')
bi=8x6 table
          Bin          Good    Bad     Odds         WOE       InfoValue 
    _______________    ____    ___    _______    _________    __________

    '[-Inf,29000)'      53      58    0.91379     -0.79457       0.06364
    '[29000,33000)'     74      49     1.5102     -0.29217     0.0091366
    '[33000,35000)'     68      36     1.8889     -0.06843    0.00041042
    '[35000,40000)'    193      98     1.9694    -0.026696    0.00017359
    '[40000,42000)'     68      34          2    -0.011271    1.0819e-05
    '[42000,47000)'    164      66     2.4848      0.20579     0.0078175
    '[47000,Inf]'      183      56     3.2679      0.47972      0.041657
    'Totals'           803     397     2.0227          NaN       0.12285

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData 
sc = creditscorecard(data);

Perform automatic binning for the predictor CustIncome using the Monotone algorithm with the initial number of bins set to 20. This example explicitly sets both the Algorithm and the AlgorithmOptions name-value arguments.

AlgoOptions = {'InitialNumBins',20}; 
sc = autobinning(sc,'CustIncome','Algorithm','Monotone','AlgorithmOptions',...
     AlgoOptions);

Use bininfo to display the binned data. Here, the cut points, which delimit the bins, are also displayed.

[bi,cp] = bininfo(sc,'CustIncome')
bi=11x6 table
          Bin          Good    Bad     Odds         WOE       InfoValue 
    _______________    ____    ___    _______    _________    __________

    '[-Inf,19000)'       2       3    0.66667      -1.1099     0.0056227
    '[19000,29000)'     51      55    0.92727     -0.77993      0.058516
    '[29000,31000)'     29      26     1.1154     -0.59522      0.017486
    '[31000,34000)'     80      42     1.9048    -0.060061     0.0003704
    '[34000,35000)'     33      17     1.9412    -0.041124     7.095e-05
    '[35000,40000)'    193      98     1.9694    -0.026696    0.00017359
    '[40000,42000)'     68      34          2    -0.011271    1.0819e-05
    '[42000,43000)'     39      16     2.4375      0.18655      0.001542
    '[43000,47000)'    125      50        2.5      0.21187     0.0062972
    '[47000,Inf]'      183      56     3.2679      0.47972      0.041657
    'Totals'           803     397     2.0227          NaN       0.13175

cp = 

       19000
       29000
       31000
       34000
       35000
       40000
       42000
       43000
       47000
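
The cut points are the interior bin edges. As a rough illustration in plain MATLAB (not a toolbox call), a value x falls in bin number sum(x >= cp) + 1, because each bin is closed on the left and open on the right. The helper name binIndex is hypothetical. To bin a whole dataset with the stored rules, use bindata.

```matlab
% Cut points copied from the bininfo output above.
cp = [19000; 29000; 31000; 34000; 35000; 40000; 42000; 43000; 47000];

% Hypothetical helper: bin number of a scalar value x.
% Counting the cut points that x has reached gives the zero-based bin.
binIndex = @(x) sum(x >= cp) + 1;

b1 = binIndex(18500);   % bin 1,  '[-Inf,19000)'
b2 = binIndex(36000);   % bin 6,  '[35000,40000)'
b3 = binIndex(47000);   % bin 10, '[47000,Inf]'
```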

This example shows how to use the autobinning default Monotone algorithm and the AlgorithmOptions name-value pair arguments associated with it. The AlgorithmOptions for the Monotone algorithm are three name-value pair parameters: 'InitialNumBins', 'Trend', and 'SortCategories'. 'InitialNumBins' and 'Trend' apply to numeric predictors, and 'Trend' and 'SortCategories' apply to categorical predictors.

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData
sc = creditscorecard(data,'IDVar','CustID');

Perform automatic binning for the numeric predictor CustIncome using the Monotone algorithm with the initial number of bins set to 20. This example explicitly sets the Algorithm argument and the AlgorithmOptions name-value arguments for 'InitialNumBins' and 'Trend'.

AlgoOptions = {'InitialNumBins',20,'Trend','Increasing'};

sc = autobinning(sc,'CustIncome','Algorithm','Monotone',...
    'AlgorithmOptions',AlgoOptions);

Use bininfo to display the binned data.

bi = bininfo(sc,'CustIncome')
bi=11x6 table
          Bin          Good    Bad     Odds         WOE       InfoValue 
    _______________    ____    ___    _______    _________    __________

    '[-Inf,19000)'       2       3    0.66667      -1.1099     0.0056227
    '[19000,29000)'     51      55    0.92727     -0.77993      0.058516
    '[29000,31000)'     29      26     1.1154     -0.59522      0.017486
    '[31000,34000)'     80      42     1.9048    -0.060061     0.0003704
    '[34000,35000)'     33      17     1.9412    -0.041124     7.095e-05
    '[35000,40000)'    193      98     1.9694    -0.026696    0.00017359
    '[40000,42000)'     68      34          2    -0.011271    1.0819e-05
    '[42000,43000)'     39      16     2.4375      0.18655      0.001542
    '[43000,47000)'    125      50        2.5      0.21187     0.0062972
    '[47000,Inf]'      183      56     3.2679      0.47972      0.041657
    'Totals'           803     397     2.0227          NaN       0.13175

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData 
sc = creditscorecard(data,'IDVar','CustID');

Perform automatic binning for the predictors CustIncome and CustAge using the default Monotone algorithm with AlgorithmOptions for InitialNumBins and Trend.

AlgoOptions = {'InitialNumBins',20,'Trend','Increasing'};

sc = autobinning(sc,{'CustAge','CustIncome'},'Algorithm','Monotone',...
    'AlgorithmOptions',AlgoOptions);

Use bininfo to display the binned data.

bi1 = bininfo(sc, 'CustIncome')
bi1=11x6 table
          Bin          Good    Bad     Odds         WOE       InfoValue 
    _______________    ____    ___    _______    _________    __________

    '[-Inf,19000)'       2       3    0.66667      -1.1099     0.0056227
    '[19000,29000)'     51      55    0.92727     -0.77993      0.058516
    '[29000,31000)'     29      26     1.1154     -0.59522      0.017486
    '[31000,34000)'     80      42     1.9048    -0.060061     0.0003704
    '[34000,35000)'     33      17     1.9412    -0.041124     7.095e-05
    '[35000,40000)'    193      98     1.9694    -0.026696    0.00017359
    '[40000,42000)'     68      34          2    -0.011271    1.0819e-05
    '[42000,43000)'     39      16     2.4375      0.18655      0.001542
    '[43000,47000)'    125      50        2.5      0.21187     0.0062972
    '[47000,Inf]'      183      56     3.2679      0.47972      0.041657
    'Totals'           803     397     2.0227          NaN       0.13175

bi2 = bininfo(sc, 'CustAge')
bi2=8x6 table
        Bin        Good    Bad     Odds        WOE       InfoValue 
    ___________    ____    ___    ______    _________    __________

    '[-Inf,35)'     93      76    1.2237     -0.50255      0.038003
    '[35,40)'      114      71    1.6056      -0.2309     0.0085141
    '[40,42)'       52      30    1.7333     -0.15437     0.0016687
    '[42,44)'       58      32    1.8125     -0.10971    0.00091888
    '[44,47)'       97      51     1.902    -0.061533    0.00047174
    '[47,62)'      333     130    2.5615      0.23619      0.020605
    '[62,Inf]'      56       7         8        1.375      0.071647
    'Totals'       803     397    2.0227          NaN       0.14183

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData 
sc = creditscorecard(data);

Perform automatic binning for the categorical predictor ResStatus using the default options. By default, autobinning uses the Monotone algorithm.

sc = autobinning(sc,'ResStatus');

Use bininfo to display the binned data.

bi = bininfo(sc, 'ResStatus')
bi=4x6 table
        Bin         Good    Bad     Odds        WOE       InfoValue
    ____________    ____    ___    ______    _________    _________

    'Tenant'        307     167    1.8383    -0.095564    0.0036638
    'Home Owner'    365     177    2.0621     0.019329    0.0001682
    'Other'         131      53    2.4717      0.20049    0.0059418
    'Totals'        803     397    2.0227          NaN    0.0097738

This example shows how to modify the data (for this example only) to illustrate binning categorical predictors using the Monotone algorithm.

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData

Add two new categories and update the response variable.

newdata = data;
rng('default'); %for reproducibility
Predictor = 'ResStatus';
Status    = newdata.status;
NumObs    = length(newdata.(Predictor));
Ind1 = randi(NumObs,100,1);
Ind2 = randi(NumObs,100,1);
newdata.(Predictor)(Ind1) = 'Subtenant';
newdata.(Predictor)(Ind2) = 'CoOwner';
Status(Ind1) = randi(2,100,1)-1;
Status(Ind2) = randi(2,100,1)-1;

newdata.status = Status;

Create a new creditscorecard object from newdata, then display the bin information and plot the bins for later comparison.

scnew = creditscorecard(newdata,'IDVar','CustID');
[bi,cg] = bininfo(scnew,Predictor)
bi=6x6 table
        Bin         Good    Bad     Odds       WOE       InfoValue
    ____________    ____    ___    ______    ________    _________

    'Home Owner'    308     154         2    0.092373    0.0032392
    'Tenant'        264     136    1.9412     0.06252    0.0012907
    'Other'         109      49    2.2245     0.19875    0.0050386
    'Subtenant'      42      42         1    -0.60077     0.026813
    'CoOwner'        52      44    1.1818    -0.43372     0.015802
    'Totals'        775     425    1.8235         NaN     0.052183

cg=5x2 table
      Category      BinNumber
    ____________    _________

    'Home Owner'    1        
    'Tenant'        2        
    'Other'         3        
    'Subtenant'     4        
    'CoOwner'       5        

plotbins(scnew,Predictor)

Perform automatic binning for the categorical Predictor using the default Monotone algorithm with the AlgorithmOptions name-value pair arguments for 'SortCategories' and 'Trend'.

AlgoOptions = {'SortCategories','Goods','Trend','Increasing'};

scnew = autobinning(scnew,Predictor,'Algorithm','Monotone',...
    'AlgorithmOptions',AlgoOptions);

Use bininfo to display the bin information. The second output parameter 'cg' captures the bin membership, which is the bin number that each category belongs to.

[bi,cg] = bininfo(scnew,Predictor)
bi=4x6 table
      Bin       Good    Bad     Odds       WOE       InfoValue
    ________    ____    ___    ______    ________    _________

    'Group1'     42      42         1    -0.60077     0.026813
    'Group2'     52      44    1.1818    -0.43372     0.015802
    'Group3'    681     339    2.0088    0.096788    0.0078459
    'Totals'    775     425    1.8235         NaN      0.05046

cg=5x2 table
      Category      BinNumber
    ____________    _________

    'Subtenant'     1        
    'CoOwner'       2        
    'Other'         3        
    'Tenant'        3        
    'Home Owner'    3        
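
The category grouping cg explains the counts in the binned table: the Good and Bad counts of each group are the sums over its member categories. The sketch below checks this with accumarray, using counts copied from the two bininfo outputs in this example. It is an illustration in plain MATLAB, not a toolbox call.

```matlab
% Categories in the order listed in cg after autobinning:
% Subtenant, CoOwner, Other, Tenant, Home Owner.
Good      = [42; 52; 109; 264; 308];
Bad       = [42; 44;  49; 136; 154];
BinNumber = [ 1;  2;   3;   3;   3];

% Sum the per-category counts within each bin number.
GroupGood = accumarray(BinNumber, Good);
GroupBad  = accumarray(BinNumber, Bad);
GroupOdds = GroupGood ./ GroupBad;

% Columns: Good, Bad, Odds for Group1 to Group3.
disp([GroupGood GroupBad GroupOdds])
```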

Plot bins and compare with the histogram plotted pre-binning.

plotbins(scnew,Predictor)

Input Arguments

sc

Credit scorecard model, specified as a creditscorecard object. Use creditscorecard to create a creditscorecard object.

PredictorNames

Name or names of the predictors to bin automatically, specified as a character vector or a cell array of character vectors. PredictorNames are case-sensitive. When PredictorNames is not specified, all predictors in the PredictorVars property of the creditscorecard object are binned.

Data Types: char | cell

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: sc = autobinning(sc,'Algorithm','EqualFrequency')

'Algorithm'

Algorithm selection, specified using a character vector indicating which algorithm to use. The same algorithm is used for all predictors in PredictorNames. Possible values are:

  • 'Monotone' — (default) Monotone Adjacent Pooling Algorithm (MAPA), also known as Maximum Likelihood Monotone Coarse Classifier (MLMCC). Supervised optimal binning algorithm that aims to find bins with a monotone Weight-Of-Evidence (WOE) trend. This algorithm assumes that only neighboring attributes can be grouped. Thus, for categorical predictors, categories are sorted before applying the algorithm (see 'SortCategories' option for AlgorithmOptions). For more information, see Monotone.

  • 'EqualFrequency' — Unsupervised algorithm that divides the data into a predetermined number of bins that contain approximately the same number of observations. This algorithm is also known as “equal height” or “equal depth.” For categorical predictors, categories are sorted before applying the algorithm (see 'SortCategories' option for AlgorithmOptions). For more information, see Equal Frequency.

  • 'EqualWidth' — Unsupervised algorithm that divides the range of values in the domain of the predictor variable into a predetermined number of bins of “equal width.” For numeric data, the width is measured as the distance between bin edges. For categorical data, width is measured as the number of categories within a bin. For categorical predictors, categories are sorted before applying the algorithm (see 'SortCategories' option for AlgorithmOptions). For more information, see Equal Width.

Data Types: char

'AlgorithmOptions'

Algorithm options for the selected Algorithm, specified using a cell array. Possible values are:

  • For Monotone algorithm:

    • {'InitialNumBins',n} — Initial number (n) of bins (default is 10). 'InitialNumBins' must be an integer > 2. Used for numeric predictors only.

    • {'Trend','TrendOption'} — Determines whether the Weight-Of-Evidence (WOE) monotonic trend is expected to be increasing or decreasing. The values for 'TrendOption' are:

      • 'Auto' — (Default) Automatically determines if the WOE trend is increasing or decreasing.

      • 'Increasing' — Look for an increasing WOE trend.

      • 'Decreasing' — Look for a decreasing WOE trend.

      The value of the optional input parameter 'Trend' does not necessarily reflect that of the resulting WOE curve. The parameter 'Trend' tells the algorithm to “look for” an increasing or decreasing trend, but the outcome may not show the desired trend. For example, the algorithm cannot find a decreasing trend when the data actually has an increasing WOE trend. For more information on the 'Trend' option, see Monotone.

    • {'SortCategories','SortOption'} — Used for categorical predictors only. Used to determine how the predictor categories are sorted as a preprocessing step before applying the algorithm. The values of 'SortOption' are:

      • 'Odds' — (default) The categories are sorted by order of increasing values of odds, defined as the ratio of “Good” to “Bad” observations, for the given category.

      • 'Goods' — The categories are sorted by order of increasing values of “Good.”

      • 'Bads' — The categories are sorted by order of increasing values of “Bad.”

      • 'Totals' — The categories are sorted by order of increasing values of total number of observations (“Good” plus “Bad”).

      • 'None' — No sorting is applied. The existing order of the categories is unchanged before applying the algorithm. (The existing order of the categories can be seen in the category grouping optional output from bininfo.)

      For more information, see Sort Categories.

  • For EqualFrequency algorithm:

    • {'NumBins',n} — Specifies the desired number (n) of bins. The default is {'NumBins',5} and the number of bins must be a positive number.

    • {'SortCategories','SortOption'} — Used for categorical predictors only. Used to determine how the predictor categories are sorted as a preprocessing step before applying the algorithm. The values of 'SortOption' are:

      • 'Odds' — (default) The categories are sorted by order of increasing values of odds, defined as the ratio of “Good” to “Bad” observations, for the given category.

      • 'Goods' — The categories are sorted by order of increasing values of “Good.”

      • 'Bads' — The categories are sorted by order of increasing values of “Bad.”

      • 'Totals' — The categories are sorted by order of increasing values of total number of observations (“Good” plus “Bad”).

      • 'None' — No sorting is applied. The existing order of the categories is unchanged before applying the algorithm. (The existing order of the categories can be seen in the category grouping optional output from bininfo.)

      For more information, see Sort Categories.

  • For EqualWidth algorithm:

    • {'NumBins',n} — Specifies the desired number (n) of bins. The default is {'NumBins',5} and the number of bins must be a positive number.

    • {'SortCategories','SortOption'} — Used for categorical predictors only. Used to determine how the predictor categories are sorted as a preprocessing step before applying the algorithm. The values of 'SortOption' are:

      • 'Odds' — (default) The categories are sorted by order of increasing values of odds, defined as the ratio of “Good” to “Bad” observations, for the given category.

      • 'Goods' — The categories are sorted by order of increasing values of “Good.”

      • 'Bads' — The categories are sorted by order of increasing values of “Bad.”

      • 'Totals' — The categories are sorted by order of increasing values of total number of observations (“Good” plus “Bad”).

      • 'None' — No sorting is applied. The existing order of the categories is unchanged before applying the algorithm. (The existing order of the categories can be seen in the category grouping optional output from bininfo.)

      For more information, see Sort Categories.

Example: sc = autobinning(sc,'CustAge','Algorithm','Monotone','AlgorithmOptions',{'Trend','Increasing'})

Data Types: cell

'Display'

Indicator to display information about the status of the binning process at the command line, specified using a character vector with a value of 'On' or 'Off'.

Data Types: char

Output Arguments

sc

Credit scorecard model, returned as an updated creditscorecard object containing the automatically determined binning maps or rules (cut points or category groupings) for one or more predictors. For more information on using the creditscorecard object, see creditscorecard.

Note

If you have previously used the modifybins function to manually modify bins, these changes are lost when running autobinning because all the data is automatically binned based on internal autobinning rules.

More About

Monotone

The 'Monotone' algorithm is an implementation of the Monotone Adjacent Pooling Algorithm (MAPA), also known as Maximum Likelihood Monotone Coarse Classifier (MLMCC); see Anderson or Thomas in the References.

Preprocessing

The preprocessing of numeric predictors consists of applying equal frequency binning, with the number of bins determined by the 'InitialNumBins' parameter (the default is 10 bins). The preprocessing of categorical predictors consists of sorting the categories according to the 'SortCategories' criterion (the default is to sort by odds in increasing order). Sorting is not applied to ordinal predictors. See the Sort Categories definition or the description of the AlgorithmOptions option for 'SortCategories' for more information.

Main Algorithm

The following example illustrates how the 'Monotone' algorithm arrives at cut points for numeric data.

          Bin          Good    Bad    Iteration1    Iteration2    Iteration3    Iteration4
    _______________    ____    ___    __________    __________    __________    __________

    '[-Inf,33000)'     127     107      0.543
    '[33000,38000)'    194      90      0.620         0.683
    '[38000,42000)'    135      78      0.624         0.662
    '[42000,47000)'    164      66      0.645         0.678         0.713
    '[47000,Inf]'      183      56      0.669         0.700         0.740         0.766

Initially, the numeric data is preprocessed with an equal frequency binning. In this example, for simplicity, only five initial bins are used. The first column indicates the equal frequency bin ranges, and the second and third columns have the “Good” and “Bad” counts per bin. (The number of observations is 1,200, so a perfect equal frequency binning would result in five bins with 240 observations each. In this case, the observations per bin do not match 240 exactly. This is a common situation when the data has repeated values.)

Monotone finds break points based on the cumulative proportion of “Good” observations. In the 'Iteration1' column, the first value (0.543) is the number of “Good” observations in the first bin (127), divided by the total number of observations in the bin (127+107). The second value (0.620) is the number of “Good” observations in bins 1 and 2, divided by the total number of observations in bins 1 and 2, and so forth. The first cut point is set where the minimum of this cumulative ratio is found, which is in the first bin in this example. This is the end of iteration 1.

Starting from the second bin (the first bin after the location of the minimum value in the previous iteration), cumulative proportions of “Good” observations are computed again. The second cut point is set where the minimum of this cumulative ratio is found. In this case, it happens to be in bin number 3, therefore bins 2 and 3 are merged.

The algorithm proceeds the same way for two more iterations. In this particular example, in the end it only merges bins 2 and 3. The final binning has four bins with cut points at 33,000, 42,000, and 47,000.
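
The iterations described above can be sketched in a few lines of plain MATLAB. This is only an illustration of the described procedure for an increasing trend, applied to the five bins of the table; it is not the toolbox implementation, and taking the first minimum when there are ties is an assumption. The variable names are hypothetical.

```matlab
% Counts and upper bin edges from the numeric example.
Good  = [127 194 135 164 183];
Bad   = [107  90  78  66  56];
Edges = [33000 38000 42000 47000];   % upper edges of bins 1 to 4

Total   = Good + Bad;
n       = numel(Good);
lastBin = [];    % last initial bin contained in each final bin
start   = 1;

while start <= n
    % Cumulative proportion of "Good", restarted at the current bin.
    cumRatio = cumsum(Good(start:n)) ./ cumsum(Total(start:n));

    % The next cut goes where the cumulative ratio is smallest.
    [~, k] = min(cumRatio);
    lastBin(end+1) = start + k - 1;

    % Continue from the first bin after the minimum.
    start = start + k;
end

% Every final bin except the last one ends at a cut point.
CutPoints = Edges(lastBin(1:end-1));
```

For this data the loop runs four times and lastBin is [1 3 4 5], so initial bins 2 and 3 end up in the same final bin and the cut points are 33000, 42000, and 47000, as stated above.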

For categorical data, the only difference is that the preprocessing step consists of reordering the categories. Consider the following categorical data:

        Bin         Good    Bad    Odds 
    ____________    ____    ___    _____

    'Home Owner'    365     177    2.062
    'Tenant'        307     167    1.838
    'Other'         131      53    2.472

The preprocessing step, by default, sorts the categories by 'Odds'. (See the Sort Categories definition or the description of AlgorithmOptions option for 'SortCategories' for more information.) Then, it applies the same steps described above, shown in the following table:

        Bin         Good    Bad    Odds     Iteration1    Iteration2    Iteration3
    ____________    ____    ___    _____    __________    __________    __________

    'Tenant'        307     167    1.838      0.648
    'Home Owner'    365     177    2.062      0.661         0.673
    'Other'         131      53    2.472      0.669         0.683         0.712

In this case, the Monotone algorithm would not merge any categories. The only difference, compared with the data before the application of the algorithm, is that the categories are now sorted by 'Odds'.

In both the numeric and categorical examples above, the implicit 'Trend' choice is 'Increasing'. (See the description of AlgorithmOptions option for the 'Monotone' 'Trend' option.) If you set the trend to 'Decreasing', the algorithm looks for the maximum (instead of the minimum) cumulative ratios to determine the cut points. In that case, at iteration 1, the maximum would be in the last bin, which would imply that all bins should be merged into a single bin. Binning into a single bin is a total loss of information and has no practical use. Therefore, when the chosen trend leads to a single bin, the Monotone implementation rejects it, and the algorithm returns the bins found after the preprocessing step. This state is the initial equal frequency binning for numeric data and the sorted categories for categorical data. The implementation of the Monotone algorithm by default uses a heuristic to identify the trend ('Auto' option for 'Trend').
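
To see why a 'Decreasing' trend collapses the numeric example into one bin, compare where the first-iteration cumulative ratio attains its minimum and its maximum. This sketch reuses the counts from the numeric example. The fallback to the preprocessing bins happens inside autobinning and is not reproduced here; the variable names are hypothetical.

```matlab
% Counts from the numeric example (five initial bins).
Good = [127 194 135 164 183];
Bad  = [107  90  78  66  56];

% Cumulative proportion of "Good" observations at iteration 1.
cumRatio = cumsum(Good) ./ cumsum(Good + Bad);

% 'Increasing' looks for the minimum, 'Decreasing' for the maximum.
[~, kIncreasing] = min(cumRatio);
[~, kDecreasing] = max(cumRatio);

% A first cut located in the last bin means everything is merged.
collapsesToOneBin = (kDecreasing == numel(Good));
```

Here kIncreasing is 1 and kDecreasing is 5, so the decreasing search would place its first and only cut at the last bin.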

Equal Frequency

Unsupervised algorithm that divides the data into a predetermined number of bins that contain approximately the same number of observations.

EqualFrequency is defined as:

Let v[1], v[2],..., v[N] be the sorted list of different values or categories observed in the data. Let f[i] be the frequency of v[i]. Let F[k] = f[1]+...+f[k] be the cumulative sum of frequencies up to the kth sorted value. Then F[N] is the same as the total number of observations.

Define AvgFreq = F[N] / NumBins, which is the ideal average frequency per bin after binning. The nth cut point index is the index k such that the distance abs(F[k] - n*AvgFreq) is minimized.

This rule attempts to match the cumulative frequency up to the nth bin. If a single value contains too many observations, equal frequency bins are not possible, and the above rule yields fewer than NumBins total bins. In that case, the algorithm determines NumBins bins by breaking up bins, in the order in which the bins were constructed.
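
The cut point index rule can be sketched as follows. The frequencies are hypothetical, and the sketch covers only the basic rule, not the case where a single value holds too many observations. The per-bin frequencies at the end assume that each cut falls immediately after v[k], which is an interpretation rather than something this page states.

```matlab
% Hypothetical frequencies f(i) of the sorted distinct values v(i).
f       = [10 20 30 15 25 20 40 40];
NumBins = 4;

F       = cumsum(f);            % cumulative frequencies F(k)
AvgFreq = F(end) / NumBins;     % ideal frequency per bin

% nth cut point index: the k that minimizes abs(F(k) - n*AvgFreq).
cutIdx = zeros(1, NumBins - 1);
for n = 1:NumBins - 1
    [~, cutIdx(n)] = min(abs(F - n * AvgFreq));
end

% Resulting bin frequencies, assuming each cut falls right after v(k).
binFreq = diff([0 F(cutIdx) F(end)]);
```

With AvgFreq equal to 50, the cut indices are [3 5 7] and the bins hold 60, 40, 60, and 40 observations, which is as close to equal as these frequencies allow under the rule.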

The preprocessing of categorical predictors consists of sorting the categories according to the 'SortCategories' criterion (the default is to sort by odds in increasing order). Sorting is not applied to ordinal predictors. See the Sort Categories definition or the description of the AlgorithmOptions option for 'SortCategories' for more information.

Equal Width

Unsupervised algorithm that divides the range of values in the domain of the predictor variable into a predetermined number of bins of “equal width.” For numeric data, the width is measured as the distance between bin edges. For categorical data, width is measured as the number of categories within a bin.

The EqualWidth option is defined as:

For numeric data, if MinValue and MaxValue are the minimum and maximum data values, then

Width = (MaxValue - MinValue)/NumBins

and the CutPoints are set to MinValue + Width, MinValue + 2*Width, ..., MaxValue - Width. If MinValue or MaxValue has not been specified using the modifybins function, the EqualWidth option sets MinValue and MaxValue to the minimum and maximum values observed in the data.

For categorical data, if there are NumCats original categories, then

Width = NumCats/NumBins

and the cut point indices are set to the rounded values of Width, 2*Width, ..., NumCats - Width, plus 1.
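
As a quick check of both formulas, the sketch below evaluates them for hypothetical inputs (the range, the number of categories, and the numbers of bins are made up for illustration). Reading a categorical cut index as the position of the first category of the next bin is an assumption.

```matlab
% Numeric predictor: hypothetical range and number of bins.
MinValue = 20000;
MaxValue = 70000;
NumBins  = 5;

Width     = (MaxValue - MinValue) / NumBins;
CutPoints = MinValue + Width * (1:NumBins - 1);

% Categorical predictor: hypothetical number of sorted categories.
NumCats    = 7;
NumCatBins = 3;

CatWidth = NumCats / NumCatBins;
CutIdx   = round(CatWidth * (1:NumCatBins - 1)) + 1;
```

The numeric cut points are 30000, 40000, 50000, and 60000, so the last one equals MaxValue - Width. The categorical cut indices are 3 and 6; under the stated assumption, the bins hold categories 1 to 2, 3 to 5, and 6 to 7.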

The preprocessing of categorical predictors consists of sorting the categories according to the 'SortCategories' criterion (the default is to sort by odds in increasing order). Sorting is not applied to ordinal predictors. See the Sort Categories definition or the description of the AlgorithmOptions option for 'SortCategories' for more information.

Sort Categories

As a preprocessing step for categorical data, 'Monotone', 'EqualFrequency', and 'EqualWidth' support the 'SortCategories' input. This serves the purpose of reordering the categories before applying the main algorithm. The default sorting criterion is to sort by 'Odds'. For example, suppose that the data originally looks like this:

        Bin         Good    Bad    Odds 
    ____________    ____    ___    _____

    'Home Owner'    365     177    2.062
    'Tenant'        307     167    1.838
    'Other'         131      53    2.472

After the preprocessing step, the rows would be sorted by 'Odds' and the table looks like this:

        Bin         Good    Bad    Odds 
    ____________    ____    ___    _____

    'Tenant'        307     167    1.838
    'Home Owner'    365     177    2.062
    'Other'         131      53    2.472

The three algorithms only merge adjacent bins, so the initial order of the categories makes a difference for the final binning. The 'None' option for 'SortCategories' would leave the original table unchanged. For a description of the sorting criteria supported, see the description of the AlgorithmOptions option for 'SortCategories'.
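
The sorting criteria amount to ordering the categories by one column of the table. The sketch below applies the 'Odds' and 'Goods' criteria to the three categories above using the MATLAB sort function. It only illustrates the ordering and is not the toolbox code; the variable names are hypothetical.

```matlab
% Categories and counts in their original order.
Categories = {'Home Owner', 'Tenant', 'Other'};
Good = [365 307 131];
Bad  = [177 167  53];

% 'Odds' (default): increasing ratio of Good to Bad.
Odds = Good ./ Bad;
[~, byOdds] = sort(Odds);
SortedByOdds = Categories(byOdds);

% 'Goods': increasing number of Good observations.
[~, byGoods] = sort(Good);
SortedByGoods = Categories(byGoods);
```

Sorting by 'Odds' gives Tenant, Home Owner, Other, which matches the second table. Sorting by 'Goods' gives Other, Tenant, Home Owner, so the adjacent pairs that the algorithms can merge differ between the two criteria.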

Upon the construction of a scorecard, the initial order of the categories, before any algorithm or any binning modifications are applied, is the order shown in the first output of bininfo. If the bins have been modified (either manually with modifybins or automatically with autobinning), use the optional output (cg,'category grouping') from bininfo to get the current order of the categories.

The 'SortCategories' option has no effect on categorical predictors for which the 'Ordinal' parameter is set to true (see the 'Ordinal' input parameter for categorical in MATLAB® categorical arrays). Ordinal data has a natural order, which is honored in the preprocessing step of the algorithms by leaving the order of the categories unchanged. Only categorical predictors whose 'Ordinal' parameter is false (default option) are subject to reordering of categories according to the 'SortCategories' criterion.

Using autobinning with Weights

When observation weights are defined using the optional WeightsVar argument when creating a creditscorecard object, instead of counting the rows that are good or bad in each bin, the autobinning function accumulates the weight of the rows that are good or bad in each bin.

The “frequencies” reported are no longer the basic “count” of rows, but the “cumulative weight” of the rows that are good or bad and fall in a particular bin. Once these “weighted frequencies” are known, all other relevant statistics (Good, Bad, Odds, WOE, and InfoValue) are computed with the usual formulas. For more information, see Credit Scorecard Modeling Using Observation Weights.
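
The difference between counting rows and accumulating weights can be sketched with accumarray. The data below is hypothetical, and the good/bad flag is given directly rather than derived from a response variable. The WOE line assumes the standard Weight-of-Evidence definition, which this page does not spell out.

```matlab
% Hypothetical data: bin number, good/bad flag, and weight per row.
bin    = [1; 1; 2; 2; 2; 3; 3];
isGood = logical([1; 0; 1; 1; 0; 1; 0]);
w      = [0.5; 2; 1; 1.5; 1; 3; 0.5];

% Unweighted frequencies: count the rows in each bin.
GoodCount = accumarray(bin, double(isGood));
BadCount  = accumarray(bin, double(~isGood));

% Weighted frequencies: accumulate the row weights instead.
GoodW = accumarray(bin, w .* isGood);
BadW  = accumarray(bin, w .* ~isGood);

% The remaining statistics use the weighted frequencies unchanged.
OddsW = GoodW ./ BadW;
WOEW  = log((GoodW / sum(GoodW)) ./ (BadW / sum(BadW)));
```

Bin 1 has one good and one bad row, but its weighted frequencies are 0.5 and 2, so its weighted odds are 0.25 rather than 1.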

References

[1] Anderson, R. The Credit Scoring Toolkit. Oxford University Press, 2007.

[2] Refaat, M. Data Preparation for Data Mining Using SAS. Morgan Kaufmann, 2006.

[3] Refaat, M. Credit Risk Scorecards: Development and Implementation Using SAS. lulu.com, 2011.

[4] Thomas, L., et al. Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics, 2002.

Introduced in R2014b
