fitNaiveBayes

Train naive Bayes classifier

Syntax

  • NBModel = fitNaiveBayes(X,Y)
  • NBModel = fitNaiveBayes(X,Y,Name,Value)

Description

NBModel = fitNaiveBayes(X,Y) returns a naive Bayes classifier NBModel, trained using predictors X and class labels Y for K-level classification.

Predict labels for new data by passing the data and NBModel to predict.

NBModel = fitNaiveBayes(X,Y,Name,Value) returns a naive Bayes classifier with additional options specified by one or more Name,Value pair arguments.

For example, you can specify a distribution to model the data, prior probabilities for the classes, or the kernel smoothing window bandwidth.
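For instance, a minimal sketch on hypothetical simulated data (the data and variable names here are illustrative only):

rng(0);                                     % For reproducibility
X = [randn(20,2); randn(20,2)+2];           % Two simulated Gaussian clusters
Y = [ones(20,1); 2*ones(20,1)];             % Class labels 1 and 2
Mdl = fitNaiveBayes(X,Y,'Prior','uniform'); % Name,Value pair sets uniform priors
labels = predict(Mdl,[0 0; 2 2])            % Classify two new observations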

Examples

Train a Naive Bayes Classifier

Load Fisher's iris data set.

load fisheriris
X = meas(:,3:4);
Y = species;
tabulate(Y)
       Value    Count   Percent
      setosa       50     33.33%
  versicolor       50     33.33%
   virginica       50     33.33%

The software can classify data with more than two classes using naive Bayes methods.

Train a naive Bayes classifier.

NBModel = fitNaiveBayes(X,Y)
NBModel = 

Naive Bayes classifier with 3 classes for 2 dimensions.
Feature Distribution(s):normal
Classes:setosa, versicolor, virginica


NBModel is a trained NaiveBayes classifier.

By default, the software models the predictor distribution within each class using a Gaussian distribution with a separate mean and standard deviation for each class and predictor. Use dot notation to display the parameters of a particular Gaussian fit, for example, the fit for the first feature within setosa.

setosaIndex = strcmp(NBModel.ClassLevels,'setosa');
estimates = NBModel.Params{setosaIndex,1}
estimates =

    1.4620
    0.1737

The mean is 1.4620 and the standard deviation is 0.1737.
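As a quick sanity check, these stored estimates should match the sample statistics computed directly from the setosa observations:

idx = strcmp(Y,'setosa');
[mean(X(idx,1)); std(X(idx,1))] % Sample mean and standard deviation of setosa petal length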

Plot the Gaussian contours.

figure
gscatter(X(:,1),X(:,2),Y);
xylim = cell2mat(get(gca,{'Xlim','YLim'})); % Gets current axis limits
hold on
Params = cell2mat(NBModel.Params);
Mu = Params(2*(1:3)-1,1:2); % Extracts the means
Sigma = zeros(2,2,3);
for j = 1:3
    Sigma(:,:,j) = diag(Params(2*j,:).^2); % Squares the standard deviations to form a diagonal covariance matrix
    % Draw contours of the fitted bivariate normal density
    ezcontour(@(x1,x2)mvnpdf([x1,x2],Mu(j,:),Sigma(:,:,j)),...
        xylim+0.5*[-1,1,-1,1])
end
title('Naive Bayes Classifier -- Fisher''s Iris Data')
xlabel('Petal Length (cm)')
ylabel('Petal Width (cm)')
hold off

You can change the default distribution using the name-value pair argument 'Distribution'. For example, if some predictors are count-based, then you can specify that they are multinomial random variables using 'Distribution','mn'.

Specify Predictor Distributions for Naive Bayes Classifiers

Load Fisher's iris data set.

load fisheriris
X = meas;
Y = species;

Train a naive Bayes classifier using every predictor.

NBModel1 = fitNaiveBayes(X,Y);
NBModel1.ClassLevels % Display the class order
NBModel1.Params
NBModel1.Params{1,2}
ans = 

    'setosa'
    'versicolor'
    'virginica'


ans = 

    [2x1 double]    [2x1 double]    [2x1 double]    [2x1 double]
    [2x1 double]    [2x1 double]    [2x1 double]    [2x1 double]
    [2x1 double]    [2x1 double]    [2x1 double]    [2x1 double]


ans =

    3.4280
    0.3791

By default, the software models the predictor distribution within each class as a Gaussian with a separate mean and standard deviation per class and predictor. There are four predictors and three class levels. Each cell in NBModel1.Params contains a numeric vector holding the mean and standard deviation of one fitted distribution; for example, the mean and standard deviation of setosa iris sepal widths are 3.4280 and 0.3791, respectively.
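For example, one way to gather all of the class-conditional means into a single 3-by-4 matrix (a convenience sketch, not required by the workflow) is cellfun:

Mu = cellfun(@(p)p(1),NBModel1.Params) % Element (k,d) is the mean of predictor d in class k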

Estimate the confusion matrix for NBModel1.

predictLabels1 = predict(NBModel1,X);
[ConfusionMat1,labels] = confusionmat(Y,predictLabels1)
ConfusionMat1 =

    50     0     0
     0    47     3
     0     3    47


labels = 

    'setosa'
    'versicolor'
    'virginica'

Element (j,k) of ConfusionMat1 is the number of observations that belong to class j according to the data, but that the software classifies as class k.
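The in-sample misclassification rate follows directly from the off-diagonal counts:

misclassRate1 = 1 - sum(diag(ConfusionMat1))/sum(ConfusionMat1(:)) % (3 + 3)/150 = 0.04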

Retrain the classifier using the Gaussian distribution for predictors 1 and 3 (the sepal and petal lengths), and the default normal kernel density for predictors 2 and 4 (the sepal and petal widths).

NBModel2 = fitNaiveBayes(X,Y,...
    'Distribution',{'normal','kernel','normal','kernel'});
NBModel2.Params{1,2}
ans = 

  KernelDistribution

    Kernel = normal
    Bandwidth = 0.179536
    Support = unbounded

The software does not estimate parameters for the kernel density. Rather, it selects an optimal bandwidth automatically. However, you can specify the bandwidth using the 'KSWidth' name-value pair argument.
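For instance, this sketch (with a bandwidth of 0.5 chosen arbitrarily for illustration) models every predictor with a kernel density and forces a common bandwidth:

NBModel3 = fitNaiveBayes(X,Y,'Distribution','kernel','KSWidth',0.5);
NBModel3.Params{1,1} % Kernel fit for predictor 1 in class setosa; Bandwidth is now 0.5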

Estimate the confusion matrix for NBModel2.

predictLabels2 = predict(NBModel2,X);
ConfusionMat2 = confusionmat(Y,predictLabels2)
ConfusionMat2 =

    50     0     0
     0    47     3
     0     3    47

Based on the confusion matrices, the two classifiers perform similarly in the training sample.

Train Naive Bayes Classifiers Using Multinomial Predictors

Some spam filters classify an incoming email as spam based on how many times a word or punctuation mark (collectively called tokens) occurs in it. The predictors are the frequencies of particular words or punctuation marks in an email. Therefore, the predictors are components of a multinomial random variable.

This example illustrates classification using naive Bayes and multinomial predictors.

Suppose you observed 1000 emails and classified them as spam or not spam. Simulate this by randomly assigning -1 or 1 to y for each email.

n = 1000;                       % Sample size
rng(1);                         % For reproducibility
y = randsample([-1 1],n,true);  % Random labels

To build the predictor data, suppose that there are five tokens in the vocabulary, and 20 observed tokens per email. Generate predictor data from the five tokens by drawing multinomial deviates. The relative frequencies for tokens corresponding to spam emails should differ from emails that are not spam.

tokenProbs = [0.2 0.3 0.1 0.15 0.25;...
    0.4 0.1 0.3 0.05 0.15];             % Token relative frequencies
tokensPerEmail = 20;
X = zeros(n,5);
X(y == 1,:) = mnrnd(tokensPerEmail,tokenProbs(1,:),sum(y == 1));
X(y == -1,:) = mnrnd(tokensPerEmail,tokenProbs(2,:),sum(y == -1));

Train a naive Bayes classifier. Specify that the predictors are multinomial.

NBModel = fitNaiveBayes(X,y,'Distribution','mn');

NBModel is a trained NaiveBayes classifier.

Assess the in-sample performance of NBModel by estimating the misclassification rate.

predSpam = predict(NBModel,X);
misclass = sum(y'~=predSpam)/n
misclass =

    0.0200

The in-sample misclassification rate is 2%.

Randomly generate deviates that represent a new batch of emails.

nOut = 500;
yOut = randsample([-1 1],nOut,true);
XOut = zeros(nOut,5);
XOut(yOut == 1,:) = mnrnd(tokensPerEmail,tokenProbs(1,:),...
    sum(yOut == 1));
XOut(yOut == -1,:) = mnrnd(tokensPerEmail,tokenProbs(2,:),...
    sum(yOut == -1));

Classify the new emails using the trained naive Bayes classifier NBModel, and determine whether the algorithm generalizes.

predSpamOut = predict(NBModel,XOut);
genRate = sum(yOut'~=predSpamOut)/nOut
genRate =

    0.0260

The out-of-sample misclassification rate is 2.6%, indicating that the classifier generalizes fairly well.
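For a less optimistic estimate than the in-sample rate, one option is k-fold cross-validation. The following sketch (assuming 10 folds and the simulated data from above) uses cvpartition:

yc = y(:);                                % Column copy of the labels
cvp = cvpartition(n,'KFold',10);          % 10-fold partition of the 1000 emails
cvLoss = zeros(cvp.NumTestSets,1);
for i = 1:cvp.NumTestSets
    CVMdl = fitNaiveBayes(X(training(cvp,i),:),yc(training(cvp,i)),...
        'Distribution','mn');             % Train on 9 folds
    yHat = predict(CVMdl,X(test(cvp,i),:));
    cvLoss(i) = mean(yHat ~= yc(test(cvp,i))); % Misclassification rate on the held-out fold
end
mean(cvLoss)                              % Cross-validated misclassification rate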

Input Arguments

X — Predictor data
matrix of numeric values

Predictor data to which the naive Bayes classifier is trained, specified as a matrix of numeric values.

Each row of X corresponds to one observation (also known as an instance or example), and each column corresponds to one variable (also known as a feature).

The length of Y and the number of rows of X must be equivalent.

Data Types: double

Y — Class labels
categorical array | character array | logical vector | vector of numeric values | cell array of strings

Class labels to which the naive Bayes classifier is trained, specified as a categorical or character array, logical or numeric vector, or cell array of strings. Each element of Y defines the class membership of the corresponding row of X. Y supports K class levels.

If Y is a character array, then each row must correspond to one class label.

The length of Y and the number of rows of X must be equivalent.

Data Types: cell | char | double | logical

    Note:   The software treats NaN, empty string (''), and <undefined> elements as missing values.

    • If Y contains missing values, then the software removes them and the corresponding rows of X.

    • If X contains any rows composed entirely of missing values, then the software removes those rows and the corresponding elements of Y.

    • If X contains missing values and you set 'Distribution','mn', then the software removes those rows of X and the corresponding elements of Y.

    • If a predictor is not represented in a class, that is, if all of its values are NaN within a class, then the software returns an error.

    Removing rows of X and corresponding elements of Y decreases the effective training or cross-validation sample size. The sketch following this note illustrates the removal behavior.
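A minimal sketch of the removal rule on hypothetical data (the empty string marks a missing label):

Xm = [randn(5,1); randn(6,1)+3];                   % 11 observations of one predictor
Ym = [repmat({'a'},5,1); {''}; repmat({'b'},5,1)]; % The sixth label is missing
MdlM = fitNaiveBayes(Xm,Ym);                       % Trains on the 10 labeled rows only
MdlM.Prior                                         % Empirical priors [0.5 0.5]: 5 labeled 'a', 5 labeled 'b'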

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Distribution','mn','Prior','uniform','KSWidth',0.5 specifies the following: the data distribution is multinomial, the prior probabilities for all classes are equal, and the kernel smoothing window bandwidth for all classes is 0.5 units.

'Distribution' — Data distributions
'normal' (default) | 'kernel' | 'mn' | 'mvmn' | cell array of strings

Data distributions fitNaiveBayes uses to model the data, specified as the comma-separated pair consisting of 'Distribution' and a string or cell array of strings.

This table summarizes the available distributions.

Value       Description
'kernel'    Kernel smoothing density estimate.
'mn'        Multinomial distribution. If you specify 'mn', then all features are components of a multinomial distribution. Therefore, you cannot include 'mn' as an element of a cell array of strings. For details, see Algorithms.
'mvmn'      Multivariate multinomial distribution. For details, see Algorithms.
'normal'    Normal (Gaussian) distribution.

If you specify a string, then the software models all the features using that distribution. If you specify a 1-by-D cell array of strings, where D is the number of predictors, then the software models feature j using the distribution in element j of the cell array.

Example: 'Distribution',{'kernel','normal'}

Data Types: cell | char

'KSSupport' — Kernel smoothing density support
'unbounded' (default) | 'positive' | cell array | numeric row vector

Kernel smoothing density support, specified as the comma-separated pair consisting of 'KSSupport' and a numeric row vector, a string, or a cell array. The software applies the kernel smoothing density to this region.

If you do not specify 'Distribution','kernel', then the software ignores the values of 'KSSupport', 'KSType', and 'KSWidth'.

This table summarizes the available options for setting the kernel smoothing density region.

Value                        Description
1-by-2 numeric row vector    For example, [L,U], where L and U are the finite lower and upper bounds, respectively, for the density support.
'positive'                   The density support is all positive real values.
'unbounded'                  The density support is all real values.

If you specify a 1-by-D cell array, with each cell containing any value in the table, then the software trains the classifier using the kernel support in cell j for feature j in X.

Example: 'KSSupport',{[-10,20],'unbounded'}

Data Types: cell | char | double

'KSType' — Kernel smoother type
'normal' (default) | 'box' | 'epanechnikov' | 'triangle' | cell array of strings

Kernel smoother type, specified as the comma-separated pair consisting of 'KSType' and a string or cell array of strings.

If you do not specify 'Distribution','kernel', then the software ignores the values of 'KSSupport', 'KSType', and 'KSWidth'.

This table summarizes the available kernel smoother types. Let I{u} denote the indicator function.

Value            Kernel          Formula
'box'            Box (uniform)   f(x) = 0.5 I{|x| ≤ 1}
'epanechnikov'   Epanechnikov    f(x) = 0.75 (1 − x²) I{|x| ≤ 1}
'normal'         Gaussian        f(x) = (1/√(2π)) exp(−0.5x²)
'triangle'       Triangular      f(x) = (1 − |x|) I{|x| ≤ 1}

If you specify a 1-by-D cell array, with each cell containing any value in the table, then the software trains the classifier using the kernel smoother type in cell j for feature j in X.

Example: 'KSType',{'epanechnikov','normal'}

Data Types: cell | char

'KSWidth' — Kernel smoothing window bandwidth
matrix of numeric values (default) | numeric column vector | numeric row vector | scalar | structure array

Kernel smoothing window bandwidth, specified as the comma-separated pair consisting of 'KSWidth' and a matrix of numeric values, numeric row vector, numeric column vector, scalar, or structure array.

If you do not specify 'Distribution','kernel', then the software ignores the values of 'KSSupport', 'KSType', and 'KSWidth'.

Suppose there are K class levels and D predictors. This table summarizes the available options for setting the kernel smoothing window bandwidth.

Value                             Description
K-by-D matrix of numeric values   Element (k,d) specifies the bandwidth for predictor d in class k.
K-by-1 numeric column vector      Element k specifies the bandwidth for all predictors in class k.
1-by-D numeric row vector         Element d specifies the bandwidth in all class levels for predictor d.
scalar                            Specifies the bandwidth for all features in all classes.
structure array                   A structure array S containing class levels and their bandwidths. S must have two fields:

  • S.width: A numeric row vector of bandwidths, or a matrix of numeric values with D columns.

  • S.group: A vector of the same type as Y, containing unique class levels indicating the class for the corresponding element of S.width.
By default, the software automatically selects a bandwidth for each combination of feature and class, using a value that is optimal for a Gaussian distribution.

Example: 'KSWidth',struct('width',[0.5,0.25],'group',{{'b';'g'}})

Data Types: double | struct

'Prior' — Class prior probabilities
'empirical' (default) | 'uniform' | numeric vector | structure array

Class prior probabilities, specified as the comma-separated pair consisting of 'Prior' and a numeric vector, structure array, or string.

This table summarizes the available options for setting prior probabilities.

Value             Description
'empirical'       The software uses the class relative frequencies for the prior probabilities.
numeric vector    A numeric vector of length K specifying the prior probabilities for each class. The order of the elements of Prior should correspond to the order of the class levels. For details on the order of the classes, see Algorithms. The software normalizes the prior probabilities to sum to 1.
structure array   A structure array S containing class levels and their prior probabilities. S must have two fields:

  • S.prob: A numeric vector of prior probabilities. The software normalizes the prior probabilities to sum to 1.

  • S.group: A vector of the same type as Y containing unique class levels indicating the class for the corresponding element of S.prob. S.group must contain all K levels in Y. It can also contain classes that do not appear in Y, which can be useful if X is a subset of a larger training set. The software ignores any classes that appear in S.group but not in Y.

'uniform'         The prior probabilities are equal for all classes.

Example: 'Prior',struct('prob',[1,2],'group',{{'b';'g'}})

Data Types: char | double | struct

Output Arguments

NBModel — Trained naive Bayes classifier
NaiveBayes classifier

Trained naive Bayes classifier, returned as a NaiveBayes classifier.

More About

Bag-of-Tokens Model

In the bag-of-tokens model, the value of predictor j is the nonnegative number of occurrences of token j in this observation. The number of categories (bins) in this multinomial model is the number of distinct tokens, that is, the number of predictors.
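For instance, with a hypothetical three-token vocabulary {buy, now, free}, the email "buy now buy" becomes a single row of counts:

x = [2 1 0]   % Occurrences of the tokens 'buy', 'now', and 'free' in one email
sum(x)        % Total tokens observed, that is, the number of multinomial trials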

Tips

  • For classifying count-based data, such as the bag-of-tokens model, use the multinomial distribution (e.g., set 'Distribution','mn').

  • The following list defines the order of the classes. Knowing this order is useful when you specify prior probabilities by setting 'Prior',prior, where prior is a numeric vector.

    • If Y is a categorical array, then the order of the class levels matches the output of categories(Y).

    • If Y is a numeric or logical vector, then the order of the class levels matches the output of sort(unique(Y)).

    • For cell arrays of strings and character arrays, the order of the class labels is the order in which each label first appears in Y.

Algorithms

  • If you specify 'Distribution','mn', then the software considers each observation as multiple trials of a multinomial distribution, and considers each occurrence of a token as one trial (see Bag-of-Tokens Model).

  • If you specify 'Distribution','mvmn', then the software assumes each individual predictor follows a multinomial model within a class. The parameters for a predictor include the probabilities of all possible values that the corresponding feature can take, as the sketch below illustrates.
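A minimal sketch of the 'mvmn' option on simulated categorical-coded data (variable names and values are hypothetical):

rng(2);                   % For reproducibility
Xc = randi(3,100,2);      % Two categorical predictors with levels 1, 2, and 3
Yc = randi(2,100,1);      % Two classes, labeled 1 and 2
Mdl = fitNaiveBayes(Xc,Yc,'Distribution','mvmn');
Mdl.Params{1,1}           % Estimated probability of each level of predictor 1 in class 1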
