MATLAB Examples

Speaker Identification Using Pitch and MFCC

This example demonstrates a machine learning approach to identify people based on features extracted from recorded speech. The features used to train the classifier are: pitch of the voiced segments of the speech, and the Mel Frequency Cepstrum Coefficients (MFCC). This is a closed-set speaker identification - the audio of the speaker under test is compared against all the available speaker models (a finite set) and the closest match is returned.

This example requires Statistics and Machine Learning Toolbox.

Introduction

The approach used in this example for speaker identification is shown in the diagram.

Pitch and Mel Frequency Cepstrum Coefficients (MFCC) are extracted from speech signals recorded for 10 speakers. These features are used to train a K-Nearest Neighbor (KNN) classifier. Then, new speech signals that need to be classified go through the same feature extraction. The trained KNN classifier predicts which one of the ten speakers is the closest match.

Features Used for Classification

This section discusses pitch and Mel Frequency Cepstrum Coefficients (MFCCs), the two features that are used to classify speakers.

Pitch

Speech can be broadly categorized as voiced and unvoiced. In the case of voiced speech, air from the lungs is modulated by vocal cords and results in a quasi-periodic excitation. The resulting sound is dominated by a relatively low-frequency oscillation, referred to as pitch. In the case of unvoiced speech, air from the lungs passes through a constriction in the vocal tract and becomes a turbulent, noise-like excitation. In the source-filter model of speech, the excitation is referred to as the source, and the vocal tract is referred to as the filter. Characterizing the source is an important part of characterizing the speech system.

As an example of voiced and unvoiced speech, consider a time-domain representation of the word "two" (/T UW/). The consonant /T/ (unvoiced speech) looks like noise, while the vowel /UW/ (voiced speech) is characterized by a strong fundamental frequency.

[audioIn, fs] = audioread('Counting-16-44p1-mono-15secs.wav');
twoStart = 110e3;   % start sample of the extracted word "two"
twoStop = 135e3;    % end sample of the extracted word "two"
audioIn = audioIn(twoStart:twoStop);
timeVector = linspace((twoStart/fs),(twoStop/fs),numel(audioIn));
figure;
plot(timeVector,audioIn);
axis([(twoStart/fs) (twoStop/fs) -1 1]);
ylabel('Amplitude');
xlabel('Time (s)');
title('Utterance - Two')
sound(audioIn,fs);

The simplest method to distinguish between voiced and unvoiced speech is to analyze the zero crossing rate. A large number of zero crossings implies that there is no dominant low-frequency oscillation.
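
Below is a minimal sketch of a frame-based zero crossing rate on the segment loaded above (the 30 ms frame size is an illustrative assumption, and buffer requires Signal Processing Toolbox):

frameLength = round(0.03*fs);                            % 30 ms frames
frames = buffer(audioIn,frameLength);                    % non-overlapping frames
zcr = sum(abs(diff(sign(frames))) > 0)/(frameLength-1);  % crossings per sample
% A high rate suggests an unvoiced frame; a low rate suggests voiced speech.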

Once you isolate a region of voiced speech, you can characterize it by estimating the pitch. This example uses audiopluginexample.SpeechPitchDetector to estimate the pitch, with its default normalized autocorrelation approach.
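
As a rough illustration of the normalized autocorrelation idea, the following sketch estimates the pitch of a single 30 ms frame with xcorr. The frame location and the 50-400 Hz search range are illustrative assumptions; the example itself relies on audiopluginexample.SpeechPitchDetector below.

frameLength = round(0.03*fs);
frame = audioIn(round(end/2)+(1:frameLength));   % a frame from the voiced vowel
[r,lags] = xcorr(frame,'coeff');                 % normalized autocorrelation
r = r(lags>=0); lags = lags(lags>=0);            % keep non-negative lags
searchRange = lags >= round(fs/400) & lags <= round(fs/50);  % 50-400 Hz
[~,idx] = max(r(searchRange));
candidateLags = lags(searchRange);
f0 = fs/candidateLags(idx)                       % pitch estimate in Hz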

Apply pitch detection to the word "two" to see how pitch changes over time. This is known as the pitch contour, and is characteristic to a speaker.

pD = audiopluginexample.SpeechPitchDetector;   % create the pitch detector
[~,pitch] = process(pD,audioIn);               % estimate the pitch contour

figure;
subplot(2,1,1);
plot(timeVector,audioIn);
axis([(110e3/fs) (135e3/fs) -1 1])
ylabel('Amplitude')
xlabel('Time (s)')
title('Utterance - Two')

subplot(2,1,2)
plot(timeVector,pitch,'*')
axis([(110e3/fs) (135e3/fs) 80 140])
ylabel('Pitch (Hz)')
xlabel('Time (s)');
title('Pitch Contour');

Mel-Frequency Cepstrum Coefficients (MFCC)

Mel-Frequency Cepstrum Coefficients (MFCC) are popular features extracted from speech signals for use in recognition tasks. In the source-filter model of speech, MFCCs are understood to represent the filter (vocal tract). The frequency response of the vocal tract is relatively smooth, whereas the source of voiced speech can be modeled as an impulse train. The result is that the vocal tract can be estimated by the spectral envelope of a speech segment.

The motivating idea of MFCC is to compress information about the vocal tract (smoothed spectrum) into a small number of coefficients based on an understanding of the cochlea.

Although there is no hard standard for calculating MFCC, the basic steps are outlined by the diagram.
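
The following is a minimal sketch of those basic steps (windowing, power spectrum, mel filterbank, log compression, DCT) for one 30 ms frame of the segment loaded above. The crude triangular filterbank constructed here, and the choice of 20 filters, are illustrative stand-ins rather than the filterbank the example actually uses:

N = round(0.03*fs);
frame = audioIn(1:N).*hamming(N);                % window one frame
spec  = abs(fft(frame)).^2;
spec  = spec(1:floor(N/2)+1);                    % one-sided power spectrum
numFilters = 20;
hz2mel = @(hz) 2595*log10(1+hz/700);             % Hz <-> mel conversions
mel2hz = @(mel) 700*(10.^(mel/2595)-1);
edgesHz  = mel2hz(linspace(0,hz2mel(fs/2),numFilters+2));  % band edges
binFreqs = (0:floor(N/2))*fs/N;                  % FFT bin frequencies
melFB = zeros(numFilters,numel(binFreqs));
for k = 1:numFilters                             % triangular filters
    rise = (binFreqs-edgesHz(k))/(edgesHz(k+1)-edgesHz(k));
    fall = (edgesHz(k+2)-binFreqs)/(edgesHz(k+2)-edgesHz(k+1));
    melFB(k,:) = max(0,min(rise,fall));
end
coeffs = dct(log(melFB*spec + eps));             % log filterbank energies -> DCT
coeffs = coeffs(1:13);                           % keep the first 13 coefficients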

The mel filterbank linearly spaces the first 10 triangular filters and logarithmically spaces the remaining filters. The individual bands are weighted for even energy. Below is a visualization of a typical mel filterbank.

This example uses mfcc to calculate the MFCCs for every file.

A speech signal is dynamic in nature and changes over time. Speech is assumed to be stationary on short time scales, so it is processed in windows of 20-40 ms. This example uses a 30 ms window with 75% overlap.
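
Below is a minimal sketch of that framing, assuming the Signal Processing Toolbox buffer function and a signal audioIn sampled at fs:

frameLength   = round(0.03*fs);                    % 30 ms frames
overlapLength = round(0.75*frameLength);           % 75% overlap
frames = buffer(audioIn,frameLength,overlapLength,'nodelay');
% Each column of frames is one analysis window passed to feature extraction.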

Data Set

This example uses the Census Database (also known as AN4 Database) from the CMU Robust Speech Recognition Group [1]. The data set contains recordings of male and female subjects speaking words and numbers. The helper function in this section downloads it for you and converts the raw files to flac. The speech files are partitioned into subdirectories based on the labels corresponding to the speakers. If you are unable to download it, you can load a table of features from HelperAN4TrainingFeatures.mat and proceed directly to the Training a Classifier section. The features have been extracted from the same data set.

Download and extract the speech files for 10 speakers (5 female and 5 male) into a temporary directory using the HelperAN4Download function.

dataDir = HelperAN4Download; % Path to data directory

Create an audioexample.Datastore object to easily manage this database for training. The datastore lets you collect the files of a given format and read them.

ads = audioexample.Datastore(dataDir, 'IncludeSubfolders', true,...
    'FileExtensions', '.flac', 'ReadMethod','File',...
    'LabelSource','foldernames')
ads = 
  Datastore with properties:

              Files: {
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\an36-fejs-b.flac';
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\an37-fejs-b.flac';
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\an38-fejs-b.flac'
                      ... and 122 more
                     }
             Labels: [fejs; fejs; fejs ... and 122 more categorical]
        ReadMethod: 'File'
    OutputDataType: 'double'

The splitEachLabel method of audioexample.Datastore splits the datastore into two or more datastores. The resulting datastores have the specified proportion of the audio files from each label. In this example, the datastore is split into two parts. 80% of the data for each label is used for training, and the remaining 20% is used for testing. The countEachLabel method of audioexample.Datastore is used to count the number of audio files per label. In this example, the label identifies the speaker.

[trainDatastore, testDatastore]  = splitEachLabel(ads,0.80);

Display the train datastore and the number of audio files per speaker.

trainDatastore
trainDatastoreCount = countEachLabel(trainDatastore)
trainDatastore = 
  Datastore with properties:

              Files: {
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\an36-fejs-b.flac';
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\an37-fejs-b.flac';
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\an38-fejs-b.flac'
                      ... and 94 more
                     }
             Labels: [fejs; fejs; fejs ... and 94 more categorical]
        ReadMethod: 'File'
    OutputDataType: 'double'
trainDatastoreCount =
  10×2 table
    Label    Count
    _____    _____
    fejs      10  
    fmjd      10  
    fsrb      10  
    ftmj      10  
    fwxs      10  
    mcen      10  
    mrcb      10  
    msjm      10  
    msjr      10  
    msmn       7  

Display the test datastore and the number of audio files per speaker.

testDatastore
testDatastoreCount = countEachLabel(testDatastore)
testDatastore = 
  Datastore with properties:

              Files: {
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\cen6-fejs-b.flac';
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\cen7-fejs-b.flac';
                     ' ...\jibrahim\AppData\Local\Temp\an4\wav\flacData\fejs\cen8-fejs-b.flac'
                      ... and 25 more
                     }
             Labels: [fejs; fejs; fejs ... and 25 more categorical]
        ReadMethod: 'File'
    OutputDataType: 'double'
testDatastoreCount =
  10×2 table
    Label    Count
    _____    _____
    fejs       3  
    fmjd       3  
    fsrb       3  
    ftmj       3  
    fwxs       2  
    mcen       3  
    mrcb       3  
    msjm       3  
    msjr       3  
    msmn       2  

To preview the content of your datastore, read a sample file and play it using your default audio device.

[sampleTrain, info] = read(trainDatastore);
sound(sampleTrain,info.SampleRate)

Reading from the train datastore advances the read pointer, which lets you iterate through the database. Reset the train datastore to return the read pointer to the start before the following feature extraction.

reset(trainDatastore);

Feature Extraction

Pitch and MFCC features are extracted from each frame using HelperComputePitchAndMFCC, which performs the following actions on the data read from each audio file:

  1. Collect the samples into frames of 30 ms with an overlap of 75%.
  2. For each frame, use audiopluginexample.SpeechPitchDetector.isVoicedSpeech to decide whether the samples correspond to a voiced speech segment.
  3. Compute the pitch and 13 MFCCs (with the first MFCC coefficient replaced by log-energy of the audio signal) for the entire file.
  4. Keep the pitch and MFCC information pertaining to the voiced frames only.
  5. Get the directory name for the file. This corresponds to the name of the speaker and will be used as a label for training the classifier.

HelperComputePitchAndMFCC returns a table containing the filename, pitch, MFCCs, and label (speaker name) as columns for each 30 ms frame.

lenDataTrain = length(trainDatastore.Files);
features = cell(lenDataTrain,1);
for i = 1:lenDataTrain
    [dataTrain, infoTrain] = read(trainDatastore);
    features{i} = HelperComputePitchAndMFCC(dataTrain,infoTrain);
end
features = vertcat(features{:});
features = rmmissing(features);
head(features)   % Display the first few rows
ans =
  8×16 table
         Filename         Pitch      MFCC1     MFCC2      MFCC3      MFCC4      MFCC5       MFCC6        MFCC7       MFCC8       MFCC9      MFCC10       MFCC11      MFCC12      MFCC13      Label 
    __________________    ______    _______    ______    _______    _______    _______    _________    _________    ________    _______    _________    ________    ________    _________    ______
    'an36-fejs-b.flac'    237.44    -4.4218    3.3816    0.73331    0.98626    0.47093      0.13808    -0.083348    0.069072     0.2345       0.3403    -0.14417    -0.15685     0.022186    'fejs'
    'an36-fejs-b.flac'    242.42    -4.3104    4.7899    0.80432     0.7148    0.46027     0.032963     -0.28647     0.38366     0.1449    0.0093271     -0.2559    -0.17832     -0.11693    'fejs'
    'an36-fejs-b.flac'    231.88    -3.6432    5.0192    0.74801    0.58299    0.50475    -0.014551     -0.32653     0.39201    0.20982     -0.20282    -0.25637    -0.20576     -0.27675    'fejs'
    'an36-fejs-b.flac'    230.89    -3.0934     5.132    0.46794    0.57104    0.64546    -0.085145     -0.22453     0.55408    0.14131     -0.17966    -0.17135    -0.22111     -0.22027    'fejs'
    'an36-fejs-b.flac'    112.49    -2.9718    5.3249    0.48934    0.66976    0.56446     -0.14691     -0.26824      0.4536    0.31515     -0.21356    -0.34067    -0.21872     -0.14108    'fejs'
    'an36-fejs-b.flac'    111.89    -2.6202    5.2746    0.53966    0.55468    0.50989     0.012264     -0.26755      0.3318    0.32108     -0.18096    -0.44212    -0.21208     -0.21385    'fejs'
    'an36-fejs-b.flac'    111.11    -2.6138    5.0492    0.68513    0.40281    0.36792      0.13352     -0.07321     0.25863    0.25314      -0.1787    -0.51149    -0.14679    -0.077431    'fejs'
    'an36-fejs-b.flac'     110.1    -2.4483    5.5192    0.64449    0.44857    0.25178      0.25716     0.042426     0.32466    0.17774       -0.194    -0.70127    -0.16868    -0.041083    'fejs'

Notice that the pitch and MFCC are not on the same scale, which would bias the classifier. Normalize the features by subtracting the mean and dividing by the standard deviation of each column.

featureVectors = features{:,2:15};

m = mean(featureVectors);
s = std(featureVectors);
features{:,2:15} = (featureVectors-m)./s;
head(features)   % Display the first few rows
ans =
  8×16 table
         Filename          Pitch      MFCC1       MFCC2       MFCC3       MFCC4       MFCC5      MFCC6       MFCC7        MFCC8       MFCC9      MFCC10       MFCC11      MFCC12      MFCC13     Label 
    __________________    _______    ________    _______    _________    ________    _______    ________    ________    _________    _______    _________    ________    ________    ________    ______
    'an36-fejs-b.flac'    0.90535     -1.8778    0.11469      0.25866    -0.41449    0.97803    -0.34062    -0.22379    -0.031962    0.62995      0.81708    -0.29036       -0.47    -0.04532    'fejs'
    'an36-fejs-b.flac'    0.98367     -1.8143     1.2196      0.32634    -0.65383     0.9623    -0.52854    -0.61983      0.70533    0.40654    -0.066892    -0.60354    -0.53907    -0.51924    'fejs'
    'an36-fejs-b.flac'    0.81806     -1.4342     1.3996      0.27266    -0.77005     1.0279    -0.61349    -0.69793      0.72491    0.56842      -0.6335    -0.60484    -0.62735     -1.0637    'fejs'
    'an36-fejs-b.flac'     0.8025     -1.1209     1.4881    0.0057061    -0.78058     1.2356     -0.7397    -0.49906       1.1048    0.39759     -0.57164    -0.36655    -0.67672    -0.87127    'fejs'
    'an36-fejs-b.flac'    -1.0579     -1.0516     1.6394     0.026102    -0.69355      1.116    -0.85012    -0.58429      0.86925    0.83105     -0.66218    -0.84113    -0.66903    -0.60151    'fejs'
    'an36-fejs-b.flac'    -1.0674    -0.85121     1.5999     0.074074    -0.79501     1.0355    -0.56555    -0.58294       0.5838    0.84584     -0.57511     -1.1255    -0.64767     -0.8494    'fejs'
    'an36-fejs-b.flac'    -1.0797     -0.8476     1.4231      0.21274    -0.92891    0.82602    -0.34877    -0.20402      0.41231    0.67644     -0.56908     -1.3199    -0.43764    -0.38468    'fejs'
    'an36-fejs-b.flac'    -1.0955    -0.75325     1.7918        0.174    -0.88856    0.65463    -0.12772    0.021446      0.56706    0.48842     -0.60995     -1.8518    -0.50805    -0.26085    'fejs'

Training a Classifier

Now that you have collected features for all ten speakers, you can train a classifier based on them. In this example, you use a K-nearest neighbor classifier defined in HelperTrainKNNClassifier. K-nearest neighbor is a classification technique naturally suited for multi-class classification. The hyperparameters for the nearest neighbor classifier include the number of nearest neighbors, the distance metric used to compute distance to the neighbors, and the weight of the distance metric. The hyperparameters are selected to optimize validation accuracy and performance on the test set. In this example, the number of neighbors is set to 5 and the metric for distance chosen is squared-inverse weighted Euclidean distance. For more information about the classifier, refer to fitcknn.

Train the classifier and print the cross-validation accuracy. crossval and kfoldLoss are used to compute the cross-validation accuracy for the KNN classifier. Use confusionmat to compute the confusion matrix (in percentage) and plot the result.
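
For reference, here is a minimal sketch of what the training inside HelperTrainKNNClassifier might look like, assuming the classifier is fit on the 14 numeric feature columns (the actual helper may differ in details such as the number of cross-validation folds):

predictors = features{:,2:15};            % pitch and 13 MFCCs
responses  = features.Label;              % speaker names
knnModel = fitcknn(predictors,responses, ...
    'NumNeighbors',5, ...
    'Distance','euclidean', ...
    'DistanceWeight','squaredinverse');
cvModel = crossval(knnModel,'KFold',5);   % 5-fold cross-validation
validationAccuracy = 1 - kfoldLoss(cvModel,'LossFun','classiferror');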

[trainedClassifier, validationAccuracy, confMatrix] = ...
    HelperTrainKNNClassifier(features);
fprintf('\nValidation accuracy = %.2f%%\n', validationAccuracy*100);
heatmap(trainedClassifier.ClassNames, trainedClassifier.ClassNames, ...
    confMatrix);
title('Confusion Matrix');
Validation accuracy = 92.76%

You can also use the classificationLearner app to try out and compare various classifiers with your table of features.

Testing the Classifier

In this section, you test the trained KNN classifier with the speech files withheld from training (two or three per speaker) to see how well it behaves with signals that were not used to train it.

Read files, extract features from the test set, and normalize them.

lenDataTest = length(testDatastore.Files);
featuresTest = cell(lenDataTest,1);
for i = 1:lenDataTest
  [dataTest, infoTest] = read(testDatastore);
  featuresTest{i} = HelperComputePitchAndMFCC(dataTest,infoTest);
end
featuresTest = vertcat(featuresTest{:});
featuresTest = rmmissing(featuresTest);
featuresTest{:,2:15} = (featuresTest{:,2:15}-m)./s;
head(featuresTest)   % Display the first few rows
ans =
  8×16 table
         Filename          Pitch      MFCC1       MFCC2       MFCC3       MFCC4       MFCC5       MFCC6       MFCC7      MFCC8      MFCC9     MFCC10     MFCC11      MFCC12      MFCC13     Label 
    __________________    _______    ________    ________    ________    ________    _______    _________    _______    _______    _______    _______    _______    ________    ________    ______
    'cen6-fejs-b.flac'    0.80321     -2.3423    -0.58597    0.037737    -0.54468    0.92912    -0.011486    0.19081    0.52413    0.53979     1.3608     1.4529    0.072385    -0.36157    'fejs'
    'cen6-fejs-b.flac'    0.87164     -1.8071    -0.01602     0.46226    -0.40632    0.58718     -0.16343    0.59531    0.74482     1.1505     1.4122     1.0592     0.68586     -0.2466    'fejs'
    'cen6-fejs-b.flac'    0.71543     -1.5509      0.4368     0.59678    -0.30302    0.16468     -0.36499    0.60088    0.86908     1.0306     1.4249     1.2195      1.2085     -0.3206    'fejs'
    'cen6-fejs-b.flac'    0.60687     -1.3838     0.37673     0.69429    -0.24033    0.29205     -0.27508    0.28735    0.49197    0.99254     1.6082     1.1646     0.89275    -0.34946    'fejs'
    'cen6-fejs-b.flac'    0.51015     -1.2784     0.15729     0.71635    -0.28851    0.85877       0.4355    0.35029    0.21002    0.43132      1.444     1.3727     0.75646     -0.3879    'fejs'
    'cen6-fejs-b.flac'    0.48248     -1.2499    -0.27023     0.35115    -0.59021     1.0208      0.82701    0.63592    0.84897    0.69597    0.86749    0.65908     0.58337    0.050057    'fejs'
    'cen6-fejs-b.flac'    0.48248     -1.1657    -0.66573    0.067037    -0.98629    0.64687      0.70306    0.32048    0.42064    0.57206    0.66513    0.31176     0.61106     0.19282    'fejs'
    'cen6-fejs-b.flac'     0.6509    -0.66006    -0.42247     0.23545    -0.91029    0.64841      0.61746    0.12066     0.0988    0.33263    0.49967    0.18377     0.54699    0.074648    'fejs'

If you didn't download the AN4 database, you can load the table of features for test files from HelperAN4TestingFeatures.mat.

The function HelperTestKNNClassifier performs the following actions for every file in testDatastore:

  1. Read audio samples and compute pitch and MFCC features for each 30 ms frame, as described in the Feature Extraction section.
  2. Predict the label (speaker) for each frame by calling predict on trainedClassifier.
  3. For a given file, predictions are made for every frame. The most frequently occurring label is declared as the predicted speaker for the file. Prediction confidence is computed as the number of frames predicted as that label divided by the total number of voiced frames in the file (see the sketch below).
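
Below is a simplified sketch of that decision rule for a single test file, assuming the classifier accepts the 14 numeric feature columns (the file name and column selection are illustrative; the actual helper may differ):

fileFrames = featuresTest(strcmp(featuresTest.Filename,'cen6-fejs-b.flac'),:);
framePredictions = categorical(predict(trainedClassifier,fileFrames{:,2:15}));
predictedSpeaker = mode(framePredictions)                  % most frequent label
confidencePercentage = 100*mean(framePredictions == predictedSpeaker)
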
result = HelperTestKNNClassifier(trainedClassifier, featuresTest)
result =
  28×4 table
         Filename         ActualSpeaker    PredictedSpeaker    ConfidencePercentage
    __________________    _____________    ________________    ____________________
    'cen6-fejs-b.flac'        fejs               fejs                  94.41       
    'cen6-fmjd-b.flac'        fmjd               fmjd                 75.776       
    'cen6-fsrb-b.flac'        fsrb               fsrb                 56.219       
    'cen6-ftmj-b.flac'        ftmj               ftmj                 58.974       
    'cen6-fwxs-b.flac'        fwxs               fwxs                 71.493       
    'cen6-mcen-b.flac'        mcen               mcen                 74.359       
    'cen6-mrcb-b.flac'        mrcb               mrcb                 79.845       
    'cen6-msjm-b.flac'        msjm               msjm                 60.714       
    'cen6-msjr-b.flac'        msjr               msjr                 62.759       
    'cen7-fejs-b.flac'        fejs               fejs                 76.642       
    'cen7-fmjd-b.flac'        fmjd               fmjd                 74.654       
    'cen7-fsrb-b.flac'        fsrb               fsrb                 71.279       
    'cen7-ftmj-b.flac'        ftmj               ftmj                 37.915       
    'cen7-fwxs-b.flac'        fwxs               fwxs                 71.131       
    'cen7-mcen-b.flac'        mcen               mcen                  76.25       
    'cen7-mrcb-b.flac'        mrcb               mrcb                 78.788       
    'cen7-msjm-b.flac'        msjm               msjm                 59.184       
    'cen7-msjr-b.flac'        msjr               msjr                 78.295       
    'cen7-msmn-b.flac'        msmn               msmn                 72.901       
    'cen8-fejs-b.flac'        fejs               fejs                 85.185       
    'cen8-fmjd-b.flac'        fmjd               fmjd                 57.317       
    'cen8-fsrb-b.flac'        fsrb               fsrb                     50       
    'cen8-ftmj-b.flac'        ftmj               ftmj                 63.547       
    'cen8-mcen-b.flac'        mcen               mcen                 76.812       
    'cen8-mrcb-b.flac'        mrcb               mrcb                 72.561       
    'cen8-msjm-b.flac'        msjm               msjm                 43.165       
    'cen8-msjr-b.flac'        msjr               msjr                 66.667       
    'cen8-msmn-b.flac'        msmn               msmn                 87.991       

The predicted speakers match the expected speakers for all files under test.

The experiment was repeated using an internally developed dataset. The dataset consists of 20 speakers with each speaker speaking multiple sentences from the Harvard sentence list [2]. For 20 speakers, the validation accuracy was found to be 89%.

References

[1] http://www.speech.cs.cmu.edu/databases/an4/

[2] http://en.wikipedia.org/wiki/Harvard_sentences