
Speaker Identification Using Pitch and MFCC

This example demonstrates a machine learning approach to identify people based on features extracted from recorded speech. The features used to train the classifier are the pitch of the voiced segments of the speech and the Mel Frequency Cepstrum Coefficients (MFCC). This is closed-set speaker identification: the audio of the speaker under test is compared against all the available speaker models (a finite set) and the closest match is returned.

This example requires Statistics and Machine Learning Toolbox.

Introduction

The approach used in this example for speaker identification is shown in the diagram.

Pitch and Mel Frequency Cepstrum Coefficients (MFCC) are extracted from speech signals recorded for 10 speakers. These features are used to train a K-Nearest Neighbor (KNN) classifier. Then, new speech signals that need to be classified go through the same feature extraction. The trained KNN classifier predicts which one of the ten speakers is the closest match.

Features Used for Classification

Pitch

Speech can be broadly categorized as voiced and unvoiced. In the case of voiced speech, air from the lungs is modulated by vocal cords and results in a quasi-periodic excitation. The resulting sound is dominated by a relatively low-frequency oscillation, referred to as pitch. In the case of unvoiced speech, air from the lungs passes through a constriction in the vocal tract and becomes a turbulent, noise-like excitation. In the source-filter model of speech, the excitation is referred to as the source, and the vocal tract is referred to as the filter. Characterizing the source is an important part of characterizing the speech system.

As an example of voiced and unvoiced speech, consider a time-domain representation of the word "two" (/T UW/). The consonant /T/ (unvoiced speech) looks like noise, while the vowel /UW/ (voiced speech) is characterized by a strong fundamental frequency.

audioIn = audioread('Counting-16-44p1-mono-15secs.wav');  % 44.1 kHz, 16-bit, mono recording
audioIn = audioIn(110e3:135e3,:);                         % isolate the utterance "two"
timeVector = linspace(0,(25/44.1),numel(audioIn));        % time axis in seconds
plot(timeVector,audioIn);
axis([0 (25/44.1) -1 1]);
ylabel('Amplitude');
xlabel('Time (s)');
title('Utterance - Two')

The simplest method to distinguish between voiced and unvoiced speech is to analyze the zero crossing rate. A large number of zero crossings implies that there is no dominant low-frequency oscillation.
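
As a rough illustration, you can compute a per-frame zero crossing rate for the isolated utterance with a few lines of MATLAB. This is a minimal sketch, not part of the example's helper functions; the 30 ms frame length is an assumption.

% Sketch: per-frame zero crossing rate for audioIn (30 ms frames, no overlap)
frameLength = round(0.03*44100);                 % 30 ms at 44.1 kHz
numFrames   = floor(numel(audioIn)/frameLength);
zcr = zeros(numFrames,1);
for k = 1:numFrames
    frame  = audioIn((k-1)*frameLength+1 : k*frameLength);
    zcr(k) = sum(abs(diff(sign(frame))))/(2*frameLength);  % approximate fraction of samples with a sign change
end
% High values suggest unvoiced (noise-like) frames; low values suggest voiced frames.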

Once you isolate a region of voiced speech, you can characterize it by estimating the pitch. There are several popular approaches to estimating pitch, including time-domain peak-picking, the harmonic product spectrum, cepstral analysis, and autocorrelation. This example uses the autocorrelation approach.

In particular, the autocorrelation algorithm used in this example follows the procedure described in Theory and Applications of Digital Speech Processing [1] (a minimal sketch of the core computation follows the list):

  1. Center clip the speech signal to remove damped oscillations due to the vocal tract response.

  2. Upsample the signal for interpolated resolution.

  3. Perform autocorrelation to determine the lag in samples.

  4. Interpret the lag as the pitch period and then convert the pitch period to pitch in Hz.
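
The following sketch shows the core of steps 3 and 4 for a single voiced frame x at sample rate fs (both assumed to exist in the workspace). It is illustrative only, omits the center clipping and upsampling refinements, and is not the code inside audiopluginexample.SpeechPitchDetector.

% Sketch: autocorrelation pitch estimate for one voiced frame x at sample rate fs
[r,lags] = xcorr(x,'coeff');                % normalized autocorrelation
r = r(lags>=0);                             % keep non-negative lags (r(1) is lag 0)
searchLags = round(fs/400):round(fs/60);    % restrict search to pitch between 60 Hz and 400 Hz
[~,idx] = max(r(searchLags+1));             % strongest peak within the search range
pitchPeriod = searchLags(idx);              % lag in samples = pitch period
pitchHz = fs/pitchPeriod;                   % convert the pitch period to pitch in Hz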

Apply pitch detection to the word "two" to see how pitch changes over time. This is known as the pitch contour, and it is characteristic of a speaker.

pD = audiopluginexample.SpeechPitchDetector;   % autocorrelation-based pitch detector
[~,pitch] = process(pD,audioIn);               % estimate the pitch contour of audioIn

figure;
subplot(2,1,1);
plot(timeVector,audioIn);
axis([0 (25/44.1) -1 1])
ylabel('Amplitude')
xlabel('Time (s)')
title('Utterance - Two')

subplot(2,1,2)
plot(timeVector,pitch,'*')
axis([0 (25/44.1) 80 140])
ylabel('Pitch (Hz)')
xlabel('Time (s)');

Mel-Frequency Cepstrum Coefficients (MFCC)

Mel-Frequency Cepstrum Coefficients (MFCC) are popular features extracted from speech signals for use in recognition tasks. In the source-filter model of speech, MFCCs are understood to represent the filter (vocal tract). The vocal tract frequency response is relatively smooth, whereas the source of voiced speech can be modeled as an impulse train. As a result, the vocal tract response can be estimated by the spectral envelope of a speech segment.

The motivating idea of MFCC is to compress information about the vocal tract (smoothed spectrum) into a small number of coefficients based on an understanding of the cochlea.

Although there is no hard standard for calculating MFCC, the basic steps are outlined in the diagram.

This example uses a single-pole highpass filter for preemphasis and a Hamming window. The mel filterbank linearly spaces the first 10 triangular filters and logarithmically spaces the remaining filters. The individual bands are weighted for even energy.

mfcc = audioexample.MelFrequencyCepstralCoefficients;
figure;
visualize(mfcc);
title('Typical Mel Filterbank')
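
For intuition, the sketch below builds a generic triangular filterbank with filters spaced uniformly on the mel scale and plots it. This is a common textbook construction, not the exact design used by audioexample.MelFrequencyCepstralCoefficients (which spaces the first 10 filters linearly); the sample rate, FFT length, and number of filters are assumptions.

% Sketch: generic triangular filterbank spaced uniformly on the mel scale
fs = 16000; nfft = 512; numFilters = 20;       % assumed analysis parameters
hz2mel = @(f) 2595*log10(1 + f/700);           % Hz to mel
mel2hz = @(m) 700*(10.^(m/2595) - 1);          % mel to Hz
hzEdges  = mel2hz(linspace(hz2mel(0), hz2mel(fs/2), numFilters+2));  % filter edge frequencies
binEdges = floor((nfft+1)*hzEdges/fs) + 1;     % corresponding FFT bin indices
fbank = zeros(numFilters, nfft/2+1);
for m = 1:numFilters
    lo = binEdges(m); cen = binEdges(m+1); hi = binEdges(m+2);
    fbank(m, lo:cen) = linspace(0, 1, cen-lo+1);   % rising edge of triangle m
    fbank(m, cen:hi) = linspace(1, 0, hi-cen+1);   % falling edge of triangle m
end
figure;
plot(linspace(0, fs/2, nfft/2+1), fbank.');
title('Sketch of a Mel-Spaced Triangular Filterbank')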

Generally, a speech signal is analyzed in 20-40 ms windows with overlap. This example uses a 30 ms window with a 75% overlap.
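
For example, assuming a signal x and its sample rate fs are in the workspace, frames of this size can be produced with buffer from Signal Processing Toolbox. This is a sketch; the helper functions in this example perform their own framing.

% Sketch: split a signal x into 30 ms frames with 75% overlap (assumes sample rate fs)
frameLength   = round(0.03*fs);               % 30 ms of samples
overlapLength = round(0.75*frameLength);      % 75% overlap between adjacent frames
frames = buffer(x, frameLength, overlapLength, 'nodelay');  % one frame per column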

Data Set

This example uses the Census Database (also known as the AN4 Database) from the CMU Robust Speech Recognition Group. The data set contains recordings of male and female subjects speaking words and numbers. You can download it from [2]. The speech files are partitioned into subdirectories based on the labels corresponding to the speakers. If you are unable to download it, you can load a table of features from HelperAN4TrainingFeatures.mat and proceed directly to the Training a Classifier section. The features have been extracted from the same data set.

Download and extract the speech files for 10 speakers (5 female and 5 male) into a temporary directory.

d = HelperAN4Download;  % Path to the directory named 'an4' in the database
Downloading AN4 dataset... done.
Reducing dataset to 5 females and 5 males... done.

Create a datastore to easily manage this database for training. The datastore lets you collect the files of a given format and run a custom read function on them. Remove files that start with 'cen1-' and 'cen2-' because they will be used later for testing the classifier.

trainingDatabase = datastore(d, 'IncludeSubfolders', true,...
    'FileExtensions', '.raw', 'Type', 'file', 'UniformRead', true, ...
    'ReadFcn', @HelperComputePitchAndMFCC);
fidx = cellfun(@(x) contains(x,{'cen1-','cen2-'}), trainingDatabase.Files);
trainingDatabase.Files(fidx,:) = []
trainingDatabase = 

  FileDatastore with properties:

          Files: {
                 ' ...\Temp\an4\wav\an4_clstk\fash\an251-fash-b.raw';
                 ' ...\Temp\an4\wav\an4_clstk\fash\an253-fash-b.raw';
                 ' ...\Temp\an4\wav\an4_clstk\fash\an254-fash-b.raw'
                  ... and 102 more
                 }
    UniformRead: 1
        ReadFcn: @HelperComputePitchAndMFCC

Feature Extraction

When you read from the files in trainingDatabase, the datastore uses HelperComputePitchAndMFCC to perform the following actions on each audio file:

  1. Read audio samples (stored as 16-bit integers) and convert them to double precision.

  2. Collect the samples read from the file into frames of 30 ms with an overlap of 75%.

  3. For each frame, use audiopluginexample.SpeechPitchDetector.isVoicedSpeech to decide whether the samples correspond to a voiced speech segment.

  4. For the voiced speech segments, compute the pitch using audiopluginexample.SpeechPitchDetector.autoCorrelationPitchDecision.

  5. For the voiced speech segments, compute 13 MFCCs using audioexample.MelFrequencyCepstralCoefficients.

  6. Get the directory name for the file. This corresponds to the name of the speaker and will be used as a label for training the classifier.

HelperComputePitchAndMFCC returns a table with one row per 30 ms frame and columns containing the filename, pitch, MFCCs, and label. The readall function loops over the files in trainingDatabase and creates a single table by concatenating the output of HelperComputePitchAndMFCC row-wise.
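
Conceptually, the per-file output could be assembled as sketched below. The variable names (pitchVals, mfccVals, fileName, speakerLabel) are hypothetical, and this is not the actual helper code.

% Sketch: assemble a per-frame feature table for one file (hypothetical variable names)
% pitchVals is N-by-1 and mfccVals is N-by-13 for the N frames of one file
featureNames = [{'Pitch'}, arrayfun(@(k) sprintf('MFCC%d',k), 1:13, 'UniformOutput', false)];
T = array2table([pitchVals, mfccVals], 'VariableNames', featureNames);
T.Filename = repmat({fileName}, height(T), 1);       % file the frames came from
T.Label    = repmat({speakerLabel}, height(T), 1);   % speaker (directory) name
% Note: the actual helper places Filename first and Label last, as shown in the table below.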

features = readall(trainingDatabase);
features = rmmissing(features);
head(features)   % Display the first few rows
ans =

  8×16 table

         Filename         Pitch      MFCC1      MFCC2       MFCC3     MFCC4      MFCC5        MFCC6        MFCC7        MFCC8       MFCC9       MFCC10       MFCC11       MFCC12       MFCC13      Label 
    __________________    ______    _______    ________    _______    ______    ________    __________    ________    _________    ________    _________    _________    _________    _________    ______

    'an251-fash-b.raw'    210.87    -4.0064     -1.0502      2.213    1.5944     0.16356       0.59052    -0.19603      0.45179    -0.76136    -0.016874    -0.041221     -0.25254    0.0019384    'fash'
    'an251-fash-b.raw'    214.41    -3.5244     -1.7537     1.7664    1.8763     0.11157        0.3083    -0.21331      0.36675    -0.55982      0.22364     -0.16682     -0.34883     -0.15943    'fash'
    'an251-fash-b.raw'    213.33    -2.9804     -1.5177     1.4508    1.5171    0.045945       0.22282     -0.1571      0.33173    -0.56554     0.056569     -0.24292     -0.44889     -0.21427    'fash'
    'an251-fash-b.raw'    213.33    -2.4698     -1.4726     1.3716    1.5239    0.053013       0.16864     -0.2833       0.1835    -0.69092    -0.050473     -0.24715     -0.31795     -0.15723    'fash'
    'an251-fash-b.raw'    208.81    -2.3291     -1.1027     1.3191    1.6805    -0.35292       0.12771    -0.52106    -0.025441    -0.66446    -0.055634     -0.30019     -0.20288     -0.20319    'fash'
    'an251-fash-b.raw'    212.98    -2.2898    -0.80043    0.91946    1.6846    -0.73457       0.11446    -0.62509      0.15869     -0.5014    -0.029822     -0.36344     -0.21657     -0.21222    'fash'
    'an251-fash-b.raw'    207.46    -2.3984    -0.72929    0.52576    1.5897     -0.7562    -0.0039132    -0.69402      0.22924    -0.57049     0.036349     -0.33666    -0.019235    -0.089926    'fash'
    'an251-fash-b.raw'    202.53    -2.5082    -0.76727    0.28545    1.3773    -0.61141      -0.32315    -0.55417      0.32933    -0.62427      0.15648     -0.16164      0.28895    -0.064155    'fash'

Notice that the pitch and MFCC are not on the same scale, which would bias the classifier. Normalize the features by subtracting the mean and dividing by the standard deviation of each column.

m = mean(features{:,2:15});
s = std(features{:,2:15});
features{:,2:15} = (features{:,2:15}-m)./s;
head(features)   % Display the first few rows
ans =

  8×16 table

         Filename          Pitch      MFCC1       MFCC2      MFCC3      MFCC4      MFCC5        MFCC6       MFCC7        MFCC8       MFCC9      MFCC10       MFCC11      MFCC12      MFCC13      Label 
    __________________    _______    ________    _______    _______    _______    ________    _________    ________    _________    _______    _________    ________    ________    _________    ______

    'an251-fash-b.raw'    0.99232     -1.7968    -1.7525     2.4287    0.38834     0.90167       0.8702    -0.23872       1.1208    -2.1363    -0.056737    -0.22648    -0.64215    -0.037807    'fash'
    'an251-fash-b.raw'      1.066     -1.5036     -2.317      1.959     0.6519     0.82339      0.26278    -0.27317      0.90868    -1.5951      0.57194    -0.62013    -0.95973     -0.61532    'fash'
    'an251-fash-b.raw'     1.0437     -1.1728    -2.1276     1.6269    0.31608     0.72456      0.07882    -0.16111      0.82132    -1.6105      0.13524    -0.85867     -1.2898     -0.81158    'fash'
    'an251-fash-b.raw'     1.0437    -0.86225    -2.0915     1.5436    0.32244     0.73521    -0.037798    -0.41271      0.45149    -1.9471     -0.14456    -0.87194    -0.85787     -0.60746    'fash'
    'an251-fash-b.raw'    0.94924    -0.77669    -1.7946     1.4884    0.46882     0.12393     -0.12589     -0.8867    -0.069769    -1.8761     -0.15805     -1.0382    -0.47833     -0.77192    'fash'
    'an251-fash-b.raw'     1.0363    -0.75277    -1.5521      1.068    0.47271    -0.45079     -0.15442     -1.0941      0.38961    -1.4382    -0.090581     -1.2364    -0.52349     -0.80427    'fash'
    'an251-fash-b.raw'    0.92099    -0.81882     -1.495    0.65392    0.38397    -0.48335     -0.40918     -1.2315      0.56562    -1.6238     0.082384     -1.1525     0.12739     -0.36658    'fash'
    'an251-fash-b.raw'    0.81823    -0.88562    -1.5255    0.40115    0.18543    -0.26532      -1.0963     -0.9527      0.81531    -1.7682      0.39641     -0.6039      1.1439     -0.27435    'fash'

Training a Classifier

Now that you have collected features for all ten speakers, you can train a classifier based on them. In this example, you use a K-nearest neighbor classifier defined in HelperTrainKNNClassifier. The number of neighbors is set to 5 and the metric for distance is squared-inverse weighted Euclidean distance. For more information about the classifier, refer to fitcknn.

Train the classifier and print the accuracy for 5-fold cross-validation. crossval and kfoldLoss are used to compute the cross-validation accuracy for the KNN classifier. Also plot the validation confusion matrix (in percent), computed using confusionmat.

[trainedClassifier, validationAccuracy, confMatrix] = ...
    HelperTrainKNNClassifier(features);
fprintf('\nValidation accuracy = %.2f%%\n', validationAccuracy*100);
heatmap(trainedClassifier.ClassNames, trainedClassifier.ClassNames, ...
    confMatrix);
title('Confusion Matrix');
Validation accuracy = 91.73%
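
For reference, the core of such a helper might look roughly like the following. This is a sketch using fitcknn, crossval, kfoldLoss, kfoldPredict, and confusionmat; the actual HelperTrainKNNClassifier may differ in its details.

% Sketch: train and cross-validate a KNN classifier on the feature table
predictors = features{:,2:15};               % pitch and the 13 MFCCs
response   = features.Label;                 % speaker names
knnModel = fitcknn(predictors, response, ...
    'NumNeighbors', 5, ...
    'Distance', 'euclidean', ...
    'DistanceWeight', 'squaredinverse');
cvModel  = crossval(knnModel, 'KFold', 5);                       % 5-fold cross-validation
accuracy = 1 - kfoldLoss(cvModel, 'LossFun', 'classiferror');    % validation accuracy
[confMat, classNames] = confusionmat(response, kfoldPredict(cvModel));
confMatPercent = 100*confMat./sum(confMat,2);                    % row-normalized, in percent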

You can also use the Classification Learner app to try out and compare various classifiers with your table of features.

Testing the Classifier

In this section, you will test the trained KNN classifier with two speech signals from each of the ten speakers to see how well it behaves with signals that were not used to train it. First, create a datastore with files that were skipped when trainingDatabase was created. Use the same ReadFcn as trainingDatabase because you will extract the same features for these files as well.

testingDatabase = datastore(d, 'IncludeSubfolders', true,...
    'FileExtensions', '.raw', 'Type', 'file', 'UniformRead', true, ...
    'ReadFcn', @HelperComputePitchAndMFCC);
testingDatabase.Files = testingDatabase.Files(fidx,:)
testingDatabase = 

  FileDatastore with properties:

          Files: {
                 ' ...\Temp\an4\wav\an4_clstk\fash\cen1-fash-b.raw';
                 ' ...\Temp\an4\wav\an4_clstk\fash\cen2-fash-b.raw';
                 ' ...\Temp\an4\wav\an4_clstk\fbbh\cen1-fbbh-b.raw'
                  ... and 17 more
                 }
    UniformRead: 1
        ReadFcn: @HelperComputePitchAndMFCC

Read features from the test files and normalize them.

features_test = readall(testingDatabase);
features_test = rmmissing(features_test);
features_test{:,2:15} = (features_test{:,2:15}-m)./s;
head(features_test)   % Display the first few rows
ans =

  8×16 table

        Filename          Pitch       MFCC1        MFCC2        MFCC3       MFCC4        MFCC5       MFCC6      MFCC7       MFCC8      MFCC9       MFCC10      MFCC11      MFCC12      MFCC13      Label 
    _________________    ________    ________    _________    _________    ________    _________    _______    ________    _______    ________    ________    _________    _______    _________    ______

    'cen1-fash-b.raw'     -2.0323     -5.6807      -1.5922      0.48134    -0.58046    -0.061541    0.09925    0.013803    -1.2015    -0.27072    -0.46243      0.08811     1.0781      -0.9913    'fash'
    'cen1-fash-b.raw'    -0.91895    -0.37378    -0.086411    0.0067829     0.14843      -1.1016    -1.3993    -0.56533    0.48834     -1.5904     0.86476      0.46134     1.1291      0.41947    'fash'
    'cen1-fash-b.raw'      1.3195       -0.19     -0.98955      0.25292     0.25143       -1.471    -1.8266    -0.55482    0.72272    -0.88191      1.1742      0.13419     0.8638      0.54803    'fash'
    'cen1-fash-b.raw'      1.3195    -0.21527      -1.2738       0.3439     0.44032      -1.0095    -1.7379    -0.39505    0.93424    -0.77626      1.1346      0.18927    0.46789     0.079099    'fash'
    'cen1-fash-b.raw'       1.278    -0.22111      -1.3454      0.14081     0.30305     -0.87445    -1.4075    -0.20187    0.79434    -0.65184      1.1313    -0.001341    0.41575      0.24455    'fash'
    'cen1-fash-b.raw'      1.2616     -0.1641      -1.3085      0.13079     0.27347      -1.1728    -1.6946    -0.55035    0.81168    -0.71464      1.4796      0.34542    0.85779     0.076451    'fash'
    'cen1-fash-b.raw'      1.0735    -0.23874      -1.1989     0.040715     0.40416      -1.0138    -1.5503    -0.24224    0.50165     -1.2627       1.448      0.40212     1.2164      0.16663    'fash'
    'cen1-fash-b.raw'      1.2131     -0.2233      -1.0285      0.16732      0.4626       -0.868    -1.8286    -0.38942    0.21822     -1.3066      1.4891     0.073674     1.0005    -0.012193    'fash'

If you didn't download the AN4 database, you can load the table of features for test files from HelperAN4TestingFeatures.mat.

The function HelperTestKNNClassifier performs the following actions for every file in testingDatabase:

  1. Read audio samples and compute pitch and MFCC features for each 30 ms frame, as described in the Feature Extraction section.

  2. Predict the label (speaker) for each frame by calling predict on trainedClassifier.

  3. Find the label that is predicted for a majority of frames and declare it as the predicted speaker for the file. Compute the prediction confidence as the number of frames assigned that label divided by the total number of voiced frames in the file (a minimal sketch of this voting step follows the list).
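
The sketch below illustrates the voting step for a single test file, assuming the classifier predicts from the numeric feature columns. The file chosen for the row filter is just an example; this is not the actual HelperTestKNNClassifier code.

% Sketch: majority vote over the predicted labels of one test file's frames
fileRows = strcmp(features_test.Filename, 'cen1-fash-b.raw');       % rows belonging to one file
frameLabels = predict(trainedClassifier, features_test{fileRows,2:15});
[candidates, ~, idx] = unique(frameLabels);                         % candidate speakers
votes = accumarray(idx, 1);                                         % votes per candidate
[maxVotes, winner] = max(votes);
predictedSpeaker = candidates{winner}                               % majority label
confidencePercentage = 100*maxVotes/numel(frameLabels)              % percent of frames voting for it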

result = HelperTestKNNClassifier(trainedClassifier, features_test)
result =

  20×4 table

        Filename         ActualSpeaker    PredictedSpeaker    ConfidencePercentage
    _________________    _____________    ________________    ____________________

    'cen1-fash-b.raw'    fash             fash                73.184              
    'cen1-fbbh-b.raw'    fbbh             fbbh                    56              
    'cen1-fclc-b.raw'    fclc             fclc                77.061              
    'cen1-fejs-b.raw'    fejs             fejs                87.054              
    'cen1-ffmm-b.raw'    ffmm             ffmm                51.341              
    'cen1-mblb-b.raw'    mblb             mblb                76.437              
    'cen1-mblw-b.raw'    mblw             mblw                36.719              
    'cen1-mbmg-b.raw'    mbmg             mbmg                 77.66              
    'cen1-mcel-b.raw'    mcel             mcel                63.525              
    'cen1-mcen-b.raw'    mcen             mcen                82.967              
    'cen2-fash-b.raw'    fash             fash                35.514              
    'cen2-fbbh-b.raw'    fbbh             fbbh                64.706              
    'cen2-fclc-b.raw'    fclc             fclc                78.378              
    'cen2-fejs-b.raw'    fejs             fejs                86.957              
    'cen2-ffmm-b.raw'    ffmm             ffmm                48.011              
    'cen2-mblb-b.raw'    mblb             mblb                78.082              
    'cen2-mblw-b.raw'    mblw             mblw                44.248              
    'cen2-mbmg-b.raw'    mbmg             mbmg                78.065              
    'cen2-mcel-b.raw'    mcel             mcel                 50.35              
    'cen2-mcen-b.raw'    mcen             mcen                54.362              

The predicted speakers match the expected speakers for all files under test.

References

[1] Rabiner, Lawrence R., and Ronald W. Schafer. Theory and Applications of Digital Speech Processing. Upper Saddle River, NJ: Prentice Hall, 2011.

[2] CMU Robust Speech Recognition Group, Census (AN4) Database. http://www.speech.cs.cmu.edu/databases/an4/
