Deep Learning Speech Recognition

This example shows how to train a simple deep learning model that detects the presence of speech commands in audio. The example uses the Speech Commands Dataset [1] to train a convolutional neural network to recognize a given set of commands.

To run the whole example, you must first download the data set. If you do not want to download the data set or train the network, then you can load a pretrained network by typing load('commandNet.mat') at the command line. Then, go directly to the Detect Commands Using Streaming Audio from Microphone section at the end of the example.

Load Speech Commands Data Set

Download the data set from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz and extract the downloaded archive. Set datafolder to the location of the data. Use audioexample.Datastore to create a datastore that contains the file names and the corresponding labels. Use the folder names as the label source, and specify the read method so that each read returns an entire audio file. Create a copy of the datastore for later use.

datafolder = fullfile(tempdir,'speech_commands_v0.01');

addpath(fullfile(matlabroot,'toolbox','audio','audiodemos'))
ads = audioexample.Datastore(datafolder, ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...
    'LabelSource','foldernames', ...
    'ReadMethod','File')
ads0 = copy(ads);
ads = 

  Datastore with properties:

              Files: {
                     ' ...\Temp\speech_commands_v0.01\_background_noise_\doing_the_dishes.wav';
                     ' ...\Local\Temp\speech_commands_v0.01\_background_noise_\dude_miaowing.wav';
                     ' ...\Local\Temp\speech_commands_v0.01\_background_noise_\exercise_bike.wav'
                      ... and 64724 more
                     }
             Labels: [_background_noise_; _background_noise_; _background_noise_ ... and 64724 more categorical]
        ReadMethod: 'File'
    OutputDataType: 'double'
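
The download step described above can be scripted. The following is a minimal sketch, assuming network access and enough free disk space for the archive; it relies on untar accepting a URL as its first argument and skips the download when the target folder already exists.

```matlab
% Sketch: download and extract the Speech Commands archive if it is not
% already present. Assumes network access; the archive is large.
url = 'http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz';
datafolder = fullfile(tempdir,'speech_commands_v0.01');

if ~exist(datafolder,'dir')
    disp('Downloading and extracting the Speech Commands data set...')
    untar(url,datafolder)   % untar accepts a URL and an output folder
end
```

Checking for the folder first avoids repeating a long download each time the example is rerun.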

Choose Words to Recognize

Specify the words that you want your model to recognize as commands. Label all words that are not commands as unknown. The idea is that these words approximate the distribution of all words other than the commands. To reduce the class imbalance between the known and unknown words, include each unknown word only with a certain probability (probIncludeUnknown). Exclude the longer background noise files located in the _background_noise_ folder.

Use getSubsetDatastore(ads,indices) to create a datastore that only contains the files and labels indexed by indices. Reduce the datastore ads to only contain the commands and the subset of unknown words. Count the number of examples belonging to each class.

commands = ["yes","no","up","down","left","right","on","off","stop","go"];

isCommand = ismember(ads.Labels,categorical(commands));
isUnknown = ~ismember(ads.Labels,categorical([commands,"_background_noise_"]));

probIncludeUnknown = 0.1;
mask = rand(numel(ads.Labels),1) < probIncludeUnknown;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

ads = getSubsetDatastore(ads,isCommand|isUnknown);
countEachLabel(ads)
ans =

  11×2 table

     Label     Count
    _______    _____

    down       2359 
    go         2372 
    left       2353 
    no         2375 
    off        2357 
    on         2367 
    right      2367 
    stop       2380 
    unknown    4143 
    up         2375 
    yes        2377 
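
To see how the random mask reduces the imbalance, the subsampling can be tried on synthetic labels. This is a sketch only: the label set, the number of examples, and the fixed random seed are assumptions made for illustration and are not part of the data set.

```matlab
% Sketch with synthetic labels (not the real data set): keep every command
% example, but keep each non-command example only with probability 0.1.
rng(0)                                    % fixed seed, only for reproducibility
words = ["yes";"no";"bed";"cat";"bird"];  % hypothetical label set
labels = categorical(words(randi(numel(words),5000,1)));
commands = ["yes","no"];

isCommand = ismember(labels,categorical(commands));
isUnknown = ~isCommand & (rand(numel(labels),1) < 0.1);
labels(isUnknown) = categorical("unknown");

% Keep only commands and the sampled unknown words, then count per class.
kept = removecats(labels(isCommand | isUnknown));
disp(table(categories(kept),countcats(kept),'VariableNames',{'Label','Count'}))
```

With three non-command words pooled into one class, a low inclusion probability keeps the unknown class from dominating the commands.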

Split Data into Training, Validation, and Test Sets

The data set folder contains text files that list the sound files to use as the validation and test sets. Because the data set contains multiple utterances of the same word by the same person, it is better to use these predefined sets than to select a random subset of the whole data set. Use the supporting function splitData to split the datastore into training, validation, and test datastores based on the lists of validation and test files in datafolder.

[adsTrain,adsValidation,adsTest] = splitData(ads,datafolder);
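
The supporting function splitData is not listed here. The idea of a list-based split can be sketched as follows, assuming the lists contain one relative path (folder/file.wav) per line; the file names and speaker tags below are hypothetical.

```matlab
% Sketch with hypothetical file names standing in for the datastore files
% and for the contents of the validation and test list files.
files = ["yes/spk1_0.wav"; "yes/spk2_0.wav"; "no/spk1_0.wav"; ...
         "no/spk3_0.wav"; "up/spk2_0.wav"];
validationList = ["yes/spk2_0.wav"; "up/spk2_0.wav"];  % all clips of speaker 2
testList = "no/spk3_0.wav";                            % all clips of speaker 3

% A file belongs to the training set unless one of the lists names it.
isValidation = ismember(files,validationList);
isTest = ismember(files,testList);
isTrain = ~(isValidation | isTest);

fprintf('train: %d, validation: %d, test: %d\n', ...
    nnz(isTrain),nnz(isValidation),nnz(isTest))
```

Because each speaker appears in only one list, no speaker contributes to more than one of the three sets, which is the property a random split would not guarantee.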

Compute Speech Spectrograms

To prepare the data for efficient training of a convolutional neural network, convert the speech waveforms to log-bark auditory spectrograms.

Define the parameters of the spectrogram calculation. segmentDuration is the duration of each speech clip (in seconds). frameDuration is the duration of each frame for spectrogram calculation. hopDuration is the time step between each column of the spectrogram. numBands is the number of log-bark filters and equals the height of each spectrogram.

segmentDuration = 1;
frameDuration = 0.025;
hopDuration = 0.010;
numBands = 40;
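
These parameters determine the size of each spectrogram. The sketch below estimates it, assuming a sampling rate of 16 kHz and that only full frames are kept; the supporting function may pad or trim clips differently.

```matlab
% Sketch: estimate the spectrogram size from the parameters above.
fs = 16e3;                 % assumed sampling rate in Hz
segmentDuration = 1;
frameDuration = 0.025;
hopDuration = 0.010;
numBands = 40;

segmentSamples = round(segmentDuration*fs);   % 16000 samples per clip
frameLength = round(frameDuration*fs);        % 400 samples per frame
hopLength = round(hopDuration*fs);            % 160 samples between frames

% Number of full frames that fit into one clip.
numFrames = floor((segmentSamples - frameLength)/hopLength) + 1;
fprintf('Spectrogram size: %d-by-%d\n',numBands,numFrames)   % 40-by-98
```

Under these assumptions each clip becomes a 40-by-98 image with frequency along the rows and time along the columns.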

Compute the spectrograms for all the training, validation, and test sets by using the supporting function speechSpectrograms. The speechSpectrograms function uses auditorySpectrogram for the spectrogram calculations. To obtain data with a smoother distribution, take the logarithm of the spectrograms using a small offset epsil.

addpath(fullfile(matlabroot,'examples','audio','main'))
epsil = 1e-6;

XTrain = speechSpectrograms(adsTrain,segmentDuration,frameDuration,hopDuration,numBands);
XTrain = log10(XTrain + epsil);

XValidation = speechSpectrograms(adsValidation,segmentDuration,frameDuration,hopDuration,numBands);
XValidation = log10(XValidation + epsil);

XTest = speechSpectrograms(adsTest,segmentDuration,frameDuration,hopDuration,numBands);
XTest = log10(XTest + epsil);

YTrain = adsTrain.Labels;
YValidation = adsValidation.Labels;
YTest = adsTest.Labels;
Computing speech spectrograms...
Processed 1000 files out of 21843
Processed 2000 files out of 21843
Processed 3000 files out of 21843
Processed 4000 files out of 21843
Processed 5000 files out of 21843
Processed 6000 files out of 21843
Processed 7000 files out of 21843
Processed 8000 files out of 21843
Processed 9000 files out of 21843
Processed 10000 files out of 21843
Processed 11000 files out of 21843
Processed 12000 files out of 21843
Processed 13000 files out of 21843
Processed 14000 files out of 21843
Processed 15000 files out of 21843
Processed 16000 files out of 21843
Processed 17000 files out of 21843
Processed 18000 files out of 21843
Processed 19000 files out of 21843
Processed 20000 files out of 21843
Processed 21000 files out of 21843
...done
Computing speech spectrograms...
Processed 1000 files out of 2985
Processed 2000 files out of 2985
...done
Computing speech spectrograms...
Processed 1000 files out of 2997
Processed 2000 files out of 2997
...done
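
The effect of the offset epsil can be checked on a few values. This is a small sketch with made-up spectrogram values; it shows that the offset keeps silent bins finite and bounds the log spectrogram from below at about log10(epsil).

```matlab
% Sketch: effect of the small offset on the logarithm.
epsil = 1e-6;
spec = [0 1e-6 1e-3 1];            % hypothetical spectrogram values
logSpec = log10(spec + epsil);
disp(logSpec)

assert(all(isfinite(logSpec)))     % the offset keeps silent bins finite
assert(isinf(log10(0)))            % without it, a zero bin maps to -Inf
```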

Visualize Data

Plot the waveforms and spectrograms of a few training examples. Play the corresponding audio clips.

specMin = min(XTrain(:));
specMax = max(XTrain(:));
idx = randperm(size(XTrain,4),3);
figure('Units','normalized','Position',[0.2 0.2 0.6 0.6]);
for i = 1:3
    [x,fs] = audioread(adsTrain.Files{idx(i)});
    subplot(2,3,i)
    plot(x)
    axis tight
    title(string(adsTrain.Labels(idx(i))))

    subplot(2,3,i+3)
    spect = XTrain(:,:,1,idx(i));
    pcolor(spect)
    caxis([specMin+2 specMax])
    shading flat

    sound(x,fs)
    pause(2)
end

Neural networks train most easily when their inputs have a reasonably smooth distribution and are normalized. To check that the data distribution is smooth, plot a histogram of the pixel values of the training data.

figure
histogram(XTrain,'EdgeColor','none','Normalization','pdf')
axis tight
ax = gca;
ax.YScale = 'log';
xlabel("Input Pixel Value")
ylabel("Probability Density")

Add Background Noise Data

The network should be able not only to recognize different spoken words, but also to detect whether a word is spoken at all or whether the input contains only background noise.

Use the audio files in the _background_noise_ folder to create samples of one-second clips of background noise. Create an equal number of background clips from each background noise file. You can also create your own recordings of background noise and add them to the _background_noise_ folder. To calculate numBkgClips spectrograms of background clips taken from the audio files in the adsBkg datastore, use the supporting function backgroundSpectrograms. Before calculating spectrograms, the function rescales each audio clip with a factor sampled from a log-uniform distribution in the range given by volumeRange.

Create 4000 background clips and rescale each of them by a number between 1e-4 and 1. XBkg contains spectrograms of background noise with volumes ranging from practically silent to loud.

adsBkg = getSubsetDatastore(ads0, ads0.Labels=="_background_noise_");
numBkgClips = 4000;
volumeRange = [1e-4,1];

XBkg = backgroundSpectrograms(adsBkg,numBkgClips,volumeRange,segmentDuration,frameDuration,hopDuration,numBands);
XBkg = log10(XBkg + epsil);
Computing background spectrograms...
Processed 1000 background clips out of 4000
Processed 2000 background clips out of 4000
Processed 3000 background clips out of 4000
Processed 4000 background clips out of 4000
...done
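
The supporting function backgroundSpectrograms is not listed here. Sampling a scale factor from a log-uniform distribution can be sketched as follows; the variable name gain and the fixed seed are assumptions for illustration.

```matlab
% Sketch: draw volume factors whose logarithm is uniformly distributed
% over the range given by volumeRange.
rng(0)                              % fixed seed, only for reproducibility
volumeRange = [1e-4,1];
numBkgClips = 4000;

logRange = log10(volumeRange);      % [-4 0]
gain = 10.^(logRange(1) + diff(logRange)*rand(numBkgClips,1));

% Each decade between 1e-4 and 1 should hold roughly a quarter of the gains.
counts = histcounts(log10(gain),-4:0);
disp(counts/numBkgClips)
```

Sampling in the log domain gives quiet clips as much weight as loud ones, whereas sampling the factor uniformly between 1e-4 and 1 would almost never produce a near-silent clip.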

Split the spectrograms of background noise over the training, validation, and test sets. Because the _background_noise_ folder only contains about five and a half minutes of background noise, the background samples in the different data sets are highly correlated. To increase the variation in the background noise, you can create your own background files and add them to the folder. To increase the robustness to noise, you can also try mixing background noise into the speech files.

numTrainBkg = floor(0.8*numBkgClips);
numValidationBkg = floor(0.1*numBkgClips);
numTestBkg = floor(0.1*numBkgClips);

XTrain(:,:,:,end+1:end+numTrainBkg) = XBkg(:,:,:,1:numTrainBkg);
XBkg(:,:,:,1:numTrainBkg) = [];
YTrain(end+1:end+numTrainBkg) = "background";

XValidation(:,:,:,end+1:end+numValidationBkg) = XBkg(:,:,:,1:numValidationBkg);
XBkg(:,:,:,1:numValidationBkg) = [];
YValidation(end+1:end+numValidationBkg) = "background";

XTest(:,:,:,end+1:end+numTestBkg) = XBkg(:,:,:,1:numTestBkg);
clear XBkg;
YTest(end+1:end+numTestBkg) = "background";

YTrain = removecats(YTrain);
YValidation = removecats(YValidation);
YTest = removecats(YTest);

Plot the distribution of the different class labels in the training and validation sets. The test set has a very similar distribution to the validation set.

figure('Units','normalized','Position',[0.2 0.2 0.5 0.5]);
subplot(2,1,1)
histogram(YTrain)
title("Training Label Distribution")
subplot(2,1,2)
histogram(YValidation)
title("Validation Label Distribution")

Add Data Augmentation

Create an augmented image datastore for automatic augmentation and resizing of the spectrograms. Translate the spectrogram randomly up to 10 frames (100 ms) forwards or backwards in time, and scale the spectrograms along the time axis up or down by 20 percent. Augmenting the data increases the effective size of the training data somewhat and helps prevent the network from overfitting. The augmented image datastore creates augmented images in real time and inputs these to the network. No augmented spectrograms are saved in memory.

sz = size(XTrain);
specSize = sz(1:2);
imageSize = [specSize 1];
augmenter = imageDataAugmenter(...
    'RandXTranslation',[-10 10],...
    'RandXScale',[0.8 1.2],...
    'FillValue',log10(epsil));
augimdsTrain = augmentedImageDatastore(imageSize,XTrain,YTrain,...
    'DataAugmentation',augmenter,...
    'OutputSizeMode','randcrop');
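
To make the time translation concrete, the sketch below shifts a spectrogram by a random number of frames and fills the exposed columns with log10(epsil). It only illustrates the idea on random data and is not how imageDataAugmenter is implemented; the 40-by-98 size is an assumption.

```matlab
% Sketch: random shift along the time axis with a constant fill value.
epsil = 1e-6;
spec = rand(40,98);                 % stand-in for one log spectrogram
shift = randi([-10 10]);            % in frames; one frame is one 10 ms hop

shifted = log10(epsil)*ones(size(spec));   % fill value for exposed columns
numFrames = size(spec,2);

% Source columns that remain inside the image after shifting.
src = max(1,1-shift):min(numFrames,numFrames-shift);
shifted(:,src+shift) = spec(:,src);
```

Using log10(epsil) as the fill value makes the exposed columns look like silence in the log spectrogram rather than like a new feature.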

Define Neural Network Architecture

Create a simple network architecture as an array of layers. Use convolutional and batch normalization layers, and downsample the feature maps "spatially" (that is, in time and frequency) using max pooling layers. Add a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, which seems reasonable if we expect the network to perform the same classification independent of the exact position of the speech in time. This global pooling also significantly reduces the number of parameters of the final fully connected layer.

To reduce the chance of the network memorizing specific features of the training data, add a small amount of dropout to the inputs to the layers with the largest number of parameters. These layers are the convolutional layers with the largest number of filters. The final convolutional layers have 64*64*3*3 = 36864 weights each (plus biases). The final fully connected layer has 12*5*64 = 3840 weights.
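
The weight counts quoted above can be reproduced by following the feature map size through the pooling layers. This sketch assumes a 40-by-98 input spectrogram and 12 classes (ten commands, unknown, and background); the helper poolOut is hypothetical and applies the usual output size formula for 2-by-2 pooling with stride 2.

```matlab
% Sketch: follow the feature map size through the pooling layers.
poolOut = @(n,pad) floor((n + 2*pad - 2)/2) + 1;   % 2-by-2 pooling, stride 2

sz = [40 98];                                % [frequency time], assumed input size
sz = [poolOut(sz(1),0) poolOut(sz(2),0)];    % no padding: 20-by-49
sz = [poolOut(sz(1),0) poolOut(sz(2),1)];    % time padded by 1: 10-by-25
sz = [poolOut(sz(1),0) poolOut(sz(2),1)];    % time padded by 1: 5-by-13
sz(2) = sz(2) - 13 + 1;                      % [1 13] pooling over time: 5-by-1

numFilters = 64;
numClasses = 12;
convWeights = 3*3*numFilters*numFilters;         % 36864 per final conv layer
fcWeights = numClasses*sz(1)*sz(2)*numFilters;   % 3840
fprintf('conv: %d weights, fully connected: %d weights\n',convWeights,fcWeights)
```

Without the global pooling over time, the fully connected layer would see a 5-by-13 map and need 13 times as many weights.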

Use a weighted cross entropy classification loss. weightedCrossEntropyLayer(classNames,classWeights) creates a custom layer that calculates the weighted cross entropy loss for the classes in classNames using the weights in classWeights. To give each class equal weight in the loss, use class weights that are inversely proportional to the number of training examples of each class. When using the Adam optimizer to train the network, training should be independent of the overall normalization of the class weights.

classNames = categories(YTrain);
classWeights = 1./countcats(YTrain);
classWeights = classWeights/mean(classWeights);
numClasses = numel(classNames);

dropoutProb = 0.2;
layers = [
    imageInputLayer(imageSize)

    convolution2dLayer(3,16,'Padding','same')
    batchNormalizationLayer
    reluLayer

    maxPooling2dLayer(2,'Stride',2)

    convolution2dLayer(3,32,'Padding','same')
    batchNormalizationLayer
    reluLayer

    maxPooling2dLayer(2,'Stride',2,'Padding',[0,1])

    dropoutLayer(dropoutProb)
    convolution2dLayer(3,64,'Padding','same')
    batchNormalizationLayer
    reluLayer

    dropoutLayer(dropoutProb)
    convolution2dLayer(3,64,'Padding','same')
    batchNormalizationLayer
    reluLayer

    maxPooling2dLayer(2,'Stride',2,'Padding',[0,1])

    dropoutLayer(dropoutProb)
    convolution2dLayer(3,64,'Padding','same')
    batchNormalizationLayer
    reluLayer

    dropoutLayer(dropoutProb)
    convolution2dLayer(3,64,'Padding','same')
    batchNormalizationLayer
    reluLayer

    maxPooling2dLayer([1 13])

    fullyConnectedLayer(numClasses)
    softmaxLayer
    weightedCrossEntropyLayer(classNames,classWeights)];
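
The custom layer weightedCrossEntropyLayer is not listed here. The sketch below shows the class weighting on made-up counts and evaluates one loss term, assuming the loss has the common form -sum(w.*t.*log(y)) for a one-hot target t and softmax output y; the actual layer may differ, for example in how it averages over a mini-batch.

```matlab
% Sketch: inverse-frequency class weights on hypothetical class counts.
counts = [2000; 4000; 500];
classWeights = 1./counts;
classWeights = classWeights/mean(classWeights);

% Every class contributes the same total weight (weight times count).
disp((classWeights.*counts)')

% Weighted cross entropy for a single observation (assumed form).
y = [0.7; 0.2; 0.1];     % hypothetical softmax output
t = [0; 0; 1];           % one-hot target: the rarest class
loss = -sum(classWeights.*t.*log(y));
fprintf('Weighted loss: %.4f\n',loss)
```

A mistake on the rarest class is penalized most, which compensates for that class appearing less often during training.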

Train Network

Specify the training options. Use the Adam optimizer with a mini-batch size of 128 and a learning rate of 5e-4. Train for 25 epochs and reduce the learning rate by a factor of 10 after 20 epochs.

miniBatchSize = 128;
validationFrequency = floor(numel(YTrain)/miniBatchSize);
options = trainingOptions('adam', ...
    'InitialLearnRate',5e-4, ...
    'MaxEpochs',25, ...
    'MiniBatchSize',miniBatchSize, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false, ...
    'ValidationData',{XValidation,YValidation}, ...
    'ValidationFrequency',validationFrequency, ...
    'ValidationPatience',Inf, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.1, ...
    'LearnRateDropPeriod',20);
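
The numbers implied by these options can be worked out directly. This sketch assumes the training set holds the 21843 speech clips reported in the spectrogram log plus the 3200 background clips added afterwards, and that the piecewise schedule multiplies the rate by the drop factor after every 20 epochs.

```matlab
% Sketch: iterations per epoch and learning rate per epoch.
numTrain = 21843 + floor(0.8*4000);              % speech plus background clips
miniBatchSize = 128;
itersPerEpoch = floor(numTrain/miniBatchSize);   % also the validation frequency

epochs = 1:25;
learnRate = 5e-4 * 0.1.^floor((epochs-1)/20);    % drop by 10x after 20 epochs

fprintf('%d iterations per epoch\n',itersPerEpoch)
shown = [1 20 21 25];
fprintf('Epoch %2d: learning rate %g\n',[epochs(shown); learnRate(shown)])
```

Setting the validation frequency to the number of iterations per epoch means the network is validated about once per epoch.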

Train the network. If you do not have a GPU, then training the network can take some time. To load a pretrained network instead of training a network from scratch, set doTraining to false.

doTraining = true;
if doTraining
    trainedNet = trainNetwork(augimdsTrain,layers,options);
else
    s = load('commandNet.mat');
    trainedNet = s.trainedNet;
end

Evaluate Trained Network

Calculate the final classification error on the training set (without data augmentation) and on the validation set. Plot the confusion matrix. The network is very accurate on this data set. However, the training, validation, and test data all come from similar distributions that do not necessarily reflect real-world environments. This applies in particular to the unknown category, which contains utterances of only a small number of words.

YValPred = classify(trainedNet,XValidation);
validationError = mean(YValPred ~= YValidation);
YTrainPred = classify(trainedNet,XTrain);
trainError = mean(YTrainPred ~= YTrain);
disp("Training error: " + trainError*100 + "%")
disp("Validation error: " + validationError*100 + "%")

figure
plotconfusion(YValidation,YValPred,'Validation Data')
Training error: 1.5374%
Validation error: 4.1654%

In applications with constrained hardware resources, such as mobile applications, it is important to respect limitations on available memory and computational resources. Compute the total size of the network in kilobytes, and test its prediction speed when using the CPU. The prediction time is the time for classifying a single input image. If you input multiple images to the network, these can be classified simultaneously, leading to shorter prediction times per image. For this application, however, the single-image prediction time is the most relevant.

info = whos('trainedNet');
disp("Network size: " + info.bytes/1024 + " kB")

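numTrials = 100;
time = zeros(numTrials,1);   % preallocate the timing results
for i = 1:numTrials
    x = randn(imageSize);
    tic
    [YPredicted,probs] = classify(trainedNet,x,"ExecutionEnvironment",'cpu');
    time(i) = toc;
end
% Discard the first 10 runs as warm-up before averaging.
disp("Single-image prediction time on CPU: " + mean(time(11:end))*1000 + " ms")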
Network size: 573.665 kB
Single-image prediction time on CPU: 3.0115 ms
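
As a rough cross-check of the reported size, the learnable parameters of the architecture can be counted by hand. This sketch assumes a 5-by-1-by-64 input to the fully connected layer, 12 classes, and single-precision storage (4 bytes per parameter); these are assumptions, and the result is only an estimate.

```matlab
% Sketch: count learnable parameters and estimate their memory footprint.
convParams = @(k,cin,cout) k*k*cin*cout + cout;   % weights plus biases
channels = [1 16 32 64 64 64 64];                 % input and conv filter counts

numConv = 0;
for i = 1:numel(channels)-1
    numConv = numConv + convParams(3,channels(i),channels(i+1));
end
numBN = 2*sum(channels(2:end));    % one offset and one scale per channel
numFC = 12*(5*1*64) + 12;          % fully connected weights plus biases

total = numConv + numBN + numFC;
fprintf('%d learnable parameters, about %.0f kB in single precision\n', ...
    total,total*4/1024)
```

The estimate of about 541 kB is somewhat below the 573.665 kB reported by whos, presumably because the network object stores more than the learnable parameters.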

Detect Commands Using Streaming Audio from Microphone

Test your newly trained command detection network on streaming audio from your microphone. If you have not trained a network, then type load('commandNet.mat') at the command line to load a pretrained network and the parameters required to classify live, streaming audio. Try speaking one of the speech commands, for example, 'yes', 'no', or 'stop'. Then, try one of the unknown words, such as 'marvin', 'sheila', 'bed', 'house', 'cat', 'bird', or any number from zero to nine.

Specify the audio sampling rate and classification rate in Hz and create an audio device reader to read audio from your microphone.

fs = 16e3;
classificationRate = 20;
audioIn = audioDeviceReader('SampleRate',fs,'SamplesPerFrame',floor(fs/classificationRate));

Specify parameters for the streaming spectrogram computations and initialize a buffer for the audio. Extract the classification labels of the network and initialize buffers of half a second for the labels and classification probabilities of the streaming audio. Use these buffers to build 'agreement' over when a command is detected using multiple frames over half a second.

frameLength = frameDuration*fs;
hopLength = hopDuration*fs;
waveBuffer = zeros([fs,1]);

labels = trainedNet.Layers(end).ClassNames;
YBuffer(1:classificationRate/2) = "background";
probBuffer = zeros([numel(labels),classificationRate/2]);
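
The timing implied by these settings, and the way a new frame enters the audio buffer, can be sketched as follows. The frame of ones stands in for microphone input and is an assumption for illustration.

```matlab
% Sketch: frame and buffer timing for the streaming setup.
fs = 16e3;
classificationRate = 20;

samplesPerFrame = floor(fs/classificationRate);    % 800 samples per read
frameMs = 1000*samplesPerFrame/fs;                 % 50 ms of audio per read
bufferLength = classificationRate/2;               % 10 labels
bufferSeconds = bufferLength/classificationRate;   % 0.5 s of decisions

% Shift the one-second audio buffer left and append the newest frame.
waveBuffer = zeros(fs,1);
x = ones(samplesPerFrame,1);        % stand-in for one frame from the microphone
waveBuffer = [waveBuffer(numel(x)+1:end); x];

fprintf('%d samples (%g ms) per frame, %d labels (%g s) per buffer\n', ...
    samplesPerFrame,frameMs,bufferLength,bufferSeconds)
```

The network therefore always classifies the most recent second of audio, updated every 50 ms.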

Create a figure and detect commands as long as the created figure exists. To stop the live detection, simply close the figure. Add the path of the auditorySpectrogram function that calculates the spectrograms.

h = figure('Units','normalized','Position',[0.2 0.1 0.6 0.8]);
addpath(fullfile(matlabroot,'examples','audio','main'))

while ishandle(h)

    % Extract audio samples from audio device and add to the buffer.
    x = audioIn();
    waveBuffer(1:end-numel(x)) = waveBuffer(numel(x)+1:end);
    waveBuffer(end-numel(x)+1:end) = x;

    % Compute the spectrogram of the latest audio samples.
    spec = auditorySpectrogram(waveBuffer,fs, ...
        'WindowLength',frameLength, ...
        'OverlapLength',frameLength-hopLength, ...
        'NumBands',numBands, ...
        'Range',[50,7000], ...
        'WindowType','Hann', ...
        'WarpType','Bark', ...
        'SumExponent',2);
    spec = log10(spec + epsil);

    % Classify the current spectrogram, save the label to the label buffer,
    % and save the predicted probabilities to the probability buffer.
    [YPredicted,probs] = classify(trainedNet,spec,'ExecutionEnvironment','cpu');
    YBuffer(1:end-1)= YBuffer(2:end);
    YBuffer(end) = YPredicted;
    probBuffer(:,1:end-1) = probBuffer(:,2:end);
    probBuffer(:,end) = probs';

    % Plot the current waveform and spectrogram.
    subplot(2,1,1);
    plot(waveBuffer)
    axis tight
    ylim([-0.2,0.2])

    subplot(2,1,2)
    pcolor(spec)
    caxis([specMin+2 specMax])
    shading flat

    % Now do the actual command detection by performing a very simple
    % thresholding operation. Declare a detection and display it in the
    % figure title if all of the following hold:
    % 1) The most common label is not |background|.
    % 2) At least |countThreshold| of the latest frame labels agree.
    % 3) The maximum predicted probability of the predicted label is at least |probThreshold|.
    % Otherwise, do not declare a detection.
    [YMode,count] = mode(YBuffer);
    countThreshold = ceil(classificationRate*0.2);
    maxProb = max(probBuffer(labels == YMode,:));
    probThreshold = 0.7;
    subplot(2,1,1);
    if YMode == "background" || count<countThreshold || maxProb < probThreshold
        title(" ")
    else
        title(YMode,'FontSize',20)
    end

    drawnow

end
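
The thresholding rule in the loop can be tried offline on synthetic buffers. In this sketch the class names, buffer contents, and probabilities are made up; only the three conditions mirror the loop above.

```matlab
% Sketch: apply the detection rule to hypothetical label and probability buffers.
classNames = ["background","no","yes"];        % hypothetical subset of classes
labels = categorical(classNames,classNames)';
YBuffer = categorical(["background","yes","yes","no","yes", ...
                       "yes","yes","background","yes","yes"],classNames);
probBuffer = 0.1*ones(numel(labels),numel(YBuffer));
probBuffer(3,5) = 0.85;                        % one confident "yes" frame

classificationRate = 20;
countThreshold = ceil(classificationRate*0.2); % at least 4 agreeing frames
probThreshold = 0.7;

[YMode,count] = mode(YBuffer);                 % most common label and its count
maxProb = max(probBuffer(labels == YMode,:));  % best probability for that label

isDetection = YMode ~= "background" && count >= countThreshold ...
    && maxProb >= probThreshold;
fprintf('Mode: %s, count: %d, detection: %d\n',string(YMode),count,isDetection)
```

Requiring both agreement across frames and one high-probability frame suppresses isolated misclassifications while still reacting within half a second.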

References

[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/licenses/by/4.0/legalcode.
