This MATLAB code shows how to train a deep learning model that detects the presence of Kannada letters in audio. The example uses the prepared Kannada Dataset to train a convolutional neural network to recognize spoken Kannada letters.
To train a network from scratch, you must first prepare the data set.
Create Training Datastore
Create an audioDatastore (Audio Toolbox) that points to the training data set. Common practice is to divide a data set into a training set and a validation set in a ratio of 4:1. The validation set is held out of training and is used to estimate how accurately the trained network classifies data it has not seen.
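The 4:1 split can be sketched as follows. This is a minimal sketch, not the project's own code: the folder name kannada_demo, the class folder names, and the noise clips are hypothetical placeholders written to a temporary folder so that the snippet runs without the real recordings. It assumes one subfolder per letter and uses audioDatastore and splitEachLabel from Audio Toolbox.

```matlab
% Write a few synthetic clips so the sketch runs without the real recordings.
fs = 16000;                                        % assumed sample rate (Hz)
rootFolder = fullfile(tempdir, "kannada_demo");    % hypothetical data set location
if isfolder(rootFolder), rmdir(rootFolder, "s"); end
letters = ["a", "i", "u"];                         % hypothetical class folder names
for k = 1:numel(letters)
    classFolder = fullfile(rootFolder, letters(k));
    mkdir(classFolder);
    for n = 1:10
        clip = 0.1*randn(fs, 1);                   % 1 s of noise as placeholder audio
        audiowrite(fullfile(classFolder, sprintf("clip%02d.wav", n)), clip, fs);
    end
end

% Label each file by the name of the folder that contains it.
ads = audioDatastore(rootFolder, ...
    "IncludeSubfolders", true, ...
    "FileExtensions", ".wav", ...
    "LabelSource", "foldernames");

% Keep 80% of each class for training and 20% for validation (4:1).
[adsTrain, adsValidation] = splitEachLabel(ads, 0.8, "randomized");
countEachLabel(adsTrain)
```

Splitting per label keeps the class proportions the same in both sets, which matters when some letters have far fewer recordings than others.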
Choose Words to Recognize
Specify the words that you want your model to recognize as commands. Label all words that are not commands as unknown. Labeling words that are not commands as unknown creates a group of words that approximates the distribution of all words other than the commands. The network uses this group to learn the difference between commands and all other words.
To reduce the class imbalance between the known and unknown words and speed up processing, only include a fraction (2%) of the unknown words in the training set.
Use subset (Audio Toolbox) to create a datastore that contains only the commands and the subset of unknown words. Count the number of examples belonging to each category.
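The relabeling and subsetting logic above can be illustrated on a plain categorical array. This is a sketch under stated assumptions: the letter names and counts are made up, and labels stands in for the Labels property of the datastore. With a real datastore, the same logical mask would be passed to subset, as noted in the comment.

```matlab
rng(0);                                            % reproducible random subset

% Synthetic labels standing in for ads.Labels; names and counts are placeholders.
labels = categorical([repmat("a", 50, 1); repmat("i", 50, 1); ...
                      repmat("ka", 500, 1); repmat("ga", 500, 1)]);
commands = categorical(["a", "i"]);                % letters treated as commands

isCommand = ismember(labels, commands);
isUnknown = ~isCommand;

% Keep only a small random fraction (2%) of the non-command clips.
includeFraction = 0.02;
keepMask = rand(numel(labels), 1) < includeFraction;
isUnknown = isUnknown & keepMask;

% Relabel the retained non-command clips as "unknown".
labels(isUnknown) = categorical("unknown");

% With a datastore, the same mask would select the files:
%   adsSubset = subset(ads, isCommand | isUnknown);
keptLabels = removecats(labels(isCommand | isUnknown));

% Count the number of examples in each remaining category.
summary(keptLabels)
```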
Compute Auditory Spectrograms
To prepare the data for efficient training of a convolutional neural network, convert the speech waveforms to auditory-based spectrograms.
Define the parameters of the feature extraction. segmentDuration is the duration of each speech clip (in seconds). frameDuration is the duration of each frame for spectrum calculation. hopDuration is the time step between each spectrum. numBands is the number of filters in the auditory spectrogram.
Read a file from the dataset. Training a convolutional neural network requires input to be a consistent size. Some files in the data set are less than 1 second long. Apply zero-padding to the front and back of the audio signal so that it is of length segmentSamples.
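A minimal sketch of the parameter definitions and the zero-padding step is shown below. The numeric values (16 kHz sample rate, 25 ms frames, 10 ms hop, 50 bands) are assumptions chosen for illustration, and the input clip is synthetic noise rather than a file read from the data set.

```matlab
fs = 16e3;                 % assumed sample rate (Hz)
segmentDuration = 1;       % duration of each speech clip (s)
frameDuration   = 0.025;   % duration of each analysis frame (s), assumed value
hopDuration     = 0.010;   % time step between spectra (s), assumed value
numBands        = 50;      % number of auditory filters, assumed value

% Convert the durations to sample counts.
segmentSamples = round(segmentDuration*fs);
frameSamples   = round(frameDuration*fs);
hopSamples     = round(hopDuration*fs);
overlapSamples = frameSamples - hopSamples;

% Synthetic 0.75 s clip standing in for a recording shorter than 1 s.
x = 0.1*randn(12000, 1);

% Split the missing samples between the front and the back of the signal.
numToPadFront = floor((segmentSamples - size(x, 1))/2);
numToPadBack  = ceil((segmentSamples - size(x, 1))/2);
xPadded = [zeros(numToPadFront, 1, "like", x); x; zeros(numToPadBack, 1, "like", x)];
```

This sketch only pads; a clip longer than segmentSamples would pass through unchanged and would need to be trimmed separately.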
To extract audio features, call extract. The output is a mel spectrum with time across rows.
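The extraction call might look like the sketch below. It assumes Audio Toolbox R2022b or later, where setExtractorParameters is available (earlier releases used setExtractorParams), and the parameter values and FFT length of 512 are illustrative assumptions rather than the project's settings. The input is synthetic noise.

```matlab
fs = 16e3;                                  % assumed sample rate (Hz)
frameSamples   = round(0.025*fs);           % 25 ms frame, assumed
hopSamples     = round(0.010*fs);           % 10 ms hop, assumed
overlapSamples = frameSamples - hopSamples;
numBands = 50;                              % assumed number of mel bands

% Configure the extractor to return only the mel spectrum.
afe = audioFeatureExtractor( ...
    "SampleRate", fs, ...
    "FFTLength", 512, ...
    "Window", hann(frameSamples, "periodic"), ...
    "OverlapLength", overlapSamples, ...
    "melSpectrum", true);
setExtractorParameters(afe, "melSpectrum", "NumBands", numBands);

% One second of synthetic audio standing in for a padded clip.
x = 0.1*randn(fs, 1);

% Rows are time frames (hops); columns are mel bands.
features = extract(afe, x);
[numHops, numFeatures] = size(features)
```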
In this example, you post-process the auditory spectrogram by applying a logarithm. Taking a log of small numbers can lead to roundoff error.
Scale the features by the window power and then take the log. To obtain data with a smoother distribution, take the logarithm of the spectrograms using a small offset.
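The scaling and offset logarithm can be sketched as follows. The normalization factor 2/(sum(win)^2) is one common reading of "window power" and is an assumption here, as are the offset value 1e-6 and the toy feature values. The window is written out by hand and has the same values as a periodic Hann window.

```matlab
frameSamples = 400;                          % 25 ms at 16 kHz, assumed
n = (0:frameSamples-1)';
win = 0.5 - 0.5*cos(2*pi*n/frameSamples);    % periodic Hann window

% Toy spectrogram values, including an exact zero (a silent band).
features = [0 1e-9; 0.5 2.0];

% Scale by the window power (assumed normalization).
unNorm = 2/(sum(win)^2);
scaled = features/unNorm;

% A plain log10 would map the zero entry to -Inf; the offset keeps it finite.
epsil = 1e-6;                                % assumed offset
X = log10(scaled + epsil);
disp(X)
```

The offset sets a floor on the output (here log10(1e-6) = -6), so silent frames no longer produce extreme values that dominate the input distribution.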
Isolate the training and validation labels. Remove empty categories using removecats.
Visualize Data
Plot the waveforms and auditory spectrograms of a few training samples, and play the corresponding audio clips, to confirm that the training and validation data are labeled correctly.
Plot the distribution of the different class labels in the training and validation sets.
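A small sketch of the label clean-up and distribution plot is given below. The label values are synthetic placeholders; in the real workflow YTrain and YValidation would come from the Labels property of the training and validation datastores.

```matlab
% Synthetic labels; "u" is declared as a category but never occurs.
allLetters = ["a", "i", "u", "unknown"];
YTrain      = categorical(["a"; "a"; "i"; "unknown"; "unknown"; "a"], allLetters);
YValidation = categorical(["a"; "i"; "unknown"], allLetters);

% Drop categories that have no examples so they do not appear in plots or the loss.
YTrain      = removecats(YTrain);
YValidation = removecats(YValidation);

% Bar heights show the number of examples per class.
figure
subplot(2, 1, 1)
histogram(YTrain)
title("Training label distribution")
subplot(2, 1, 2)
histogram(YValidation)
title("Validation label distribution")
```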
Define Neural Network Architecture
Create a simple network architecture as an array of layers. Use convolutional and batch normalization layers, and downsample the feature maps "spatially" (that is, in time and frequency) using max pooling layers. Add a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the network to perform the same classification independent of the exact position of the speech in time. Global pooling also significantly reduces the number of parameters in the final fully connected layer. To reduce the possibility of the network memorizing specific features of the training data, add a small amount of dropout to the input to the last fully connected layer.
The network is small, as it has only five convolutional layers with few filters. numF controls the number of filters in the convolutional layers. To increase the accuracy of the network, try increasing the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers. You can also try increasing the number of convolutional filters by increasing numF.
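One possible layer array matching this description is sketched below, assuming Deep Learning Toolbox. The input size (98 hops by 50 bands), numF = 12, the dropout probability, and the number of classes are illustrative assumptions. The built-in classificationLayer is used as a stand-in for the weighted loss layer, which is a custom layer and not reproduced here.

```matlab
numHops = 98;           % assumed number of time frames per spectrogram
numBands = 50;          % assumed number of frequency bands
numClasses = 4;         % assumed number of classes, including "unknown"
numF = 12;              % base number of convolutional filters, assumed
dropoutProb = 0.2;      % assumed dropout probability

% Three stride-2 pooling stages shrink the time axis by about a factor of 8,
% so this pool size covers the remaining time steps (global pooling over time).
timePoolSize = ceil(numHops/8);

layers = [
    imageInputLayer([numHops numBands])

    convolution2dLayer(3, numF, "Padding", "same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, "Stride", 2, "Padding", "same")

    convolution2dLayer(3, 2*numF, "Padding", "same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, "Stride", 2, "Padding", "same")

    convolution2dLayer(3, 4*numF, "Padding", "same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, "Stride", 2, "Padding", "same")

    convolution2dLayer(3, 4*numF, "Padding", "same")
    batchNormalizationLayer
    reluLayer

    convolution2dLayer(3, 4*numF, "Padding", "same")
    batchNormalizationLayer
    reluLayer

    maxPooling2dLayer([timePoolSize, 1])    % pool over the whole time axis
    dropoutLayer(dropoutProb)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];                   % stand-in for the weighted loss layer
```

With these sizes the time axis shrinks from 98 to 49, 25, and then 13 frames, which equals timePoolSize, so the last pooling layer collapses time to a single step.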
Use a weighted cross entropy classification loss. weightedClassificationLayer(classWeights) creates a classification layer that computes the cross entropy loss with each observation weighted by the entry of classWeights for its class. Specify the class weights in the same order as the classes appear in categories(YTrain). To give each class equal total weight in the loss, use class weights that are inversely proportional to the number of training examples in each class. When using the Adam optimizer to train the network, the training algorithm is independent of the overall normalization of the class weights.
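The inverse-frequency weights can be computed as in the sketch below. The labels are synthetic placeholders, and dividing by the mean is an optional normalization chosen here for readability; as noted above, the overall scale does not matter when training with Adam.

```matlab
% Synthetic, imbalanced training labels (placeholder letter names).
YTrain = categorical([repmat("a", 10, 1); repmat("i", 5, 1); repmat("unknown", 20, 1)]);

% countcats returns counts in the same order as categories(YTrain).
counts = countcats(YTrain);

% Weights inversely proportional to class frequency, normalized to mean 1.
classWeights = 1./counts;
classWeights = classWeights'/mean(classWeights);

% Each class now contributes the same total weight (count times weight).
totalWeightPerClass = classWeights .* counts';
disp(table(categories(YTrain), counts, classWeights', totalWeightPerClass', ...
    'VariableNames', {'Class', 'Count', 'Weight', 'TotalWeight'}))
```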
Evaluate Trained Network
Calculate the final accuracy of the network on the training set (without data augmentation) and validation set. The network is very accurate on this data set. However, the training, validation, and test data all have similar distributions that do not necessarily reflect real-world environments. This limitation particularly applies to the unknown category, which contains utterances of only a small number of words.
Plot the confusion matrix. Display the precision and recall for each class by using column and row summaries. Sort the classes of the confusion matrix. The largest confusion is between unknown words and commands.
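The evaluation steps can be sketched with synthetic labels as shown below. The true and predicted labels are made-up placeholders; in the real workflow the predictions would come from the trained network, for example with classify as indicated in the comment. confusionchart requires Deep Learning Toolbox or Statistics and Machine Learning Toolbox.

```matlab
classNames = ["a", "i", "unknown"];          % placeholder class names

% Synthetic ground truth and predictions. With a trained network:
%   YValPred = classify(trainedNet, XValidation);
YValidation = categorical(["a"; "a"; "a"; "i"; "i"; "unknown"; "unknown"; "unknown"], classNames);
YValPred    = categorical(["a"; "a"; "a"; "i"; "unknown"; "unknown"; "unknown"; "i"], classNames);

% Accuracy is the fraction of predictions that match the true label.
validationAccuracy = mean(YValPred == YValidation);
fprintf("Validation accuracy: %.1f%%\n", 100*validationAccuracy);

% Column summaries give per-class precision; row summaries give per-class recall.
figure
cm = confusionchart(YValidation, YValPred, ...
    "Title", "Confusion Matrix for Validation Data", ...
    "ColumnSummary", "column-normalized", ...
    "RowSummary", "row-normalized");

% Put the classes in a fixed order, with "unknown" last.
sortClasses(cm, classNames);
```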
Code execution sequence:
- MFCCTrainingCode – Don’t close training progress graph
- ConfusionMatrixCode – Don’t close matrix chart
- MFCCSerailTestingCode – show Result_Table
- MFCCRandomTestingCode – command window output
- MFCCLiveDemo – Live record window
- LPCTrainingCode – Don’t close training progress graph
- ConfusionMatrixCode – Don’t close matrix chart
- LPCSerailTestingCode – show Result_Table
- LPCRandomTestingCode – command window output
- LPCLiveDemo – Live record window
- Finally show training progress comparison and confusion matrix comparison