Gabriele Bunkheila, MathWorks
Are you a signal processing engineer working on DSP algorithms, product development, or signal measurements? Are you trying to use more machine learning or deep learning in your projects?
In this session you will learn the fundamental ideas around the application of deep learning to audio, speech, and acoustics.
We start by discussing the use of established pre-trained deep learning models to solve a few complex but standard problems. We then show how to design, train, and deploy a complete speech command recognition system from scratch using MATLAB, starting from a reasonably large dataset and ending up with a real-time prototype.
Gabriele Bunkheila is a senior product manager at MathWorks, where he coordinates the strategy of MATLAB toolboxes for audio and DSP. After joining MathWorks in 2008, he worked as a signal processing application engineer for several years, supporting MATLAB and Simulink users across industries from algorithm design to real-time implementations. Before MathWorks, he held a number of research and development positions and he was a lecturer of sound theory and technologies at the national film school of Rome. He has a master’s degree in physics and a Ph.D. in communications engineering.
Recorded: 17 Mar 2021
Hello, and welcome to this introduction to deep learning for audio and speech applications. My name is Gabriele Bunkheila and I work in product management here at MathWorks, focusing on our products for audio and digital signal processing. I thought it would be important to develop this content because first, speech, audio, and acoustic applications are among the largest and fastest growing application areas for deep learning. Second, deep learning can be fairly hard, not only to understand, but also to put into practice, especially for signal processing experts not specifically trained in machine learning. And third, speech, audio, and acoustics applications tend to require more domain-specific capabilities beyond what's normally included under the umbrella of deep learning.
So to get started and to be productive, you need a bit more than a generic deep learning introduction. Here's my high-level agenda for the next 45 minutes or so. I'll start with a few basic ideas on what I mean by deep learning, when it can be relevant to audio signals, and what to expect as a final user. Then I'll move on to showing you how to design, train, and deploy a fairly simple deep learning model for speech command recognition. And finally I'll try to go a bit more in-depth on a handful of selected topics to help you understand where to go next.
Let's get started then with some definitions and a couple of demos. Let's try to agree on the very basics first. Deep learning is a type of machine learning in which a model learns to perform highly complex tasks for image, time series, or text data. Deep learning is usually implemented using a neural network architecture. We talk about learning because it is all about training neural networks. We do so by optimizing some parameters which we call weights. You might have heard the term backpropagation used to refer to the particular family of optimization algorithms used to do exactly this. And that optimization is really what we call learning, using a cognitive analogy. As for the deep part of deep learning, that's related to the high number of hidden layers involved in these types of systems, and it also marks the use of more recent technologies that were a game changer compared with the previous neural net systems that have been around since the 70s.
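A minimal Python sketch of learning as optimization, assuming a toy model with a single weight and a squared-error loss. The data, learning rate, and function name are illustrative and do not come from the session. Real networks apply the same idea to a very large number of weights, with backpropagation supplying the gradients.

```python
# Toy illustration: "learning" a single weight w so that w * x approximates y.
# All values here are made up for illustration.

def train_weight(xs, ys, lr=0.01, steps=500):
    w = 0.0  # initial guess for the weight
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # step against the gradient
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a "true" weight of 2
learned_w = train_weight(xs, ys)
```

After enough steps, `learned_w` settles very close to the weight that generated the data.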
The role of those old neural network models would typically be limited to relatively simple tasks, like separating a configuration space. The real intelligence would still live in the data processing used to extract features and other meaningful discriminating information. Deep learning allowed the possibility of many new different types of layers in between the data and the final consolidated part of the network. Those many additional layers allow the system to create a structured and more abstract understanding of the patterns in the data. On the flip side, the high number of layers and parameters to estimate calls for a lot of data. Even with a lot of data, the complexity of the system could make the model impossible to optimize, but the deep learning breakthrough brought the invention of new types of layers that made these optimization problems possible to solve in practice.
The convolution layers visible here in blue are a well-known example of this. Convolutional neural networks, or CNNs for short, tend to be built as a repetition of specific patterns of layers that proved particularly effective in training. See here the convolutional layers, the ReLU nonlinearities, and the pooling layers. Here on the far right of the diagram, you still find traditional fully connected layers for the final classification task.
Now, if we simplify the network part of this diagram and look at its input and output, we can say this model works as an image classifier. Classifiers are arguably the simplest type of machine learning. Let me show you an example of what a classifier would look like for audio data. In this MATLAB script, I can select one out of a collection of short audio signals stored in a cell array. On line three, I'm playing back the audio.
And on line four, I'm passing it through a pretrained classifier. Let me try a couple more.
One last go.
OK, it looks like it's doing a pretty good job. I can represent an audio classifier like this, very similar to this picture, with only a couple of obvious modifications. Deep learning can be used in many other types of problems where an output is inferred from an input. For example, many signal processing applications are about returning an output signal from an input signal. You might call this a signal processing engine or use an application-specific name, like signal enhancer, if that's the aim of the game.
You could also have a network generate a signal based on a few numeric parameters. Here, you would use names like GAN, as in Generative Adversarial Networks, or decoders, as found in deep learning autoencoders. Whatever your types of inputs and outputs, besides ensuring that the model in the middle can learn the task, what you really need is a lot of annotated data to train it.
And that was the main reason why that short demo that I showed worked so well. Running under the hood was a model developed by Google trained on a huge data set made of annotated audio tracks from YouTube videos. The resulting pre-trained neural network is called YAMNet, and is able to classify sound types according to a fantastic hierarchical ontology of 521 different classes.
Let me mention in passing that this function can do much more than plotting word clouds, like returning individual time intervals or sound regions. In fact, the individual audio segments that I selected manually in that example were extracted automatically using classifySound itself from a single demo recording that ships with MATLAB. The other thing to mention is that this comes with a companion function called yamnetGraph to navigate the class ontology and post-process the predictions.
For example, it may be that you don't care about something being exactly a bark, provided that you know it's the sound of a pet animal, et cetera. classifySound is an example of a ready-to-use deep learning model packaged as a single-line function call. MATLAB has other pre-trained models available. And if you're lucky enough to find just what you require, then you may not need to look further and build your own. Building your own, however, is what I'm going to talk about next.
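A minimal Python sketch of rolling a fine-grained prediction up to a broader category, as in the bark-versus-pet example. The parent map below is a small hand-written fragment for illustration only. It is not the real ontology data, and it does not reproduce how yamnetGraph works.

```python
# Toy parent map: each class points to its parent (None marks a top-level class).
# This is a hand-written fragment for illustration, not the real ontology data.
PARENT = {
    "Bark": "Dog",
    "Dog": "Domestic animals, pets",
    "Meow": "Cat",
    "Cat": "Domestic animals, pets",
    "Domestic animals, pets": "Animal",
    "Animal": None,
}

def ancestors(label):
    """Return the chain from label up to its top-level class, inclusive."""
    chain = [label]
    while PARENT[chain[-1]] is not None:
        chain.append(PARENT[chain[-1]])
    return chain

def is_a(label, category):
    """True if label equals category or descends from it."""
    return category in ancestors(label)

bark_is_pet = is_a("Bark", "Domestic animals, pets")
```

With a structure like this, a predicted "Bark" can be reported at whatever level of abstraction the application cares about.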
One general important point before we move on: this diagram of a signal classifier is slightly misleading, as the vast majority of deep learning models don't learn from or operate directly on plain time-domain signals. Instead, most often, signals undergo a processing step that can go under names like time-frequency transformation or feature extraction. The few cases where networks operate on raw data are often referred to as end-to-end learning, and that's only a small subset of deep learning.
For example, when we used that fancy YAMNet model from Google to classify sounds, under the hood, the classifySound function also took care itself of transforming the time-domain signals into the right format that the YAMNet network was trained to consume. We'll say more about this later on.
So in summary, when developing your own deep learning model, the three most basic things you need are, obviously, a network that can learn the task through optimizing its parameters, enough data for it to learn from, but also the right signal processing algorithms to prepare your data. We're going to use a practical example to see how to do all this. It will be real-time speech command recognition. And let me show you first the end result in action, and then I'll start to explain some of the details.
No. Up. Down. Left. Right. On. Off. Go. Go. Stop.
Just like YAMNet, the network used here under the hood was a CNN, or a Convolutional Neural Network, although somewhat smaller. You could see it, indeed, understanding commands live as audio came in. This network was trained on another smaller data set released by Google called the Speech Command dataset, and we'll say more about that shortly.
The live audio was transformed into a specific time-frequency representation called a Bark spectrogram before being presented to the network, and the whole thing, signal processing and network predictions, was running on a Raspberry Pi board. MATLAB acquired the audio, plotted it in the time domain, and sent it to the board via UDP. Predictions and spectrograms came back again live through UDP.
You can review all the code for this example at your own pace online, simply searching for speech command recognition using deep learning, and also its companion example, speech command recognition code generation using Raspberry Pi. So having concluded our introduction and our overview on just using a pre-developed deep learning tool for audio and speech, let's now dive into the specifics on these two open examples to understand how such a model can be built in practice.
We'll start from the dataset, and I'll walk you through the main steps down to running the trained model in real time, first in MATLAB and then on the Raspberry Pi. While reviewing this example offline, to get the code from MATLAB documentation, click on the Open Script button at the top part of the page.
Note that I've pre-captured all the code used here to more easily point things out. The part about developing the model starts around line 183. It starts from downloading the dataset, if you don't have it yet.
I have already mentioned the annotated data used in this example, which is the Speech Command dataset released by Google in 2017. The data here was collected, labeled, and packaged to allow the development of a system that recognizes 10 vocal commands and tells them apart from other random words. It was released along with a blog post and research paper and a sample mobile app. Once you download it, you have thousands of one-second recordings organized in a simple folder hierarchy, with each bottom-level folder named after the word spoken in the recordings within.
Again, all recordings last one second, with around 3,500 files for each of the about 30 individual words, including words that are not actual commands. When you have such a large amount of data and a high number of files, it's important to step out of the file handling details, as you don't want to be writing all the file handling code yourself. So here on line 199, we only pass the root folder of the dataset to what we call an audioDatastore object, which takes care of quite a few things itself.
To start, it gives us a list of all the audio files that it found, each with its own label, the name of the folder it was found in. It also ensures that labels and file names are linked, so I can operate on the files through the labels. Now, we need to take all these recordings and organize them in a way that our deep learning model can learn from. Here's how we do it.
First, we define which subset of words we call commands. Then we pick a random 20% of the remaining words, and we label all those as unknown. These will be used to train the model to tell apart random words from actual commands. Based on those definitions, we build a new subset of the original datastore that only includes what we need.
Notice that we did all that simply by using logical operators around the labels, with no low-level commands to browse through the folder hierarchy, list folder contents, isolate specific file types, iterate, et cetera, et cetera. If you've ever written code from scratch to do such things, you will know what I'm talking about.
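The selection logic described above can be sketched in a few lines of Python, assuming the recordings are already listed as (file name, label) pairs. The command list is inferred from the words spoken in the demo, and the file names, the seed, and the function name are hypothetical. This is not the audioDatastore API.

```python
import random

# Assumed command set, inferred from the words spoken in the demo
COMMANDS = {"yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"}

def build_subset(items, unknown_fraction=0.2, seed=0):
    """items: list of (filename, label). Returns a relabeled subset.

    Keeps every command recording, plus a random fraction of all other
    recordings relabeled as "unknown".
    """
    rng = random.Random(seed)
    commands = [(f, lab) for f, lab in items if lab in COMMANDS]
    others = [f for f, lab in items if lab not in COMMANDS]
    n_unknown = round(unknown_fraction * len(others))
    unknown = [(f, "unknown") for f in rng.sample(others, n_unknown)]
    return commands + unknown

# Hypothetical miniature dataset: 10 files for each of 5 words
items = [(f"{word}/{i}.wav", word)
         for word in ["yes", "no", "bed", "cat", "tree"]
         for i in range(10)]
subset = build_subset(items)
```

Here the 20 command recordings are kept as they are, and 6 of the 30 remaining files end up in the unknown bucket.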
We are now left with only 11 sets of labeled recordings with about 1,800 files for each command and about three times as many in the unknown bucket. Once the data and the annotations are ready to go, remember what we said about the need to transform the signals into a form that the network can consume. As the title of the section suggests, this is where we turn the raw recordings into a specific type of two-dimensional time frequency representation. These go often under the generic name of spectrograms.
In this particular case, we're showing the use of Bark spectrograms. We do it through this audioFeatureExtractor object. I'll say more about this later on. Still, notice that there are some parameters to choose on how you buffer your time-domain signals, how detailed you want your frequency analysis, et cetera.
Sure enough, you can also go with the default values. These parameters affect your time and frequency resolution, and you'll find that many out there use similar choices that are based on well-established reasons. For example, for speech, it is common to see signal buffers in the region of 20 to 30 milliseconds.
Those parameters affect the size of the time frequency arrays produced by the transformation. For example, in this case, when you transform one second of time domain signal, you get 98 time slices and 50-point frequency spectra. Now, this bit of code is just an example based on a single random file from the data set. When it's about applying this computation to hundreds of thousands of files, you also need to think a bit about efficiency.
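The 98 time slices can be reproduced with simple arithmetic. The sketch below assumes a 25 ms analysis window, a 10 ms hop, and no padding at the edges. These values are assumptions chosen to be consistent with the 20 to 30 millisecond range mentioned above, not settings quoted in the session. The 16 kHz sample rate is the one stated for the dataset later on.

```python
def num_frames(num_samples, frame_len, hop_len):
    """Number of full frames when no padding is applied."""
    if num_samples < frame_len:
        return 0
    return (num_samples - frame_len) // hop_len + 1

fs = 16000                      # sample rate in Hz
frame_len = round(0.025 * fs)   # assumed 25 ms analysis window -> 400 samples
hop_len = round(0.010 * fs)     # assumed 10 ms hop -> 160 samples
frames = num_frames(fs, frame_len, hop_len)   # one second of audio
```

With these assumptions, `frames` comes out at 98, matching the number of time slices quoted above. Consecutive windows overlap by 240 samples.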
This example shows how to distribute feature extraction across the CPU cores available on your machine using this parfor pattern. Parfor stands for parallel for loop, and it will actually execute in parallel if you have Parallel Computing Toolbox available. Note that datastores themselves make this pattern particularly inexpensive to code. Feature extraction can also be accelerated further using GPUs, though we won't cover that here in any detail in the interest of time.
One thing that I want to say is that in this example, we show how to extract all features from the dataset at once. That's easiest and definitely recommended if your entire transformed dataset fits in memory, or if it makes sense at least to store it again on disk. In other situations, feature extraction will have to happen on the fly, as the network trains. Once again, keep in mind that audioDatastore makes that possible, but we won't have time to show it right here.
It will take a few minutes to transform the whole dataset. And once you're done, this code lets you take a peek at three random samples from both the source data and their computed spectrograms, as in these plots here. At the top, notice the 16,000 samples at a 16 kilohertz sample rate, confirming a one-second duration for each file. Below, only 98 points in time over x, as seen a few seconds ago.
Each vertical slice of the spectrogram comes from a time buffer partially overlapping with the previous and next ones, based on the parameters set in audioFeatureExtractor. On the y-axis below, you can see 50 points in frequency. That's a very low number for an ordinary spectrogram, but the Bark spectrogram and other types of auditory spectrogram distribute frequencies more efficiently according to a perceptually aware, near-logarithmic scale. On one hand, this gives us a frequency scale that better reflects how we hear and the likely density of information in speech signals. On the other hand, that also helps limit the complexity of the network and of its training process.
Once the data transformation is completed, this bar chart summarizes how many files we've processed. Notice that besides the training data, visible on top, we also have a smaller validation dataset, visible below. While the training data is the data that the network uses for optimizing its weights during training, the validation data is different data that the network hasn't seen during training. Measuring its performance on the validation data gives the developer a good indication of how well the network can generalize and how much it has actually learned. Notice that you only need a much smaller fraction of the data to do that.
Now that the data is ready, it's time to think about how to use it, and specifically, what network to use it with, which is what this new section is about. Generally speaking, you may hope to know something about the type of network you need, perhaps from a research paper. The one that I have here, for example, explicitly references the application that we're trying to build. A good paper like this would be quite specific on things like the types of layers used and the specific parameters defining each of those.
If you know how to read that information, you can express it in MATLAB, for example, as a vector of layers. You can see here, the concatenation operator is used to create the layer vector. Others, however, would find it somewhat difficult to match the description in the paper with the exact programming syntax, even in MATLAB. So at least the first time, you may want to use an interactive app. This one that you're seeing here is called Deep Network Designer.
To start with, check which available layers match those described in your reference paper. Here, notice things like convolution layers, LSTM, nonlinear activation, pooling, and more. Also, don't forget to look at the required parameters in case you need anything different from the default values.
Then you can compose the network through drag and drop. The first layer will be of type inputLayer, and then here we begin by stacking a typical CNN layer pattern. We have convolution, batch normalization, ReLU activation, and often a pooling layer. Likely, you end up with a few of these stacks, so directly copying and pasting groups of layers is also an option.
In our case, the network was already built programmatically, but we can still use Deep Network Designer to take a better look. Beyond the input layer, notice there were several convolutional layer stacks. And then in the end, the last few layers with a fully connected layer and a softmax layer, in charge of the final classification (there in a minute).
With a fully assembled network, check that the architecture you designed is consistent by analyzing it. In this case, notice at the top right that no errors or warnings are reported. And in the last column of the table, take a look at the number of parameters this network needs to learn and where they belong, which, expectedly, is in the convolutional and fully connected layers.
Once you have the network ready, you can proceed to training it. In principle, training a network is an optimization problem. It is known under the name of backpropagation. To run it in MATLAB, all you need to do is call this trainNetwork function. You do need to set a few options, although you could also go with the default values first and then experiment only once you become more familiar with what those parameters mean. Executing these few lines of code kicks off the training process.
What you see here is the same code executing on a network machine that I have access to, equipped with an NVIDIA GPU. It's common to follow the progress through a graphical monitoring tool. This takes a few minutes to run, but I pre-captured the process and accelerated it so as not to keep you waiting. The top plot shows the accuracy improving in percent units, the blue line for the training data, and the dashed black line for the validation data, updating less frequently.
Once again, the validation data is not used to optimize the network weights, but only to check that the network is abstracting well enough what it learns on the training data. Ideally, you want the black dashed line to be very close to the blue line. But statistically, it obviously can't be higher. Imagine trying this many times with different networks and training parameters until you get both training and validation accuracy high enough and close to each other.
In this case, we reached about 94% accuracy on the validation data. It took under 4 and 1/2 minutes using a single GPU. Note that if the GPU is available and visible to MATLAB, we literally don't have to do anything. MATLAB just uses it when it comes to running network training or inference.
Along with a quantitative evaluation, sometimes you need to evaluate your model in real-world scenarios to answer questions like, how would my model perform with that particular microphone, speaker accent, environment, or background noise which, perhaps, I couldn't use or didn't think about using in the validation set. So a live demonstrator fits all of those testing requirements, as well as offering a more impactful proof that your system works.
These days, a live demonstrator is easily done even in MATLAB, if you know what to use. To run on live audio signals coming from a sound card, all you need is to write a for or while loop, read a buffer of audio samples in the loop from the sound card using audioDeviceReader, buffer the signal, and transform it into a spectrogram of the same shape and duration that the network was trained with. Feed the spectrogram to the network and get a predicted label, and finally, plot the signal and the spectrogram and add the prediction as the title, which just isn't entirely visible in this pre-capture, but that's OK.
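The structure of that loop can be sketched in Python, assuming the sound card reads are replaced by a list of pre-made chunks and the network is replaced by a stand-in energy threshold. The function names and the numbers are hypothetical. In the real pipeline, the one-second window would be turned into a spectrogram before classification.

```python
from collections import deque

def run_stream(chunks, fs, classify):
    """Keep a rolling one-second window over incoming chunks and classify it.

    chunks: iterable of lists of samples (stand-in for sound card reads)
    classify: callable taking a list of fs samples and returning a label
    """
    window = deque([0.0] * fs, maxlen=fs)  # start with one second of silence
    labels = []
    for chunk in chunks:
        window.extend(chunk)               # newest samples push out the oldest
        labels.append(classify(list(window)))
    return labels

def energy_label(samples):
    """Stand-in for feature extraction plus network prediction."""
    energy = sum(s * s for s in samples) / len(samples)
    return "sound" if energy > 0.01 else "background"

fs = 100                     # tiny sample rate to keep the demo fast
quiet = [0.0] * 50
loud = [0.5] * 50
labels = run_stream([quiet, loud, loud, quiet, quiet], fs, energy_label)
```

The key design choice is that the window always holds exactly one second of the most recent audio, so the classifier sees input of the same duration it was trained with, no matter how small each incoming chunk is.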
In the code, this all happens in the initial part of the example, as you can see from the line numbers here on the left. If we run this code section, everything should look very familiar, if only slightly different from what we saw earlier.
Yes. No. Left. Right. Stop. Go. Sheila. Tree. Wow.
Once we have this running in MATLAB, this other example shows how to take this code and embed it into a Raspberry Pi board by generating C++ from all the relevant parts of the MATLAB code that we've just seen in action. When you open the new example, you'll get this new live script. In the interest of time, I'll simply point you to the key parts in here so you know what to expect. And I'll let you go through this on your own if you're interested in the details, perhaps with your own Raspberry Pi board if you happen to have one.
The key part of the story here is this command, codegen. Here, that generates C++ from this MATLAB function called HelperSpeechCommandRecognitionRasPi. And it builds the code into a standalone Linux application using the right toolchain on the Pi. The function HelperSpeechCommandRecognitionRasPi reads the audio segments continuously sent via UDP, buffers them, computes the Bark spectrogram, runs the deep network, and sends spectrograms and predictions back to MATLAB via UDP again.
Keep in mind that the function could also easily read the audio directly from the local sound card on the board, if desired. Notice also that there are several facilities available to control the board from MATLAB, like for running system commands. Here, for example, we're executing the generated application on the board.
An interesting fact is that the codegen command could generate all sorts of C and C++ code. But in this case, we configured it so it uses the ARM Compute Library optimizations for deep learning based on the NEON SIMD extensions, which are supported by the Cortex-A chip on the Raspberry Pi. By specifying explicitly that we are targeting the Raspberry Pi, we enable all the build automation. So we directly build the complete standalone application on the board, and we have it ready to go.
The benefits of using optimized code are apparent when you look at the embedded profiling report generated at the end of the example. After running the code in PIL mode, where PIL stands for Processor-in-the-Loop, in red here you see the time budget available for each iteration on each new audio buffer received, and in green, the approximate average time elapsed for each profiled iteration. This shows very clearly that the Raspberry Pi was able to run the full prediction pipeline within budget by quite some margin, despite having to run a fairly complex network.
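The comparison behind that report is simple arithmetic: a new buffer arrives every buffer length divided by sample rate seconds, and the processing of each buffer has to finish within that time. A minimal Python sketch follows, with a hypothetical function name and illustrative numbers that are not taken from the profiling report.

```python
def realtime_report(samples_per_buffer, fs, avg_processing_s):
    """Compare the per-buffer time budget with the measured processing time."""
    budget_s = samples_per_buffer / fs          # a new buffer arrives this often
    headroom = budget_s / avg_processing_s      # > 1 means faster than real time
    return budget_s, headroom

# Illustrative numbers only, not taken from the profiling report
budget_s, headroom = realtime_report(samples_per_buffer=800, fs=16000,
                                     avg_processing_s=0.02)
keeps_up = headroom > 1.0
```

In this made-up case the budget is 50 ms per buffer, so an average processing time of 20 ms leaves a comfortable margin.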
This concludes my practical walkthrough on how to build a simple speech command recognition system based on deep learning. To recap, at the very beginning, we've seen what it looks like to use deep learning when all is done and prepared for you just to use, as for the classifySound function. And we've gone through some details on developing your own deep learning system for the relatively simple case of speech command recognition.
In both cases, we've discussed in more or less depth the role of network, data, and signal processing. Now I'd like to tell you something more about each of these areas, so you know where to go next when you tackle your own first deep learning project. And that will be the last part of this webinar, trying to go a bit deeper on these three topics specifically.
Let's start from the actual deep learning models. With the speech command recognition example, we saw how to create a network from scratch in MATLAB based on the detailed specification found in papers or elsewhere, either programmatically or interactively using the Deep Network Designer app. There are cases when those models are also published or made available online directly, although possibly, they were developed using a language and a deep learning framework other than MATLAB.
It's important to know that MATLAB can directly import models developed in Keras or Caffe, and MathWorks is also a member of the ONNX consortium. So through ONNX, MATLAB can exchange models with almost all other popular deep learning frameworks, like TensorFlow and PyTorch. A number of established networks may even be readily available in MATLAB. You may have noticed this collection of pre-trained networks available directly from Deep Network Designer, and more may be available beyond these.
For example, Audio Toolbox has a documentation section under machine learning and deep learning for audio called Pre-trained Networks. Unsurprisingly, one of the things that you'll find in there is the YAMNet network. If you click on that link, you'll find that one of the things that the documentation teaches you is how you would go about re-implementing sound classification on your own, similarly to what we've just done for speech command recognition, as you can see from the results over here.
Also, even if I mentioned this already, finding a pre-trained network in MATLAB likely comes with the added benefit of additional support, help, or features, like the yamnetGraph function for YAMNet, which helps you handle the returned classes of the AudioSet ontology at different levels of abstraction. The yamnet function itself returns a regular MATLAB network object with all the layer architecture visible and accessible.
Look at the input layer accepting Mel spectrograms with 64 frequency bands and 96 time slices, and the output layer with 521 classes that we saw earlier. The same granularity is accessible through the Deep Network Designer, once you open the network there. You will see again here visually much of the same we saw a minute ago in the Command window.
All these layers can, of course, be taken away or replaced with others to leverage parts of the pre-trained structure. For example, use transfer learning to repurpose the network for a different but related task, as you see here. I hope this gave you a few more ideas on alternatives to just creating your own deep learning network from scratch or from a written specification.
Next, I'd like to give you a few extra pointers specific to data. Let's be clear. A dataset like the one we used for our simple speech command recognition example will never-- or most unlikely-- be available when you tackle your first commercial application. And possibly, my most important takeaway on data and datasets is that the industry segments really working with deep learning invest a lot more research resources in data engineering than they spend in network engineering.
These specific charts on the top right come from a talk given by Andrej Karpathy, who is the head of AI at Tesla, in 2018. His point is that the emphasis and interest towards network engineering found in academic research has nothing to do with what happens in industry, where creating the right data is really the key concern.
My second takeaway is twofold. On one hand, training deep networks requires huge training datasets. On the other hand, validation datasets do not need to be as big. That means that the strategies and technologies to build annotated datasets may be radically different for the two types of data. I say more about this topic in this virtual session from MATLAB Expo 2020, if you're interested.
Validation datasets, for example, can often be labeled interactively, as they require freshly acquired, application-specific data and high-quality annotations. Signal Labeler and Audio Labeler are two great examples of labeling apps available in MATLAB that can be used for interactive human annotation.
When creating very large training datasets, it's extremely challenging to rely on new recordings and human labeling, just because of the sheer size. You'll have to rely on automated labeling approaches using highly capable, pre-trained AI systems, and on programmatic repurposing of existing annotated data. I collected here a few classes of capabilities that are relevant in this space, with a few examples of fairly recent MATLAB functions in those areas. Note, for example, speech2text, which wraps cloud-based speech APIs from external providers, and the classifySound function that we've seen earlier.
Two other techniques that are key for building training datasets are data synthesis and data augmentation. What I mean here by data synthesis is creating totally new data artificially with such a level of fidelity that it will allow the model to learn about patterns existing in real data. A good example here is the MATLAB function text2speech, which leverages cloud systems like Google DeepMind's quite famous WaveNet to produce realistic and customizable voice samples.
By data augmentation, I mean processing existing data to create altered replicas. On one hand, this can help grow the size of the training dataset, which is always welcome. On the other hand, it can also help improve the quality of existing training data to help the network perform better on more realistic validation data, which is ultimately what matters most.
Probably the best MATLAB example here is the audioDataAugmenter object, which allows combining audio and speech processing effects, including a number of built-in algorithms on pitch and time, to build randomized augmentation pipelines. I'll stop talking about annotated datasets to move on to my very last topic, which is about selecting the right signal features or transformations to create network inputs.
First, let's review the type of time-frequency transformation that we used in the speech command recognition example. When computing spectrograms and many other audio and speech features, it is very common to divide signals into buffers, which could be mutually overlapping for better time resolution. Most time-frequency transformations are based upon transforming each of those buffers into the frequency domain via an FFT or a similar operation. The simplest of those is the short-time Fourier transform, or STFT, often also referred to as a plain spectrogram.
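A minimal Python sketch of a randomized augmentation pipeline, using three simple effects: a random time shift, a random gain, and additive noise. These are deliberately simpler than the pitch and time algorithms mentioned above, and the function name, parameter ranges, and test clip are illustrative assumptions. This is not how audioDataAugmenter is implemented.

```python
import random

def augment(signal, rng, max_shift=100, gain_range=(0.5, 1.5), noise_std=0.005):
    """Return a randomly shifted, scaled, and noise-corrupted copy of signal."""
    n = len(signal)
    shift = rng.randint(-max_shift, max_shift)
    gain = rng.uniform(*gain_range)
    out = []
    for i in range(n):
        j = i - shift                                # positive shift delays the signal
        sample = signal[j] if 0 <= j < n else 0.0    # zero-pad outside the clip
        out.append(gain * sample + rng.gauss(0.0, noise_std))
    return out

rng = random.Random(42)      # fixed seed so the demo is repeatable
clip = [0.0] * 1000
clip[500] = 1.0              # a single click in the middle of the clip
copies = [augment(clip, rng) for _ in range(3)]
```

Each call draws new random parameters, so the three copies differ from each other while the click stays within 100 samples of its original position.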
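The buffering and per-buffer transform can be sketched in plain Python. This is a minimal illustration that assumes a periodic Hann window, no padding, and a direct DFT in place of an FFT to avoid external libraries. The function names and the test tone are hypothetical.

```python
import cmath
import math

def frames_of(signal, frame_len, hop_len):
    """Split signal into overlapping frames (no padding)."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop_len)]

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def power_spectrum(frame, window):
    """One-sided power spectrum of a windowed frame via a direct DFT."""
    n = len(frame)
    x = [f * w for f, w in zip(frame, window)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def stft_power(signal, frame_len, hop_len):
    """Power spectrogram: one row per frame, one column per frequency bin."""
    window = hann(frame_len)
    return [power_spectrum(f, window) for f in frames_of(signal, frame_len, hop_len)]

fs = 800                                       # toy sample rate in Hz
tone = [math.sin(2.0 * math.pi * 100.0 * t / fs) for t in range(256)]
spec = stft_power(tone, frame_len=64, hop_len=32)
peak_bins = [max(range(len(row)), key=lambda k: row[k]) for row in spec]
```

With 64-sample frames at 800 Hz, bins are 12.5 Hz apart, so the 100 Hz tone peaks in bin 8 of every frame. A hop of half the frame length gives 50% overlap between consecutive frames.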
Time-frequency transforms are used very often with network types that were originally designed to work with images, since these 2D signal representations look, indeed, a lot like images. When working with speech or other audible signals, you tend to see more advanced types of spectrograms with frequency scales adapted to different models of human perception. These are more compact while keeping a good level of perceptual detail.
Plain short-time Fourier transforms tend to use a large number of frequency points, NFFT, over the y-axis here. However, that doesn't reflect the selectivity of human hearing nor, arguably, the density of information over frequency for speech signals. In other words, to have fine enough resolution at low frequencies, where human perception is most selective, you also end up with too much information at higher frequencies, where that's not needed at all. The side effect is that if we are providing this signal representation as input to our network, we are likely providing too much redundant data that may push up the network complexity or lead it to focus on irrelevant information.
Instead, we can reorganize the way information is grouped by projecting the raw FFT points on the y-axis over a set of bands that are more representative of how human perception distributes its sensitivity. The Bark filter bank that we used in our example is one of the most used filter banks to model auditory sensitivity over frequency, along with Mel and ERB. You will notice bandwidth increasing with frequency, and filter height decreasing accordingly to conserve energy across filters.
If we project our y-axis points over the Bark filters and sum up all the contributions within each of the filters, we end up with a more compact representation, this time with a number of points on the y-axis equal to the number of Bark filters, which is much smaller than the original number of FFT bands. As expected, you can also notice how frequency values are now distributed near-logarithmically. In the speech command recognition example, where we used Bark spectrograms, we only had 50 bands over the frequency axis.
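The projection step can be sketched as follows in plain Python. Note the substitution: this sketch uses the Mel scale rather than Bark, because the common Mel conversion formula is compact, and it assumes triangular filters normalized to unit area so that wider bands get lower peaks. The function names and the bin-to-frequency mapping are assumptions of this sketch, not the toolbox implementation.

```python
import math

def hz_to_mel(f):
    """Common Mel-scale formula (one of several variants in use)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_bands, num_bins, sample_rate):
    """Build num_bands triangular filters over num_bins one-sided FFT bins.

    Bin k is assumed to sit at k * (sample_rate / 2) / (num_bins - 1) Hz.
    """
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    # Band edges are equally spaced on the Mel scale, so they widen in Hz.
    edges = [mel_to_hz(top_mel * i / (num_bands + 1))
             for i in range(num_bands + 2)]
    bin_hz = [k * nyquist / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for b in range(num_bands):
        lo, center, hi = edges[b], edges[b + 1], edges[b + 2]
        weights = []
        for f in bin_hz:
            if lo < f <= center:
                w = (f - lo) / (center - lo)
            elif center < f < hi:
                w = (hi - f) / (hi - center)
            else:
                w = 0.0
            # Unit-area normalization: wider filters get lower peaks.
            weights.append(w * 2.0 / (hi - lo))
        bank.append(weights)
    return bank

def project(power_spectrum, bank):
    """Sum the weighted FFT points falling inside each filter."""
    return [sum(w * p for w, p in zip(weights, power_spectrum))
            for weights in bank]
```

For example, with a 512-point FFT at 16 kHz, `project(spectrum, mel_filter_bank(50, 257, 16000))` reduces 257 frequency points to 50 band energies per buffer.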
So here's the same auditory spectrum towards the right of this algorithm diagram. Using a Bark filter bank will produce a Bark spectrogram. A Mel filter bank, a Mel spectrogram, probably even more popular, and so forth. In some cases, these are not used as network inputs themselves, but they are further post-processed to extract sound-specific scalar metrics, called, for example, spectral descriptors. Examples include spectral entropy, spectral centroid, spectral flux, and many others.
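As a rough illustration of what such scalar descriptors compute, here is a plain-Python sketch of three of them. These follow common textbook definitions, which are an assumption here: normalization and exact formulas vary between publications and toolboxes, so treat the details as illustrative rather than as a match for any specific function.

```python
import math

def spectral_centroid(power, freqs):
    """Power-weighted mean frequency: the 'center of mass' of the spectrum."""
    total = sum(power)
    if total <= 0:
        return 0.0
    return sum(f * p for f, p in zip(freqs, power)) / total

def spectral_entropy(power):
    """Entropy of the normalized spectrum, scaled to the range [0, 1].

    Close to 0 for a single peak, close to 1 for a flat spectrum.
    """
    total = sum(power)
    if total <= 0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power if p > 0]
    entropy = -sum(q * math.log2(q) for q in probs)
    return entropy / math.log2(len(power))

def spectral_flux(previous_power, power):
    """Euclidean distance between the spectra of two consecutive buffers."""
    return math.sqrt(sum((b - a) ** 2
                         for a, b in zip(previous_power, power)))
```

Each function reduces a whole spectrum (or a pair of consecutive spectra) to a single number per buffer, which is why these descriptors are much more compact than the spectrograms they are derived from.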
There's also another class of more advanced but very widely used features called cepstral coefficients, which are computed from an additional discrete cosine transform operation on the auditory spectrograms. The most used of these is called MFCC, short for Mel-frequency cepstral coefficients. These and many other commonly used features are readily available as MATLAB functions these days, so you don't have to implement these steps yourself and worry about matching exactly what others have used.
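The extra step that turns auditory band energies into cepstral coefficients can be sketched as below in plain Python. The assumptions are worth stating: a natural logarithm is applied to the band energies before the transform, the DCT-II is left unnormalized, and the function names and the small energy floor are made up for this sketch. Published MFCC recipes differ in exactly these details.

```python
import math

def dct_ii(values):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]

def cepstral_coefficients(band_energies, num_coeffs, floor=1e-10):
    """Log-compress the band energies, then keep the first DCT coefficients.

    `floor` avoids taking the logarithm of zero for silent bands.
    """
    log_energies = [math.log(max(e, floor)) for e in band_energies]
    return dct_ii(log_energies)[:num_coeffs]
```

Applied to the band energies of each buffer in a Mel spectrogram, this yields one short coefficient vector per buffer. Keeping only the first few coefficients retains the smooth spectral envelope and discards fine detail.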
One of the challenges is actually remembering and finding all the possible options for features. So here's something you may find useful. We've already encountered the audioFeatureExtractor object in our speech command recognition example. In the Live Editor in MATLAB, you can create an audioFeatureExtractor live task. This looks a bit like an app embedded in the editor itself.
You can use it to specify how you want to buffer and window your signal and then move on to selecting which among the available features you'd like to extract. Notice how in this example, we ended up with 64 features per vector. You can see that shown in this summary, with all the individual indices. That's also visible here in the MATLAB workspace.
One of the advantages is that when you select more features in this way, some computation will be shared, and not carried out twice. You can also easily find out about computation parameters that you may have overlooked when trying to match your features to those from a reference publication. And once you're done, you can decide to only keep the code, and now you can see that this is the same audioFeatureExtractor object that we used previously.
The object itself can also display which features it's been configured to extract. Within audioFeatureExtractor, you'll find most of the established audio and speech features based on standard speech and signal processing. There are then more advanced types of features, which may come themselves from applying deep networks to signals or spectrograms. One great example of this kind is VGGish features, vectors of 128 high-level features produced by the fairly popular VGGish deep network. These can be used as inputs to other networks, or even with simpler traditional machine learning algorithms.
Yet another advanced feature extraction technique with a deep neural flavor is wavelet scattering. In some publications, this is seen as an evolution of the MFCC that we saw a few minutes ago. It also bears analogies to convolutional networks, as it works by applying fixed wavelet-based convolutional kernels in a series of steps that are closely related to those of convolutional neural networks.
And with that, I should really have reached the end of the topics that I had set out to cover. So let me quickly list my conclusions from this presentation. First, although developing deep learning systems is hard, I hope I conveyed that you have examples and tools available in MATLAB to get started right away with deep learning on audio, speech, and acoustic signals.
Second, although a lot of focus in research is placed on the actual deep learning models, once you have a model architecture in hand, do not underestimate the importance of collecting, annotating, augmenting, transforming, or extracting the right features from the data. Specifically, you won't be able to train any useful network without the ability to create, understand, and process your data at will, which is precisely why I personally think MATLAB should be a prominent part of your toolbox when working on deep learning for audio and speech.
Talking about toolboxes-- and this is my last slide-- here's the list of MATLAB toolboxes that I mostly referred to during this presentation. And that's really everything I had. Thank you for watching.