Feature Extraction

Mel spectrogram, MFCC, pitch, spectral descriptors

Extract features from audio signals for use as input to machine learning or deep learning systems. Use individual functions, such as melSpectrogram, mfcc, pitch, and spectralCentroid, or use the audioFeatureExtractor object to create a feature extraction pipeline that minimizes redundant calculations. In live scripts, use Extract Audio Features to graphically select the features to extract.
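As a rough illustration of the two approaches described above, the following sketch extracts MFCC, pitch, and spectral centroid first with individual function calls and then with a single audioFeatureExtractor object. The synthetic test tone, the window length, and the overlap are assumptions chosen for illustration, not recommended settings.

```matlab
% Synthetic test signal (an assumption for illustration):
% one second of a 440 Hz tone with a little added noise.
fs = 16000;
t = (0:fs-1)'/fs;
audioIn = sin(2*pi*440*t) + 0.01*randn(fs,1);

% Option 1: individual functions. Each call runs its own analysis
% with its own default window settings.
coeffs = mfcc(audioIn,fs);
f0 = pitch(audioIn,fs);
centroid = spectralCentroid(audioIn,fs);

% Option 2: one audioFeatureExtractor object that computes all of the
% selected features from a shared analysis window.
afe = audioFeatureExtractor('SampleRate',fs, ...
    'Window',hamming(512,'periodic'), ...
    'OverlapLength',256, ...
    'mfcc',true, ...
    'pitch',true, ...
    'spectralCentroid',true);

features = extract(afe,audioIn);  % rows are frames, columns are features
idx = info(afe);                  % maps each feature name to its columns
mfccFromPipeline = features(:,idx.mfcc);
```

Because the individual functions use their own default windows, their outputs can have a different number of frames than the pipeline output. Use the struct returned by info to locate each feature in the combined matrix rather than relying on column order.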


audioFeatureExtractor - Streamline audio feature extraction
cepstralFeatureExtractor - Extract cepstral features from audio segment
ivectorSystem - Create i-vector system

Live Editor Tasks

Extract Audio Features - Streamline audio feature extraction in the Live Editor


audioDelta - Compute delta features
designAuditoryFilterBank - Design auditory filter bank
melSpectrogram - Mel spectrogram
cepstralCoefficients - Extract cepstral coefficients
gtcc - Extract gammatone cepstral coefficients, log-energy, delta, and delta-delta
mfcc - Extract MFCC, log energy, delta, and delta-delta of audio signal
vggishFeatures - Extract VGGish features
openl3Features - Extract OpenL3 features
harmonicRatio - Harmonic ratio
pitch - Estimate fundamental frequency of audio signal
pitchnn - Estimate pitch with deep learning neural network
spectralCentroid - Spectral centroid for audio signals and auditory spectrograms
spectralCrest - Spectral crest for audio signals and auditory spectrograms
spectralDecrease - Spectral decrease for audio signals and auditory spectrograms
spectralEntropy - Spectral entropy for audio signals and auditory spectrograms
spectralFlatness - Spectral flatness for audio signals and auditory spectrograms
spectralFlux - Spectral flux for audio signals and auditory spectrograms
spectralKurtosis - Spectral kurtosis for audio signals and auditory spectrograms
spectralRolloffPoint - Spectral rolloff point for audio signals and auditory spectrograms
spectralSkewness - Spectral skewness for audio signals and auditory spectrograms
spectralSlope - Spectral slope for audio signals and auditory spectrograms
spectralSpread - Spectral spread for audio signals and auditory spectrograms
erb2hz - Convert from equivalent rectangular bandwidth (ERB) scale to hertz
bark2hz - Convert from Bark scale to hertz
mel2hz - Convert from mel scale to hertz
hz2erb - Convert from hertz to equivalent rectangular bandwidth (ERB) scale
hz2bark - Convert from hertz to Bark scale
hz2mel - Convert from hertz to mel scale
phon2sone - Convert from phon to sone
sone2phon - Convert from sone to phon
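
The scale conversion functions listed above come in forward and inverse pairs. A minimal sketch, assuming a few arbitrary example frequencies, converts hertz values to the mel, Bark, and ERB scales and then checks that mel2hz recovers the original values:

```matlab
% Example frequencies in hertz (arbitrary values chosen for illustration).
hz = [100 440 1000 4000];

% Forward conversions to three perceptual frequency scales.
mel = hz2mel(hz);
bark = hz2bark(hz);
erb = hz2erb(hz);

% Inverse conversion: mel2hz should recover the original frequencies
% up to floating-point rounding error.
hzBack = mel2hz(mel);
maxErr = max(abs(hzBack - hz));
```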


Cepstral Feature Extractor - Extract cepstral features from audio segment


Spectral Descriptors

Overview and applications of spectral descriptors.

Learn Pre-Emphasis Filter Using Deep Learning

Use a convolutional deep network to learn a pre-emphasis filter for speech recognition.

Featured Examples

Speaker Verification Using i-Vectors

Speaker verification, or authentication, is the task of confirming that a speaker is who they purport to be. Speaker verification has been an active research area for many years. An early performance breakthrough was to use a Gaussian mixture model and universal background model (GMM-UBM) [1] on acoustic features (usually MFCCs). For an example, see Speaker Verification Using Gaussian Mixture Models. One of the main difficulties of GMM-UBM systems is intersession variability. Joint factor analysis (JFA) was proposed to compensate for this variability by separately modeling inter-speaker variability and channel or session variability [2] [3]. However, [4] discovered that channel factors in the JFA also contained information about the speakers, and proposed combining the channel and speaker spaces into a total variability space, in which each utterance is represented by a low-dimensional i-vector. Intersession variability was then compensated for by using backend procedures, such as linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), followed by scoring, such as with the cosine similarity score. [5] proposed replacing the cosine similarity scoring with a probabilistic LDA (PLDA) model. [11] and [12] proposed a method to Gaussianize the i-vectors so that Gaussian assumptions can be made in the PLDA, an approach referred to as G-PLDA or simplified PLDA. While i-vectors were originally proposed for speaker verification, they have been applied to many problems, such as language recognition, speaker diarization, emotion recognition, age estimation, and anti-spoofing [10]. Recently, deep learning techniques have been proposed to replace i-vectors with d-vectors or x-vectors [8] [6].