Preprocess audio for VGGish feature extraction
Audio Toolbox / Deep Learning
The VGGish Preprocess block generates mel spectrograms from an audio input that you can then feed to the VGGish pretrained network or to a network that accepts the same inputs as VGGish.
Port_1 — Sound data
Sound data, specified as a one-channel signal (column vector). If Sample rate of input signal (Hz) is 16e3, there are no restrictions on the input frame length. If Sample rate of input signal (Hz) is different from 16e3, then the input frame length must be a multiple of the decimation factor of the resampling operation that the block performs. If the input frame length does not satisfy this condition, the block throws an error with information about the required decimation factor.
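The documentation does not spell out the resampler internals, but assuming the block performs rational resampling by a reduced factor P/Q (upsample by P, downsample by Q), the decimation factor Q follows from the greatest common divisor of the two rates. A hypothetical helper for checking valid frame lengths:

```python
from math import gcd

def resample_decimation_factor(fs_in, fs_out=16000):
    """Decimation factor Q of the rational resampling fs_in -> fs_out.

    The reduced resampling ratio is P/Q with P = fs_out/g and
    Q = fs_in/g, where g = gcd(fs_in, fs_out). The input frame
    length must be a multiple of Q.
    """
    g = gcd(int(fs_in), int(fs_out))
    return int(fs_in) // g

# A 44.1 kHz input resampled to 16 kHz uses the reduced ratio 160/441,
# so the input frame length must be a multiple of 441.
```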
Port_1 — Mel spectrogram
Mel spectrogram generated from the input audio signal, returned as a 96-by-64 matrix, where:
96 — Represents the number of 25 ms frames in each mel spectrogram
64 — Represents the number of mel bands spanning 125 Hz to 7.5 kHz
The overlap between consecutive 96-by-64 mel spectrograms is determined by the value of the Overlap percentage (%) parameter. You can provide the mel spectrogram as an input to the VGGish pretrained network or to a network that accepts the same inputs as VGGish.
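Because each spectrogram contains 96 frames spaced 10 ms (160 samples at 16 kHz) apart, the overlap percentage determines how much new audio each successive spectrogram consumes. The rounding behavior below is an assumption for illustration, not the block's documented implementation:

```python
def spectrogram_hop(overlap_percent, num_frames=96, hop_samples=160):
    """New 10 ms frames (and 16 kHz samples) consumed between
    consecutive 96-by-64 mel spectrograms, assuming the frame
    advance is rounded to the nearest whole frame."""
    new_frames = round(num_frames * (1 - overlap_percent / 100))
    return new_frames, new_frames * hop_samples

# 50% overlap -> 48 new frames per spectrogram, i.e. 7680 new
# samples (0.48 s) of 16 kHz audio per output matrix.
```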
Sample rate of input signal (Hz) — Sample rate of input signal in Hz
16e3 (default) | positive scalar
Sample rate of the input signal in Hz, specified as a positive scalar.
Overlap percentage (%) — Overlap percentage between consecutive mel spectrograms
50 (default) | [0 100)
Specify the overlap percentage between consecutive mel spectrograms as a scalar in the range [0 100).
The VGGish Preprocess block preprocesses the audio data in the following steps to put it in the format required by the VGGish network.
Cast the audio data to single precision and resample to 16 kHz.
Compute the one-sided short-time Fourier transform (STFT) using a 25 ms periodic Hann window (400 samples) with a 10 ms hop (160 samples) and a 512-point DFT.
Convert the complex spectral values to magnitude and discard phase information.
Pass the one-sided magnitude STFTs through a 64-band mel-spaced filter bank. Doing so converts the 257-length STFT vectors to 64-length vectors in the mel scale.
Convert the 64-length vectors to a log scale.
Buffer the vectors into outputs of size 96-by-64, where 96 is the number of spectra in the mel spectrogram and 64 is the number of mel bands. The overlap between consecutive 96-by-64 mel spectrograms is determined by the value of the Overlap percentage (%) parameter.
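The steps above (omitting the resampling) can be sketched in Python for a mono signal already at 16 kHz. The HTK-style mel formula and the small log offset are assumptions for illustration, not the block's exact implementation:

```python
import numpy as np

def vggish_mel_spectrograms(audio, overlap_percent=50):
    """Sketch of VGGish preprocessing for a 16 kHz mono signal."""
    audio = audio.astype(np.float32)                    # step 1 (resampling omitted)
    win_len, hop, n_fft, n_mels = 400, 160, 512, 64
    # 25 ms periodic Hann window
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(win_len) / win_len)

    # Steps 2-3: one-sided magnitude STFT (257 bins per 25 ms frame)
    n_frames = 1 + (len(audio) - win_len) // hop
    frames = np.stack([audio[i*hop : i*hop + win_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # n_frames x 257

    # Step 4: 64 mel-spaced triangular filters spanning 125 Hz to 7.5 kHz
    # (HTK mel scale assumed here for illustration)
    def hz_to_mel(f):
        return 1127.0 * np.log(1.0 + f / 700.0)
    mel_edges = np.linspace(hz_to_mel(125.0), hz_to_mel(7500.0), n_mels + 2)
    hz_edges = 700.0 * (np.exp(mel_edges / 1127.0) - 1.0)
    bins = np.fft.rfftfreq(n_fft, d=1 / 16000)
    fb = np.zeros((len(bins), n_mels))
    for m in range(n_mels):
        lo, ctr, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        fb[:, m] = np.maximum(0, np.minimum((bins - lo) / (ctr - lo),
                                            (hi - bins) / (hi - ctr)))
    mel = mag @ fb                                      # n_frames x 64

    # Step 5: convert to a log scale (small offset avoids log(0), assumed)
    log_mel = np.log(mel + 0.01)

    # Step 6: buffer into 96-by-64 matrices with the requested overlap
    step = round(96 * (1 - overlap_percent / 100))
    return [log_mel[i:i + 96] for i in range(0, n_frames - 96 + 1, step)]
```

One second of 16 kHz audio yields 98 frames and therefore a single 96-by-64 matrix; longer signals yield additional matrices at the overlap-determined stride.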