Transcribe speech signal to text
To use speech2text with the third-party speech services, you must download the extended Audio Toolbox™ functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.
Using wav2vec 2.0 requires Deep Learning Toolbox™ and installing the pretrained model.
Download wav2vec 2.0 Network
Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.
Type speechClient("wav2vec2.0") into the command line. If the pretrained model for wav2vec 2.0 is not installed, the function provides a download link. To install the model, click the link to download the file and unzip it to a location on the MATLAB path.
Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.
downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
wav2vecLocation = fullfile(tempdir,"wav2vec");
unzip(downloadFile,wav2vecLocation)
addpath(wav2vecLocation)
Check that the installation is successful by typing speechClient("wav2vec2.0") into the command line. If the model is installed, then the function returns a Wav2VecSpeechClient object.
ans = 
  Wav2VecSpeechClient with properties:

    Segmentation: 'word'
      TimeStamps: 0
Perform Speech-to-Text Transcription
Read in an audio file containing speech and listen to it.
[y,fs] = audioread("speech_dft.wav");
soundsc(y,fs)
Create a speechClient object that uses the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.
transcriber = speechClient("wav2vec2.0");
Call speech2text to obtain a transcription of the audio signal.
transcript = speech2text(transcriber,y,fs)
transcript=12×2 table
    Transcript     Confidence
    ___________    __________

    "the"           0.97605
    "discreet"      0.91927
    "fourier"       0.84546
    "transform"     0.89922
    "of"            0.66676
    "a"             0.50026
    "real"          0.88796
    "valued"        0.89913
    "signal"        0.8041
    "is"            0.53891
    "conjugate"     0.98438
    "symmetric"     0.89333
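The word-level table can be post-processed with ordinary table operations. The following is a minimal sketch, not part of the shipped example: it rebuilds the table from the values shown above so that it runs without the pretrained model, joins the words into one sentence, and lists the words whose confidence falls below a threshold. The threshold value of 0.6 and the variable names sentence, threshold, and lowConfidenceWords are assumptions chosen for illustration.

```matlab
% Rebuild the transcript table from the values printed above so the
% sketch runs without the pretrained model installed.
words = ["the";"discreet";"fourier";"transform";"of";"a"; ...
         "real";"valued";"signal";"is";"conjugate";"symmetric"];
confidence = [0.97605;0.91927;0.84546;0.89922;0.66676;0.50026; ...
              0.88796;0.89913;0.8041;0.53891;0.98438;0.89333];
transcript = table(words,confidence,'VariableNames',{'Transcript','Confidence'});

% Join the per-word entries into a single sentence.
sentence = join(transcript.Transcript," ");

% Flag words below an illustrative confidence threshold (assumed value).
threshold = 0.6;
lowConfidenceWords = transcript.Transcript(transcript.Confidence < threshold);

disp(sentence)
disp(lowConfidenceWords)
```

Low-confidence words are natural candidates for manual review. Note that a high confidence does not guarantee a correct word: "discreet" is returned here with a confidence of about 0.92.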
clientObj — Client object
Client object, specified as an object returned by
speechClient. The object is an interface to a pretrained wav2vec 2.0 model
or to a third-party speech service.
Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the pretrained model.
To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.
audioIn — Audio input
Audio input signal, specified as a column vector (single channel).
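Because the input must be a single-channel column vector, multichannel recordings need to be reduced to one channel first. The following is a minimal sketch of one way to do that, by averaging the channels. The synthetic two-channel signal stands in for the output of audioread, and the names stereo and audioIn, the 16 kHz sample rate, and the 220 Hz tone are assumptions for illustration only.

```matlab
% Synthetic two-channel signal standing in for audioread output (assumption).
fs = 16000;                          % sample rate in Hz (illustrative value)
t = (0:fs-1).'/fs;                   % one second of time stamps, column vector
stereo = [sin(2*pi*220*t), 0.5*sin(2*pi*220*t)];

% Average across channels to obtain a single-channel signal.
audioIn = mean(stereo,2);

% Force column orientation in case the signal was stored as a row vector.
audioIn = audioIn(:);

disp(size(audioIn))
```

Averaging is only one option. Selecting a single channel, for example stereo(:,1), also yields a valid input.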
fs — Sample rate (Hz)
Sample rate in Hz, specified as a positive scalar.
timeout — Time to wait for server connection in seconds
10 (default) | positive scalar
Time to wait for initial server connection in seconds, specified as a positive scalar.
This argument applies only when clientObj interfaces with one of the third-party speech services.
transcript — Speech transcript
table | string
Speech transcript of the input audio signal, returned as a table with a column containing the transcript and another column containing the associated confidence metrics.
If clientObj interfaces with the wav2vec 2.0 pretrained model and you set the object's Segmentation property to "none" when creating it, speech2text returns the transcript as a string.
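Because the output can be either a table or a string, code that consumes the transcript may need to handle both types. The following is a minimal sketch of one way to normalize either form to a single string. The values in asTable and asString are made-up stand-ins, not real output from speech2text, and all variable names are assumptions for illustration.

```matlab
% Stand-in values for the two possible return types (made up for
% illustration; these are not real speech2text output).
asTable = table(["hello";"world"],[0.9;0.8], ...
    'VariableNames',{'Transcript','Confidence'});
asString = "hello world";

results = {asTable,asString};
texts = strings(1,numel(results));
for k = 1:numel(results)
    r = results{k};
    if istable(r)
        % Table output: join the Transcript column into one string.
        texts(k) = join(r.Transcript," ");
    else
        % String output: use it as is.
        texts(k) = string(r);
    end
end

disp(texts)
```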
 Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.
Introduced in R2022b