Label Spoken Words in Audio Signals Using External API

This example shows how to label spoken words in Signal Labeler. The example uses the IBM® Watson Speech to Text API and Audio Toolbox™ software. See Speech-to-Text Transcription (Audio Toolbox) for instructions about:

  1. Downloading the Audio Toolbox speech2text extended functionality, available from MATLAB® Central.

  2. Setting up the IBM Watson Speech API, offered through IBM Cloud Services. You must create an IBM Cloud account, create a Speech to Text service instance, and then go to the service dashboard and copy your credentials (the API Key and URL values). See the Getting Started Tutorial in the IBM documentation for more details.

Load Speech Data

Load an audio data file containing the sentence "Oak is strong, and also gives shade" spoken by a male voice. The signal is sampled at 44,100 Hz.

[y,fs] = audioread('oak.m4a');

% To hear, type soundsc(y,fs)

  1. Start Signal Analyzer and drag the signal to the Signal table. Select the signal.

  2. Add time information: on the Analyzer tab, click Time Values, select Sample Rate and Start Time, and specify fs as the sample rate.

  3. On the Analyzer tab, click Label. The signal appears in the Labeled Signal Set browser.
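
Adding time information in the app amounts to telling it that the samples are uniformly spaced at fs, starting at time zero. The following is a minimal command-line sketch of the same idea, not the app's internal representation. It uses a synthetic tone in place of oak.m4a so that it runs without the audio file; the tone and the variable names tt, dt, and durSeconds are assumptions made for illustration.

```matlab
% Synthetic stand-in for the audio data (assumption), so the sketch
% runs without oak.m4a
fs = 44100;                           % sample rate in Hz
y = 0.5*sin(2*pi*440*(0:fs-1)'/fs);   % one second of a 440 Hz tone

% Attach time information: uniform sampling at fs, starting at 0 s
tt = timetable(y,'SampleRate',fs);

% The row times are durations spaced 1/fs apart
dt = seconds(tt.Time(2) - tt.Time(1));

% Signal duration in seconds
durSeconds = height(tt)/fs;
```

A timetable built this way carries its own time axis, which is also the form in which getSignal returns the labeled signal later in this example.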

Define Label

Define a label to attach to the signal. Click Add Definition on the Label tab. Specify the Label Name as Words, select a Label Type of ROI, and enter the Data Type as string.
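
The same label definition can also be created at the command line. This is a sketch of an equivalent definition using signalLabelDefinition, not a step the app requires; the variable name lblDef and the description text are assumptions.

```matlab
% Command-line counterpart of the label defined in the app:
% a region-of-interest (ROI) label named Words that stores strings
lblDef = signalLabelDefinition('Words', ...
    'LabelType','roi', ...
    'LabelDataType','string', ...
    'Description','Spoken words and their time limits');
```

An ROI label stores a pair of time limits together with a value, which suits a word that occupies an interval of the recording.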

Create Custom Autolabeling Function

Create a custom function to label the words spoken in the audio file. (Code for the stt function appears later in the example.)

  1. Go to the directory where you have stored the speech2text P-code files and the JSON file that stores your IBM Cloud credentials.

  2. To create the function, on the Label tab, click Automate Value ▼ and select Add Custom Function. Signal Labeler shows a dialog box asking you to enter the name, description, and label type of the function to add. Enter stt in the Name field and select ROI as the Label Type. You can leave the Description field empty or you can enter your own description.

  3. Copy the function code and paste it in the empty template that appears. Save the file. The function appears in the gallery.
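
To see what a custom ROI labeling function has to return without calling an external service, consider the following sketch. The function loudRegions is hypothetical and is not part of this example or of any toolbox. It follows the same signature as stt and marks every run of samples whose magnitude exceeds 10% of the peak (an arbitrary threshold chosen for illustration).

```matlab
function [labelVals,labelLocs] = loudRegions(x,t,parentLabelVal,parentLabelLoc,varargin)
% LOUDREGIONS Hypothetical custom ROI labeler (illustration only).
%   x is the signal and t is its time vector. The function returns one
%   string per region in labelVals and the matching [start stop] times,
%   one row per region, in labelLocs.

x = x(:);
t = t(:);

% Samples whose magnitude exceeds 10% of the peak (arbitrary threshold)
active = abs(x) > 0.1*max(abs(x));

% Find where runs of active samples begin and end
d = diff([false; active; false]);
starts = find(d == 1);
stops = find(d == -1) - 1;

% One row of time limits and one string value per region
labelLocs = [t(starts) t(stops)];
labelVals = repmat("loud",numel(starts),1);

end
```

For example, calling loudRegions([0 0 1 1 0 0 1 0]',(0:7)'/8,[],[]) returns two regions, one from 0.25 s to 0.375 s and one that starts and ends at 0.75 s. The stt function in this example returns its output in the same form: a string column vector and a two-column matrix of time limits.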

Locate and Identify Spoken Words

Locate and identify the words spoken in the input signal.

  1. In the Labeled Signal Set browser, select the check box next to y.

  2. Select Words in the Label Definitions browser.

  3. In the Automate Value gallery, select stt.

  4. Click Auto-Label and click OK in the dialog box that appears.

Signal Labeler locates and labels the spoken words.

Export Labeled Signal

Export the labeled signal. On the Label tab, click Save Labels. In the dialog box that appears, give the name transcribedAudio to the labeled signal set. Clicking OK returns you to Signal Analyzer. On the Signal table, select transcribedAudio and right-click to export it to a file called Transcription.mat.

Load the labeled signal set. The set has only one member. Get the names of the labels, and use the name to obtain and display the transcribed words.

load Transcription

ln = getLabelNames(transcribedAudio);

v = getLabelValues(transcribedAudio,1,ln)
v=7×2 table
     ROILimits       Value  
    ____________    ________

    0.09    0.56    "oak"   
    0.59    0.97    "is"    
       1    1.78    "strong"
    1.94    2.19    "and"   
    2.22    2.67    "also"  
    2.67    3.22    "gives" 
    3.25    3.91    "shade" 
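
A labeled signal set with the same layout can also be built at the command line, which can help when checking what the exported table should look like. The following is a sketch under stated assumptions: the signal is a silent placeholder instead of the recording, only the first two words are entered, and the variable names lss, rois, and vals are chosen for illustration.

```matlab
fs = 44100;
y = zeros(4*fs,1);      % placeholder signal (assumption), 4 s of silence

% ROI label definition matching the one used in the app
lblDef = signalLabelDefinition('Words','LabelType','roi','LabelDataType','string');

% Labeled signal set with a single member named y
lss = labeledSignalSet({y},lblDef,'SampleRate',fs,'MemberNames',{'y'});

% Enter the first two words by hand, one ROI per call
rois = [0.09 0.56; 0.59 0.97];
vals = ["oak" "is"];
for kj = 1:numel(vals)
    setLabelValue(lss,1,'Words',rois(kj,:),vals(kj));
end

% Read the labels back as a table with ROILimits and Value variables
v = getLabelValues(lss,1,'Words');
```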

Rearrange the words so that the sentence reads "Oak gives shade, and also is strong." Plot the signal using a different color for each word.

k = v([1 6:7 4:5 2:3],:);

s = getSignal(transcribedAudio,1);

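sent = [];                          % audio of the rearranged sentence
sgs = NaN(height(s),height(k));     % one column per word; NaN leaves gaps in the plot
lgd = [];                           % time at the center of each word, in seconds

for kj = 1:height(k)
    lm = length(sent);              % number of samples already placed
    % Extract the samples of word kj using its ROI limits
    word = s.y(timerange(seconds(k.ROILimits(kj,1)),seconds(k.ROILimits(kj,2))));
    sent = [sent;word];
    sgs(lm+(1:length(word)),kj) = word;
    lgd = [lgd;(length(sent)-length(word)/2)/fs];
end

% Remove the unused rows at the end
sgs(length(sent)+1:end,:) = [];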

% To hear, type soundsc(sent,fs)

plot((0:length(sgs)-1)/fs,sgs)
text(lgd,-0.7*ones(size(lgd)),k.Value,'HorizontalAlignment',"center")
axis tight
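
The loop above selects each word with timerange, which by default includes the start time and excludes the end time. For a uniformly sampled signal that starts at time zero, the same selection can be written with sample indices. The following is a sketch of that alternative, not the method this example uses; the ramp signal, the ROI limits, and the word order are placeholders chosen for illustration.

```matlab
fs = 1000;                    % assumed sample rate for this sketch
y = (1:4000)'/4000;           % placeholder signal (a ramp), 4 s long
roi = [0.5 1.0; 2.0 2.5];     % hypothetical [start stop] limits in seconds
order = [2 1];                % play the second region first

sent = [];
for kj = order
    % The sample at time tau has index round(tau*fs)+1
    first = round(roi(kj,1)*fs) + 1;
    % Exclude the sample at the stop time, as timerange does by default
    last = round(roi(kj,2)*fs);
    sent = [sent; y(first:last)];
end
```

Index arithmetic avoids building a timetable, but it assumes a zero start time and a constant sample rate. The timerange approach makes neither assumption.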

stt Function: Locate and Identify Spoken Words

This function uses the IBM Watson Speech API and the Audio Toolbox speech2text extended functionality to extract spoken words from an audio file.

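function [labelVals,labelLocs] = stt(x,t,parentLabelVal,parentLabelLoc,varargin)

% Create an IBM speech client that returns word-level timestamps
aspeechObjectIBM = speechClient('IBM','timestamps',true,'model','en-US_NarrowbandModel');

% Recover the sample rate from the uniformly spaced time vector
fs = 1/(t(2)-t(1));

% Send the signal to the service and get the transcript
tixt = speech2text(aspeechObjectIBM,x,fs);

% Each timestamp entry holds a word, its start time, and its end time
numLabels = numel(tixt.TimeStamps{:});
labelVals = strings(numLabels,1);
labelLocs = zeros(numLabels,2);

for idx = 1:numLabels
    labelVals(idx) = tixt.TimeStamps{:}{idx}{1};
    labelLocs(idx,1) = tixt.TimeStamps{:}{idx}{2};
    labelLocs(idx,2) = tixt.TimeStamps{:}{idx}{3};
end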

end
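
Because stt needs IBM Cloud credentials to run, it can help to trace its parsing loop on mocked data. The following sketch assumes the nesting that the indexing in stt implies: a table whose TimeStamps variable holds one cell, which in turn holds one {word, start, end} cell per word. The mock values and the variable name words are assumptions for illustration, not output from the service.

```matlab
% Mocked transcript (assumption) with the nesting that stt indexes into
ts = {{ {'oak',0.09,0.56}, {'is',0.59,0.97}, {'strong',1.00,1.78} }};
tixt = table(ts,'VariableNames',{'TimeStamps'});

% Unwrap the outer cell once instead of repeating TimeStamps{:}
words = tixt.TimeStamps{1};

numLabels = numel(words);
labelVals = strings(numLabels,1);
labelLocs = zeros(numLabels,2);

for idx = 1:numLabels
    labelVals(idx) = words{idx}{1};                       % the word
    labelLocs(idx,:) = [words{idx}{2} words{idx}{3}];     % [start stop] in seconds
end

% stt recovers the sample rate from the spacing of the time vector
t = (0:9)'/44100;
fs = 1/(t(2)-t(1));
```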
