Speaker verification, or authentication, is the task of confirming that the identity of a speaker is who they purport to be. Speaker verification has been an active research area for many years. An early performance breakthrough was to use a Gaussian mixture model and universal background model (GMM-UBM) [1] on acoustic features (usually `mfcc`

). For an example, see Speaker Verification Using Gaussian Mixture Model. One of the main difficulties of GMM-UBM systems involves intersession variability. Joint factor analysis (JFA) was proposed to compensate for this variability by separately modeling inter-speaker variability and channel or session variability [2] [3]. However, [4] discovered that channel factors in the JFA also contained information about the speakers, and proposed combining the channel and speaker spaces into a *total variability* space. Intersession variability was then compensated for by using backend procedures, such as linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), followed by a scoring, such as the cosine similarity score. [5] proposed replacing the cosine similarity scoring with a probabilistic LDA (PLDA). While i-vectors were originally proposed for speaker verification, they have been applied to many problems, like language recognition, speaker diarization, emotion recognition, age estimation, and anti-spoofing [10]. Recently, deep learning techniques have been proposed to replace i-vectors with *d-vectors* or *x-vectors* [8] [6].

In this example, you develop a simple i-vector system for speaker verification that uses an LDA-WCCN backend with cosine similarity scoring. The example also makes the popular simplification of a diagonal covariance matrix for the GMM.

Throughout the example, you will find live controls on tunable parameters. Changing the controls does not rerun the example. If you change a control, you must rerun the example.

This example uses the Pitch Tracking Database from Graz University of Technology (PTDB-TUG) [7]. The data set consists of 20 English native speakers reading 2342 phonetically rich sentences from the TIMIT corpus. Download and extract the data set. Depending on your system, downloading and extracting the data set can take approximately 1.5 hours.

url = 'https://www2.spsc.tugraz.at/databases/PTDB-TUG/SPEECH_DATA_ZIPPED.zip'; downloadFolder = tempdir; datasetFolder = fullfile(downloadFolder,'PTDB-TUG'); if ~exist(datasetFolder,'dir') disp('Downloading PTDB-TUG (3.9 G) ...') unzip(url,datasetFolder) end

Create an `audioDatastore`

object that points to the data set. The data set was originally intended for use in pitch-tracking training and evaluation, and includes laryngograph readings and baseline pitch decisions. Use only the original audio recordings.

ads = audioDatastore([fullfile(datasetFolder,"SPEECH DATA","FEMALE","MIC"),fullfile(datasetFolder,"SPEECH DATA","MALE","MIC")], ... 'IncludeSubfolders',true, ... 'FileExtensions','.wav'); fileNames = ads.Files;

The file names contain the speaker IDs. Decode the file names to set the labels on the `audioDatastore`

object.

speakerIDs = extractBetween(fileNames,'mic_','_'); ads.Labels = categorical(speakerIDs); countEachLabel(ads)

`ans=`*20×2 table*
Label Count
_____ _____
F01 236
F02 236
F03 236
F04 236
F05 236
F06 236
F07 236
F08 234
F09 236
F10 236
M01 236
M02 236
M03 236
M04 236
M05 236
M06 236
⋮

Separate the `audioDatastore`

object in two: one for training, and one for enrollment and verification. The training set contains 16 speakers. The enrollment and verification set contains the other four speakers. You will split the enrollment and verification set later in the example.

speakersToTest = categorical(["M01","M05","F01","F05"]); adsTrain = subset(ads,~ismember(ads.Labels,speakersToTest)); adsEnrollAndVerify = subset(ads,ismember(ads.Labels,speakersToTest)); adsTrain = shuffle(adsTrain); adsEnrollAndVerify = shuffle(adsEnrollAndVerify);

Display the label distributions of the two `audioDatastore`

objects.

countEachLabel(adsTrain)

`ans=`*16×2 table*
Label Count
_____ _____
F02 236
F03 236
F04 236
F06 236
F07 236
F08 234
F09 236
F10 236
M02 236
M03 236
M04 236
M06 236
M07 236
M08 236
M09 236
M10 236

countEachLabel(adsEnrollAndVerify)

`ans=`*4×2 table*
Label Count
_____ _____
F01 236
F05 236
M01 236
M05 236

Read an audio file from the training data set, listen to it, and plot it. Reset the datastore.

[audio,audioInfo] = read(adsTrain); fs = audioInfo.SampleRate; t = (0:size(audio,1)-1)/fs; sound(audio,fs) plot(t,audio) xlabel('Time (s)') ylabel('Amplitude') axis([0 t(end) -1 1]) title('Sample Utterance from Training Set')

reset(adsTrain)

You can reduce the data set used in this example to speed up the runtime at the cost of performance. In general, reducing the data set is a good practice for development and debugging.

reduceDataset = false; if reduceDataset adsTrain = splitEachLabel(adsTrain,21); adsEnrollAndVerify = splitEachLabel(adsEnrollAndVerify,21); end

Create an `audioFeatureExtractor`

object to extract 20 MFCCs, 20 delta-MFCCs, and 20 delta-delta MFCCs. Use a delta window length of 9. Extract features from 25 ms Hann windows with a 10 ms hop.

numCoeffs = 20; deltaWindowLength = 9; windowDuration = 0.025; hopDuration = 0.01; windowSamples = round(windowDuration*fs); hopSamples = round(hopDuration*fs); overlapSamples = windowSamples - hopSamples; afe = audioFeatureExtractor( ... 'SampleRate',fs, ... 'Window',hann(windowSamples,'periodic'), ... 'OverlapLength',overlapSamples, ... ... 'mfcc',true, ... 'mfccDelta',true, ... 'mfccDeltaDelta',true); setExtractorParams(afe,'mfcc','DeltaWindowLength',deltaWindowLength,'NumCoeffs',numCoeffs)

Extract features from the audio read from the training datastore. Features are returned as a `numHops`

-by-`numFeatures`

matrix.

features = extract(afe,audio); [numHops,numFeatures] = size(features)

numHops = 559

numFeatures = 60

Training an i-vector system is computationally expensive and time-consuming. If you have Parallel Computing Toolbox™, you can spread the work across multiple cores to speed up the example. Determine the optimal number of partitions for your system. If you do not have Parallel Computing Toolbox™, use a single partition.

if ~isempty(ver('parallel')) && ~reduceDataset pool = gcp; numPar = numpartitions(adsTrain,pool); else numPar = 1; end

Starting parallel pool (parpool) using the 'local' profile ... Connected to the parallel pool (number of workers: 6).

Use the helper function, `helperFeatureExtraction`

, to extract all features from the data set. The helperFeatureExtraction function extracts MFCC from regions of speech in the audio. The speech detection is performed by the `detectSpeech`

function.

featuresAll = {}; tic parfor ii = 1:numPar adsPart = partition(adsTrain,numPar,ii); featuresPart = cell(0,numel(adsPart.Files)); for iii = 1:numel(adsPart.Files) audioData = read(adsPart); featuresPart{iii} = helperFeatureExtraction(audioData,afe,[]); end featuresAll = [featuresAll,featuresPart]; end allFeatures = cat(2,featuresAll{:}); fprintf('Feature extraction from training set complete (%0.0f seconds).',toc)

Feature extraction from training set complete (71 seconds).

Calculate the global mean and standard deviation of each feature. You will use these in future calls to the `helperFeatureExtraction`

function to normalize the features.

normFactors.Mean = mean(allFeatures,2,'omitnan'); normFactors.STD = std(allFeatures,[],2,'omitnan');

Initialize the Gaussian mixture model (GMM) that will be the universal background model (UBM) in the i-vector system. The component weights are initialized as evenly distributed. Systems trained on the TIMIT data set usually contain around 2048 components.

numComponents = 256; alpha = ones(1,numComponents)/numComponents; mu = randn(numFeatures,numComponents); vari = rand(numFeatures,numComponents) + eps; ubm = struct('ComponentProportion',alpha,'mu',mu,'sigma',vari);

Train the UBM using the expectation-maximization (EM) algorithm.

maxIter = 10; tic for iter = 1:maxIter tic % EXPECTATION N = zeros(1,numComponents); F = zeros(numFeatures,numComponents); S = zeros(numFeatures,numComponents); L = 0; parfor ii = 1:numPar adsPart = partition(adsTrain,numPar,ii); while hasdata(adsPart) audioData = read(adsPart); % Extract features Y = helperFeatureExtraction(audioData,afe,normFactors); % Compute a posteriori log-liklihood logLikelihood = helperGMMLogLikelihood(Y,ubm); % Compute a posteriori normalized probability amax = max(logLikelihood,[],1); logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1)); gamma = exp(logLikelihood - logLikelihoodSum)'; % Compute Baum-Welch statistics n = sum(gamma,1); f = Y * gamma; s = (Y.*Y) * gamma; % Update the sufficient statistics over utterances N = N + n; F = F + f; S = S + s; % Update the log-likelihood L = L + sum(logLikelihoodSum); end end % Print current log-likelihood fprintf('Training UBM: %d/%d complete (%0.0f seconds), Log-likelihood = %0.0f\n',iter,maxIter,toc,L) % MAXIMIZATION N = max(N,eps); ubm.ComponentProportion = max(N/sum(N),eps); ubm.ComponentProportion = ubm.ComponentProportion/sum(ubm.ComponentProportion); ubm.mu = F./N; ubm.sigma = max(S./N - ubm.mu.^2,eps); end

Training UBM: 1/10 complete (112 seconds), Log-likelihood = -140389008 Training UBM: 2/10 complete (104 seconds), Log-likelihood = -80118143 Training UBM: 3/10 complete (107 seconds), Log-likelihood = -76497460 Training UBM: 4/10 complete (104 seconds), Log-likelihood = -74883100 Training UBM: 5/10 complete (103 seconds), Log-likelihood = -74039778 Training UBM: 6/10 complete (108 seconds), Log-likelihood = -73545747 Training UBM: 7/10 complete (106 seconds), Log-likelihood = -73241619 Training UBM: 8/10 complete (103 seconds), Log-likelihood = -73044282 Training UBM: 9/10 complete (103 seconds), Log-likelihood = -72905317 Training UBM: 10/10 complete (103 seconds), Log-likelihood = -72799523

The Baum-Welch statistics are the * N* (zeroth order) and

`F`

$${\mathit{N}}_{\mathit{c}}\left(\mathit{s}\right)=\sum _{\mathit{t}}{\gamma}_{\mathit{t}}\left(\mathit{c}\right)$$

$${\mathit{F}}_{\mathit{c}}\left(\mathit{s}\right)=\sum _{\mathit{t}}{\gamma}_{\mathit{t}}\left(\mathit{c}\right){\mathit{Y}}_{\mathit{t}}$$

${\mathit{Y}}_{\mathit{t}}$ is the feature vector at time $\mathit{t}$.

$\mathit{s}\in \left\{{\mathit{s}}_{1},{\mathit{s}}_{2,}...,{\mathit{s}}_{\mathit{N}}\right\}$, where $\mathit{N}$ is the number of speakers. For the purposes of training the total variability space, each audio file is considered a separate speaker (whether or not it belongs to a physical single speaker).

${\gamma}_{\mathit{t}}\left(\mathit{c}\right)$ is the posterior probability that the UBM component $\mathit{c}$ accounts for the feature vector ${\mathit{Y}}_{\mathit{t}}$.

Calculate the zeroth and first order Baum-Welch statistics over the training set.

numSpeakers = numel(adsTrain.Files); Nc = {}; Fc = {}; tic parfor ii = 1:numPar adsPart = partition(adsTrain,numPar,ii); numFiles = numel(adsPart.Files); Npart = cell(1,numFiles); Fpart = cell(1,numFiles); for jj = 1:numFiles audioData = read(adsPart); % Extract features Y = helperFeatureExtraction(audioData,afe,normFactors); % Compute a posteriori log-likelihood logLikelihood = helperGMMLogLikelihood(Y,ubm); % Compute a posteriori normalized probability amax = max(logLikelihood,[],1); logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1)); gamma = exp(logLikelihood - logLikelihoodSum)'; % Compute Baum-Welch statistics n = sum(gamma,1); f = Y * gamma; Npart{jj} = reshape(n,1,1,numComponents); Fpart{jj} = reshape(f,numFeatures,1,numComponents); end Nc = [Nc,Npart]; Fc = [Fc,Fpart]; end fprintf('Baum-Welch statistics completed (%0.0f seconds).\n',toc)

Baum-Welch statistics completed (102 seconds).

Expand the statistics into matrices and center $\mathit{F}\left(\mathit{s}\right)$, as described in [3], such that

$\mathit{N}\left(\mathit{s}\right)$ is a $\mathit{C}\text{\hspace{0.17em}}\mathit{F}\times \mathit{C}\text{\hspace{0.17em}}\mathit{F}$ diagonal matrix whose blocks are ${\mathit{N}}_{\mathit{c}}\left(\mathit{s}\right)\mathit{I}\text{\hspace{0.17em}}\left(\mathit{c}=1,...\mathit{C}\right)$.

$\mathit{F}\left(\mathit{s}\right)$ is a $\mathit{C}\text{\hspace{0.17em}}\mathit{F}\times 1$ supervector obtained by concatenating ${\mathit{F}}_{\mathit{c}}\left(\mathit{s}\right)\text{\hspace{0.17em}\hspace{0.17em}}\left(\mathit{c}=1,...\mathit{C}\right)$.

$\mathit{C}$ is the number of components in the UBM.

$\mathit{F}$ is the number of features in a feature vector.

N = Nc; F = Fc; muc = reshape(ubm.mu,numFeatures,1,[]); for s = 1:numSpeakers N{s} = repelem(reshape(Nc{s},1,[]),numFeatures); F{s} = reshape(Fc{s} - Nc{s}.*muc,[],1); end

Because this example assumes a diagonal covariance matrix for the UBM, `N`

are also diagonal matrices, and are saved as vectors for efficient computation.

In the i-vector model, the ideal speaker supervector consists of a speaker-independent component and a speaker-dependent component. The speaker-dependent component consists of the total variability space model and the speaker's i-vector.

$$\mathit{M}=\mathit{m}+\mathit{Tw}$$

$\mathit{M}$ is the speaker utterance supervector

$\mathit{m}$ is the speaker- and channel-independent supervector, which can be taken to be the UBM supervector.

$\mathit{T}$ is a low-rank rectangular matrix and represents the total variability subspace.

$\mathit{w}$ is the i-vector for the speaker

The dimensionality of the i-vector, $\mathit{w}$, is typically much lower than the C F -dimensional speaker utterance supervector, making the i-vector, or i-vectors, a much more compact and tractable representation.

To train the total variability space, $\mathit{T}$, first randomly initialize `T`

, then perform these steps iteratively [3]:

Calculate the posterior distribution of the hidden variable.

$${\mathit{l}}_{\mathit{T}}\left(\mathit{s}\right)=\mathit{I}+{\mathit{T}}^{\prime}\times {\Sigma}^{-1}\times \mathit{N}\left(\mathit{s}\right)\times \mathit{T}$$

2. Accumulate statistics across the speakers.

$${\rm K}=\sum _{\mathit{s}}\mathit{F}\left(\mathit{s}\right)\times {\left({\mathit{l}}_{\mathit{T}}^{-1}\left(\mathit{s}\right)\times {\mathit{T}}^{\prime}\times {\Sigma}^{-1}\times \mathit{F}\left(\mathit{s}\right)\right)}^{\prime}$$

${\mathit{A}}_{\mathit{c}}=\sum _{\mathit{s}}{\mathit{N}}_{\mathit{c}}\left(\mathit{s}\right){\mathit{l}}_{\mathit{T}}^{-1}\left(\mathit{s}\right)$

3. Update the total variability space.

$${\mathit{T}}_{\mathit{c}}={\mathit{A}}_{\mathit{c}}^{-1}\times {\rm K}$$

$$\mathit{T}=\left[\begin{array}{c}{\mathit{T}}_{1}\\ {\mathit{T}}_{2}\\ \vdots \\ {\mathit{T}}_{\mathit{C}}\end{array}\right]$$

[3] proposes initializing $\Sigma $ by the UBM variance, and then updating $\Sigma $ according to the equation:

$$\Sigma ={\left(\sum _{\mathit{s}}\mathit{N}\left(\mathit{s}\right)\right)}^{-1}\left(\left(\sum _{\mathit{s}}\mathit{S}\left(\mathit{s}\right)\right)-\mathrm{diag}\left({\rm K}\times {\mathit{T}}^{\prime}\right)\right)$$

where S(s) is the centered second-order Baum-Welch statistic. However, updating $\Sigma $ is often dropped in practice as it has little effect. This example does not update $\Sigma $.

Create the sigma variable.

Sigma = ubm.sigma(:);

Specify the dimension of the total variability space. A typical value used for the TIMIT data set is 1000.

`numTdim = 256;`

Initialize `T`

and the identity matrix, and preallocate cell arrays.

T = randn(numel(ubm.sigma),numTdim); T = T/norm(T); I = eye(numTdim); Ey = cell(numSpeakers,1); Eyy = cell(numSpeakers,1); Linv = cell(numSpeakers,1);

Set the number of iterations for training. A typical value reported is 20.

`numIterations = 5;`

Run the training loop.

for iterIdx = 1:numIterations tic % 1. Calculate the posterior distribution of the hidden variable TtimesInverseSSdiag = (T./Sigma)'; parfor s = 1:numSpeakers L = (I + TtimesInverseSSdiag.*N{s}*T); Linv{s} = pinv(L); Ey{s} = Linv{s}*TtimesInverseSSdiag*F{s}; Eyy{s} = Linv{s} + Ey{s}*Ey{s}'; end % 2. Accumlate statistics across the speakers Eymat = cat(2,Ey{:}); FFmat = cat(2,F{:}); Kt = FFmat*Eymat'; K = mat2cell(Kt',numTdim,repelem(numFeatures,numComponents)); newT = cell(numComponents,1); for c = 1:numComponents AcLocal = zeros(numTdim); for s = 1:numSpeakers AcLocal = AcLocal + Nc{s}(:,:,c)*Eyy{s}; end % 3. Update the Total Variability Space newT{c} = (pinv(AcLocal)*K{c})'; end T = cat(1,newT{:}); fprintf('Training Total Variability Space: %d/%d complete (%0.0f seconds).\n',iterIdx,numIterations,toc) end

Training Total Variability Space: 1/5 complete (127 seconds). Training Total Variability Space: 2/5 complete (130 seconds). Training Total Variability Space: 3/5 complete (128 seconds). Training Total Variability Space: 4/5 complete (128 seconds). Training Total Variability Space: 5/5 complete (128 seconds).

Once the total variability space is calculated, you can calculate the i-vectors as [4]:

$$\mathit{w}={\left(\mathit{I}+{\mathit{T}}^{\prime}{\Sigma}^{-1\text{\hspace{0.17em}}}\mathrm{NT}\right)}^{\prime}{\mathit{T}}^{\prime}{\Sigma}^{-1\text{\hspace{0.17em}}}\mathit{F}$$

At this point, you are still considering each training file as a separate speaker. However, in the next step, when you train a projection matrix to reduce dimensionality and increase inter-speaker differences, the i-vectors must be labeled with the appropriate, distinct speaker IDs.

Create a cell array where each element of the cell array contains a matrix of i-vectors across files for a particular speaker.

speakers = unique(adsTrain.Labels); numSpeakers = numel(speakers); ivectorPerSpeaker = cell(numSpeakers,1); TS = T./Sigma; TSi = TS'; ubmMu = ubm.mu; tic parfor speakerIdx = 1:numSpeakers % Subset the datastore to the speaker you are adapting. adsPart = subset(adsTrain,adsTrain.Labels==speakers(speakerIdx)); numFiles = numel(adsPart.Files); ivectorPerFile = zeros(numTdim,numFiles); for fileIdx = 1:numFiles audioData = read(adsPart); % Extract features Y = helperFeatureExtraction(audioData,afe,normFactors); % Compute a posteriori log-likelihood logLikelihood = helperGMMLogLikelihood(Y,ubm); % Compute a posteriori normalized probability amax = max(logLikelihood,[],1); logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1)); gamma = exp(logLikelihood - logLikelihoodSum)'; % Compute Baum-Welch statistics n = sum(gamma,1); f = Y * gamma - n.*(ubmMu); ivectorPerFile(:,fileIdx) = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:); end ivectorPerSpeaker{speakerIdx} = ivectorPerFile; end fprintf('I-vectors extracted from training set (%0.0f seconds).\n',toc)

I-vectors extracted from training set (222 seconds).

Many different backends have been proposed for i-vectors. The most straightforward and still well-performing one is the combination of linear discriminant analysis (LDA) and within-class covariance normalization (WCCN).

Create a matrix of the training vectors and a map indicating which i-vector corresponds to which speaker. Initialize the projection matrix as an identity matrix.

w = ivectorPerSpeaker; utterancePerSpeaker = cellfun(@(x)size(x,2),w); ivectorsTrain = cat(2,w{:}); projectionMatrix = eye(size(w{1},1));

LDA attempts to minimize the intra-class variance and maximize the variance between speakers. It can be calculated as outlined in [4]:

**Given**:

$${\mathit{S}}_{\mathit{b}}=\sum _{\mathit{s}=1}^{\mathit{S}}\left(\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}-\stackrel{\u203e}{\mathit{w}}\right){\left(\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}-\stackrel{\u203e}{\mathit{w}}\right)}^{\prime}$$

$${\mathit{S}}_{\mathit{w}}=\sum _{\mathit{s}=1}^{\mathit{S}}\frac{1}{{\mathit{n}}_{\mathit{s}}}\sum _{\mathit{i}=1}^{{\mathit{n}}_{\mathit{s}}}\left({\mathit{w}}_{\mathit{i}}^{\mathit{s}}-\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}\right){\left({\mathit{w}}_{\mathit{i}}^{\mathit{s}}-\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}\right)}^{\prime}$$

where

$\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}=\left(\frac{1}{{\mathit{n}}_{\mathit{s}}}\right){\sum}_{\mathit{i}=1}^{{\mathit{n}}_{\mathit{s}}}{\mathit{w}}_{\mathit{i}}^{\mathit{s}}$ is the mean of i-vectors for each speaker.

$\stackrel{\u203e}{\mathit{w}}=\frac{1}{\mathit{N}}{\sum}_{\mathit{s}=1}^{\mathit{S}}{\sum}_{\mathit{i}=1}^{{\mathit{n}}_{\mathit{s}}}{\mathit{w}}_{\mathit{i}}^{\mathit{s}}$ is the mean i-vector across all speakers.

${\mathit{n}}_{\mathit{s}}$ is the number of utterances for each speaker.

**Solve** the eigenvalue equation for the best eigenvectors:

$${\mathit{S}}_{\mathit{b}}\mathit{v}=\lambda \text{\hspace{0.17em}}{\mathit{S}}_{\mathit{w}}\mathit{v}$$

The best eigenvectors are those with the highest eigenvalues.

performLDA = true; if performLDA tic numEigenvectors = 16; Sw = zeros(size(projectionMatrix,1)); Sb = zeros(size(projectionMatrix,1)); wbar = mean(cat(2,w{:}),2); for ii = 1:numel(w) ws = w{ii}; wsbar = mean(ws,2); Sb = Sb + (wsbar - wbar)*(wsbar - wbar)'; Sw = Sw + cov(ws',1); end [A,~] = eigs(Sb,Sw,numEigenvectors); A = (A./vecnorm(A))'; ivectorsTrain = A * ivectorsTrain; w = mat2cell(ivectorsTrain,size(ivectorsTrain,1),utterancePerSpeaker); projectionMatrix = A * projectionMatrix; fprintf('LDA projection matrix calculated (%0.2f seconds).',toc) end

LDA projection matrix calculated (0.41 seconds).

WCCN attempts to scale the i-vector space inversely to the in-class covariance, so that directions of high intra-speaker variability are de-emphasized in i-vector comparisons [9].

**Given** the within-class covariance matrix:

$$\mathit{W}=\frac{1}{\mathit{S}}\sum _{\mathit{s}=1}^{\mathit{S}}\frac{1}{{\mathit{n}}_{\mathit{s}}}\sum _{\mathit{i}=1}^{{\mathit{n}}_{\mathit{s}}}\left({\mathit{w}}_{\mathit{i}}^{\mathit{s}}-\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}\right){\left({\mathit{w}}_{\mathit{i}}^{\mathit{s}}-\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}\right)}^{\prime}$$

where

$\stackrel{\u203e}{{\mathit{w}}_{\mathit{s}}}=\left(\frac{1}{{\mathit{n}}_{\mathit{s}}}\right){\sum}_{\mathit{i}=1}^{{\mathit{n}}_{\mathit{s}}}{\mathit{w}}_{\mathit{i}}^{\mathit{s}}$ is the mean of i-vectors for each speaker.

${\mathit{n}}_{\mathit{s}}$ is the number of utterances for each speaker.

**Solve** for B using Cholesky decomposition:

$${\mathit{W}}^{-1}={\mathrm{BB}}^{\prime}$$

performWCCN = true; if performWCCN tic alpha = 0.9; W = zeros(size(projectionMatrix,1)); for ii = 1:numel(w) W = W + cov(w{ii}',1); end W = W/numel(w); W = (1 - alpha)*W + alpha*eye(size(W,1)); B = chol(pinv(W),'lower'); projectionMatrix = B * projectionMatrix; fprintf('WCCN projection matrix calculated (%0.4f seconds).',toc) end

WCCN projection matrix calculated (0.0076 seconds).

The training stage is now complete. You can now use the universal background model (UBM), total variability space (T), and projection matrix to enroll and verify speakers.

Enroll new speakers that were not in the training data set. First, split the `adsEnrollAndVerify`

audio datastore object into enroll and verify. Use 20 utterances per speaker for enrollment, and the rest for testing. Increasing the number of utterances per speaker for enrollment should increase the performance of the system.

```
numFilesPerSpeakerForEnrollment = 20;
[adsEnroll,adsVerify] = splitEachLabel(adsEnrollAndVerify,numFilesPerSpeakerForEnrollment);
adsVerify = shuffle(adsVerify);
adsEnroll = shuffle(adsEnroll);
countEachLabel(adsEnroll)
```

`ans=`*4×2 table*
Label Count
_____ _____
F01 20
F05 20
M01 20
M05 20

countEachLabel(adsVerify)

`ans=`*4×2 table*
Label Count
_____ _____
F01 216
F05 216
M01 216
M05 216

Create i-vectors for each file for each speaker in the enroll set using the this sequence of steps:

Feature Extraction

Baum-Welch Statistics: Determine the zeroth and first order statistics

i-vector Extraction

Intersession compensation

Then average the i-vectors across files to create an i-vector model for the speaker.

speakers = unique(adsEnroll.Labels); numSpeakers = numel(speakers); enrolledSpeakersByIdx = cell(numSpeakers,1); tic parfor speakerIdx = 1:numSpeakers % Subset the datastore to the speaker you are adapting. adsPart = subset(adsEnroll,adsEnroll.Labels==speakers(speakerIdx)); numFiles = numel(adsPart.Files); ivectorMat = zeros(size(projectionMatrix,1),numFiles); for fileIdx = 1:numFiles audioData = read(adsPart); % Extract features Y = helperFeatureExtraction(audioData,afe,normFactors); % Compute a posteriori log-likelihood logLikelihood = helperGMMLogLikelihood(Y,ubm); % Compute a posteriori normalized probability amax = max(logLikelihood,[],1); logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1)); gamma = exp(logLikelihood - logLikelihoodSum)'; % Compute Baum-Welch statistics n = sum(gamma,1); f = Y * gamma - n.*(ubmMu); %i-vector Extraction ivector = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:); % Intersession Compensation ivector = projectionMatrix*ivector; ivectorMat(:,fileIdx) = ivector; end % i-vector model enrolledSpeakersByIdx{speakerIdx} = mean(ivectorMat,2); end fprintf('Speakers enrolled (%0.0f seconds).\n',toc)

Speakers enrolled (7 seconds).

For bookkeeping purposes, convert the cell array of i-vectors to a structure, with the speaker IDs as fields and the i-vectors as values

enrolledSpeakers = struct; for s = 1:numSpeakers enrolledSpeakers.(string(speakers(s))) = enrolledSpeakersByIdx{s}; end

The speaker false rejection rate (FRR) is the rate that a given speaker is incorrectly rejected. Use the verification speaker data set to determine the speaker FRR for a set of thresholds.

speakersToTest = unique(adsVerify.Labels); numSpeakers = numel(speakersToTest); thresholdsToTest = -1:0.0001:1; cssFRR = cell(numSpeakers,1); tic parfor speakerIdx = 1:numSpeakers adsPart = subset(adsVerify,adsVerify.Labels==speakersToTest(speakerIdx)); numFiles = numel(adsPart.Files); ivectorToTest = enrolledSpeakers.(string(speakersToTest(speakerIdx))); %#ok<PFBNS> css = zeros(numFiles,1); for fileIdx = 1:numFiles audioData = read(adsPart); % Extract features Y = helperFeatureExtraction(audioData,afe,normFactors); % Compute a posteriori log-likelihood logLikelihood = helperGMMLogLikelihood(Y,ubm); % Compute a posteriori normalized probability amax = max(logLikelihood,[],1); logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1)); gamma = exp(logLikelihood - logLikelihoodSum)'; % Compute Baum-Welch statistics n = sum(gamma,1); f = Y * gamma - n.*(ubmMu); % Extract i-vector ivector = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:); % Intersession Compensation ivector = projectionMatrix*ivector; % Cosine Similarity Score css(fileIdx) = dot(ivectorToTest,ivector)/(norm(ivector)*norm(ivectorToTest)); end cssFRR{speakerIdx} = css; end cssFRR = cat(1,cssFRR{:}); FRR = mean(cssFRR<thresholdsToTest); fprintf('FRR calculated (%0.0f seconds).\n',toc)

FRR calculated (65 seconds).

Plot the FRR as a function of the threshold.

figure plot(thresholdsToTest,FRR) title('False Rejection Rate (FRR)') xlabel('Threshold') ylabel('FRR') grid on axis([thresholdsToTest(find(FRR~=0,1)) thresholdsToTest(find(FRR==1,1)) 0 1])

The speaker false acceptance rate (FAR) is the rate that utterances not belonging to an enrolled speaker are incorrectly accepted as belonging to the enrolled speaker. Use the known speaker set to determine the speaker FAR for a set of thresholds.

speakersToTest = unique(adsVerify.Labels); numSpeakers = numel(speakersToTest); cssFAR = cell(numSpeakers,1); totalFiles = 0; tic parfor speakerIdx = 1:numSpeakers adsPart = subset(adsVerify,adsVerify.Labels~=speakersToTest(speakerIdx)); numFiles = numel(adsPart.Files); ivectorToTest = enrolledSpeakers.(string(speakersToTest(speakerIdx))); %#ok<PFBNS> css = zeros(numFiles,1); for fileIdx = 1:numFiles audioData = read(adsPart); % Extract features Y = helperFeatureExtraction(audioData,afe,normFactors); % Compute a posteriori log-likelihood logLikelihood = helperGMMLogLikelihood(Y,ubm); % Compute a posteriori normalized probability amax = max(logLikelihood,[],1); logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1)); gamma = exp(logLikelihood - logLikelihoodSum)'; % Compute Baum-Welch statistics n = sum(gamma,1); f = Y * gamma - n.*(ubmMu); % Extract i-vector ivector = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:); % Intersession compensation ivector = projectionMatrix * ivector; % Cosine similarity score css(fileIdx) = dot(ivectorToTest,ivector)/(norm(ivector)*norm(ivectorToTest)); end cssFAR{speakerIdx} = css; end cssFAR = cat(1,cssFAR{:}); FAR = mean(cssFAR>thresholdsToTest); fprintf('FAR calculated (%0.0f seconds).\n',toc);

FAR calculated (190 seconds).

Plot the FAR as a function of the threshold.

figure plot(thresholdsToTest,FAR) title('False Acceptance Rate (FAR)') xlabel('Threshold') ylabel('FAR') grid on axis([thresholdsToTest(find(FAR~=1,1)) thresholdsToTest(find(FAR==0,1)) 0 1])

As you move the threshold in a speaker verification system, you trade off between FAR and FRR. This is referred to as the detection error tradeoff and is commonly reported in binary classification problems.

figure plot(FAR,FRR) grid on xlabel('False Acceptance Rate (FAR)') ylabel('False Rejection Rate (FRR)') title('Detection Error Tradeoff (DET) Curve') axis([0 0.5 0 0.5])

To compare multiple systems, you need a single metric that combines the FAR and FRR performance. For this, you determine the equal error rate (EER), which is the threshold where the FAR and FRR curves meet. In practice, the EER threshold might not be the best choice. For example, if speaker verification is used as part of a multi-authentication approach for wire transfers, FAR would most likely be more heavily weighted than FRR.

[~,EERThresholdIdx] = min(abs(FAR - FRR)); EERThreshold = thresholdsToTest(EERThresholdIdx); EER = mean([FAR(EERThresholdIdx),FRR(EERThresholdIdx)]); figure plot(thresholdsToTest,FAR,'k', ... thresholdsToTest,FRR,'b', ... EERThreshold,EER,'ro','MarkerFaceColor','r') title(sprintf('Equal Error Rate = %0.4f, Threshold = %0.4f',EER,EERThreshold)) xlabel('Threshold') ylabel('Error Rate') legend('False Acceptance Rate (FAR)','False Rejection Rate (FRR)','Equal Error Rate (EER)','Location','southwest') grid on axis([thresholdsToTest(find(FAR~=1,1)) thresholdsToTest(find(FRR==1,1)) 0 1])

function [features,numFrames] = helperFeatureExtraction(audioData,afe,normFactors) % Input: % audioData - column vector of audio data % afe - audioFeatureExtractor object % normFactors - mean and standard deviation of the features used for normalization. % If normFactors is empty, no normalization is applied. % % Output % features - matrix of features extracted % numFrames - number of frames (feature vectors) returned % Normalize audioData = audioData/max(abs(audioData(:))); % Protect against NaNs audioData(isnan(audioData)) = 0; % Isolate speech segment idx = detectSpeech(audioData,afe.SampleRate); features = []; for ii = 1:size(idx,1) f = extract(afe,audioData(idx(ii,1):idx(ii,2))); features = [features;f]; %#ok<AGROW> end % Feature normalization if ~isempty(normFactors) features = (features-normFactors.Mean')./normFactors.STD'; end features = features'; % Cepstral mean subtraction (for channel noise) if ~isempty(normFactors) features = features - mean(features,'all'); end numFrames = size(features,2); end

function L = helperGMMLogLikelihood(x,gmm) xMinusMu = repmat(x,1,1,numel(gmm.ComponentProportion)) - permute(gmm.mu,[1,3,2]); permuteSigma = permute(gmm.sigma,[1,3,2]); Lunweighted = -0.5*(sum(log(permuteSigma),1) + sum(xMinusMu.*(xMinusMu./permuteSigma),1) + size(gmm.mu,1)*log(2*pi)); temp = squeeze(permute(Lunweighted,[1,3,2])); if size(temp,1)==1 % If there is only one frame, the trailing singleton dimension was % removed in the permute. This accounts for that edge case. temp = temp'; end L = temp + log(gmm.ComponentProportion)'; end

[1] Reynolds, Douglas A., et al. “Speaker Verification Using Adapted Gaussian Mixture Models.” *Digital Signal Processing*, vol. 10, no. 1–3, Jan. 2000, pp. 19–41. *DOI.org (Crossref)*, doi:10.1006/dspr.1999.0361.

[2] Kenny, Patrick, et al. “Joint Factor Analysis Versus Eigenchannels in Speaker Recognition.” *IEEE Transactions on Audio, Speech and Language Processing*, vol. 15, no. 4, May 2007, pp. 1435–47. *DOI.org (Crossref)*, doi:10.1109/TASL.2006.881693.

[3] Kenny, P., et al. “A Study of Interspeaker Variability in Speaker Verification.” *IEEE Transactions on Audio, Speech, and Language Processing*, vol. 16, no. 5, July 2008, pp. 980–88. *DOI.org (Crossref)*, doi:10.1109/TASL.2008.925147.

[4] Dehak, Najim, et al. “Front-End Factor Analysis for Speaker Verification.” *IEEE Transactions on Audio, Speech, and Language Processing*, vol. 19, no. 4, May 2011, pp. 788–98. *DOI.org (Crossref)*, doi:10.1109/TASL.2010.2064307.

[5] Matějka, P., et al. "Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification," *2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, Prague, 2011, pp. 4828-4831.

[6] Snyder, David, et al. “X-Vectors: Robust DNN Embeddings for Speaker Recognition.” *2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, IEEE, 2018, pp. 5329–33. *DOI.org (Crossref)*, doi:10.1109/ICASSP.2018.8461375.

[7] Signal Processing and Speech Communication Laboratory. Accessed December 12, 2019. https://www.spsc.tugraz.at/databases-and-tools/ptdb-tug-pitch-tracking-database-from-graz-university-of-technology.html.

[8] Variani, Ehsan, et al. “Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification.” *2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, IEEE, 2014, pp. 4052–56. *DOI.org (Crossref)*, doi:10.1109/ICASSP.2014.6854363.

[9] Dehak, Najim, Réda Dehak, James R. Glass, Douglas A. Reynolds and Patrick Kenny. “Cosine Similarity Scoring without Score Normalization Techniques.” *Odyssey* (2010).

[10] Verma, Pulkit, and Pradip K. Das. “I-Vectors in Speech Processing Applications: A Survey.” *International Journal of Speech Technology*, vol. 18, no. 4, Dec. 2015, pp. 529–46. *DOI.org (Crossref)*, doi:10.1007/s10772-015-9295-3.