Define Text Encoder Model Function

This example shows how to define a text encoder model function.

In the context of deep learning, an encoder is the part of a deep learning network that maps the input to vectors in some latent space. You can use these vectors for various tasks, for example:

Classification by applying a softmax operation to the encoded data and using cross entropy loss.
Sequence-to-sequence translation by using the encoded vector as a context vector.
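
As a rough illustration of the first task, the following sketch applies a softmax operation to a small matrix of encoded scores and computes the cross-entropy loss using only base MATLAB operations. The values of Z and T are made up for illustration. In a custom training loop you would typically apply the softmax and crossentropy functions to dlarray data instead.

```matlab
% Hypothetical encoded scores: 3 classes (rows) by 2 observations (columns).
Z = [2.0 0.5; 1.0 0.5; 0.1 3.0];

% One-hot targets: observation 1 is class 1, observation 2 is class 3.
T = [1 0; 0 0; 0 1];

% Softmax over the class dimension (subtract the maximum for numerical stability).
expZ = exp(Z - max(Z,[],1));
Y = expZ ./ sum(expZ,1);

% Cross-entropy loss averaged over the observations.
loss = -mean(sum(T .* log(Y),1));
```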

Load Data

The file sonnets.txt contains all of Shakespeare's sonnets in a single text file.

Read the sonnets data from the file "sonnets.txt".

filename = "sonnets.txt";
textData = fileread(filename);

The sonnets are indented by two whitespace characters. Remove the indentations using the replace function and split the text into separate lines using the split function. Then remove the header, which occupies the first nine elements, and the short sonnet titles, which have fewer than five characters.

textData = replace(textData,"  ","");
textData = split(textData,newline);
textData(1:9) = [];
textData(strlength(textData)<5) = [];
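
To see what these steps do, the following sketch applies them to a short, made-up excerpt laid out like the sonnets file. The excerpt and the variable names sampleText and sampleLines are assumptions for illustration only.

```matlab
% Hypothetical excerpt in the same layout as the sonnets file.
sampleText = "  THE SONNETS" + newline + "  I" + newline + ...
    "  From fairest creatures we desire increase,";

% Remove the two-space indentation and split the text into lines.
sampleLines = replace(sampleText,"  ","");
sampleLines = split(sampleLines,newline);

% Remove short lines such as the sonnet title "I".
sampleLines(strlength(sampleLines)<5) = [];
```

Note that replace removes every pair of adjacent spaces, not only the leading indentation.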

Prepare Data

Create a function that tokenizes and preprocesses the text data. The function preprocessText, listed at the end of the example, performs these steps:

Prepends and appends the specified start and stop tokens to each input string, respectively.
Tokenizes the text using tokenizedDocument.

Preprocess the text data and specify the start and stop tokens "<start>" and "<stop>", respectively.

startToken = "<start>";
stopToken = "<stop>";
documents = preprocessText(textData,startToken,stopToken);

Create a word encoding object from the tokenized documents.

enc = wordEncoding(documents);

When training a deep learning model, the input data must be a numeric array containing sequences of a fixed length. Because the documents have different lengths, you must pad the shorter sequences with a padding value.
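
The idea of right-padding can be sketched in base MATLAB as follows. The sequences and the padding value 10 are made up for illustration. The example itself performs this step with doc2sequence later on.

```matlab
% Hypothetical sequences of word indices with different lengths.
sequences = {[1 5 2], [1 7 8 9 2], [1 2]};
paddingValue = 10;   % assumed index that no real word uses

% Allocate a matrix filled with the padding value, then copy each
% sequence into the left part of its row (right-padding).
maxLength = max(cellfun(@numel,sequences));
padded = paddingValue*ones(numel(sequences),maxLength);
for n = 1:numel(sequences)
    seq = sequences{n};
    padded(n,1:numel(seq)) = seq;
end
```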

Recreate the word encoding to also include a padding token and determine the index of that token.

paddingToken = "<pad>";
newVocabulary = [enc.Vocabulary paddingToken];
enc = wordEncoding(newVocabulary);
paddingIdx = word2ind(enc,paddingToken)

paddingIdx = 
3595
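
Because the padding token is appended to the end of the vocabulary, its index equals the number of words in the new vocabulary. The following sketch shows the same effect with a small, hypothetical vocabulary stored as a plain string array instead of a wordEncoding object.

```matlab
% Hypothetical vocabulary; the real one comes from enc.Vocabulary.
vocabulary = ["<start>" "<stop>" "from" "fairest" "creatures"];
paddingToken = "<pad>";

% Append the padding token and look up its position.
newVocabulary = [vocabulary paddingToken];
paddingIdx = find(newVocabulary == paddingToken);
```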

Initialize Model Parameters

The goal of the encoder is to map sequences of word indices to vectors in some latent space.

Initialize the parameters for the encoder model, which applies three operations in sequence:

The embedding maps word indices in the range 1 through vocabularySize to vectors of dimension embeddingDimension, where vocabularySize is the number of words in the encoding vocabulary and embeddingDimension is the number of components learned by the embedding.
The LSTM operation takes as input sequences of word vectors and outputs 1-by-numHiddenUnits vectors, where numHiddenUnits is the number of hidden units in the LSTM operation.
The fully connected operation multiplies the input by a weight matrix, adds a bias, and outputs vectors of size latentDimension, where latentDimension is the dimension of the latent space.
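
Conceptually, the embedding is a table lookup: each word index selects one column of the weight matrix. The following base MATLAB sketch mimics that lookup with plain indexing. The sizes and values are hypothetical, and the example itself uses the embed function on dlarray data.

```matlab
% Hypothetical sizes, much smaller than in the example.
embeddingDimension = 4;
vocabularySize = 6;

% Weight matrix with one column per word in the vocabulary.
weights = reshape(1:embeddingDimension*vocabularySize, ...
    embeddingDimension,vocabularySize);

% A sequence of three word indices; index 5 appears twice.
wordIndices = [2 5 5];

% Look up one embedding vector (column) per word index.
wordVectors = weights(:,wordIndices);
```

Repeated word indices map to identical vectors, so the second and third columns of wordVectors are equal.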

Specify the dimensions of the parameters.

embeddingDimension = 100;
numHiddenUnits = 150;
latentDimension = 50;
vocabularySize = enc.NumWords;

Create a struct for the parameters.

parameters = struct;

Initialize the weights of the embedding with the Gaussian initializer using the initializeGaussian function, which is attached to this example as a supporting file. Specify a mean of 0 and a standard deviation of 0.01. To learn more, see Gaussian Initialization (Deep Learning Toolbox).

mu = 0;
sigma = 0.01;
parameters.emb.Weights = initializeGaussian([embeddingDimension vocabularySize],mu,sigma);
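
The supporting file is not listed in this example. As a minimal sketch of what a Gaussian initializer does, the following code samples weights from a normal distribution with the given mean and standard deviation. The name gaussianInit is a hypothetical stand-in, not the attached function, and a real initializer for a custom training loop would also convert the result to a dlarray.

```matlab
% Hypothetical stand-in for a Gaussian initializer.
gaussianInit = @(sz,mu,sigma) randn(sz)*sigma + mu;

rng(0)   % fix the random seed so the sketch is repeatable
weights = gaussianInit([100 3595],0,0.01);

% The sample statistics should be close to the requested values.
sampleMean = mean(weights(:));
sampleStd = std(weights(:));
```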

Initialize the learnable parameters for the encoder LSTM operation:

Initialize the input weights with the Glorot initializer using the initializeGlorot function which is attached to this example as a supporting file. To learn more, see Glorot Initialization (Deep Learning Toolbox).
Initialize the recurrent weights with the orthogonal initializer using the initializeOrthogonal function which is attached to this example as a supporting file. To learn more, see Orthogonal Initialization (Deep Learning Toolbox).
Initialize the bias with the unit forget gate initializer using the initializeUnitForgetGate function which is attached to this example as a supporting file. To learn more, see Unit Forget Gate Initialization (Deep Learning Toolbox).

The sizes of the learnable parameters depend on the size of the input. Because the inputs to the LSTM operation are sequences of word vectors from the embedding operation, the number of input channels is embeddingDimension.

The input weight matrix has size 4*numHiddenUnits-by-inputSize, where inputSize is the dimension of the input data.
The recurrent weight matrix has size 4*numHiddenUnits-by-numHiddenUnits.
The bias vector has size 4*numHiddenUnits-by-1.

sz = [4*numHiddenUnits embeddingDimension];
numOut = 4*numHiddenUnits;
numIn = embeddingDimension;

parameters.lstmEncoder.InputWeights = initializeGlorot(sz,numOut,numIn);
parameters.lstmEncoder.RecurrentWeights = initializeOrthogonal([4*numHiddenUnits numHiddenUnits]);
parameters.lstmEncoder.Bias = initializeUnitForgetGate(numHiddenUnits);
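
The three supporting functions are not listed in this example. The following sketch shows, under stated assumptions, what such initializers commonly compute: Glorot uniform sampling, an orthogonal matrix from a QR decomposition, and a bias that is one only in the forget gate block. The bias layout assumes the gate order input, forget, cell candidate, output, and the real supporting functions may differ in detail (for example, by returning dlarray objects).

```matlab
numHiddenUnits = 150;
embeddingDimension = 100;

% Glorot (Xavier) uniform initialization for the input weights:
% sample uniformly from [-bound, bound].
sz = [4*numHiddenUnits embeddingDimension];
numOut = 4*numHiddenUnits;
numIn = embeddingDimension;
bound = sqrt(6/(numIn + numOut));
inputWeights = bound*(2*rand(sz) - 1);

% Orthogonal initialization for the recurrent weights: take the Q factor
% of a random Gaussian matrix and fix the column signs using diag(R).
A = randn([4*numHiddenUnits numHiddenUnits]);
[Q,R] = qr(A,0);
recurrentWeights = Q .* sign(diag(R))';

% Unit forget gate initialization for the bias: zeros everywhere except
% the forget gate block (assumed to be the second block of rows).
bias = zeros(4*numHiddenUnits,1);
bias(numHiddenUnits+1:2*numHiddenUnits) = 1;
```

The columns of recurrentWeights are orthonormal, so recurrentWeights'*recurrentWeights is the identity matrix up to rounding error.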

Initialize the learnable parameters for the encoder fully connected operation:

Initialize the weights with the Glorot initializer.
Initialize the bias with zeros using the initializeZeros function which is attached to this example as a supporting file. To learn more, see Zeros Initialization (Deep Learning Toolbox).

The sizes of the learnable parameters depend on the size of the input. Because the inputs to the fully connected operation are the outputs of the LSTM operation, the number of input channels is numHiddenUnits. To make the fully connected operation output vectors with size latentDimension, specify an output size of latentDimension.

The weights matrix has size outputSize-by-inputSize, where outputSize and inputSize correspond to the output and input dimensions, respectively.
The bias vector has size outputSize-by-1.

sz = [latentDimension numHiddenUnits];
numOut = latentDimension;
numIn = numHiddenUnits;

parameters.fcEncoder.Weights = initializeGlorot(sz,numOut,numIn);
parameters.fcEncoder.Bias = initializeZeros([latentDimension 1]);
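
To check that these sizes fit together, the following base MATLAB sketch applies a fully connected operation by hand as a matrix multiplication plus a bias. The random inputs and the mini-batch size are hypothetical, and the example itself uses the fullyconnect function on dlarray data.

```matlab
latentDimension = 50;
numHiddenUnits = 150;
miniBatchSize = 32;   % hypothetical number of observations

% Weights of size outputSize-by-inputSize and bias of size outputSize-by-1.
fcWeights = 0.01*randn(latentDimension,numHiddenUnits);
fcBias = zeros(latentDimension,1);

% Hypothetical LSTM outputs, one column per observation.
lstmOutput = randn(numHiddenUnits,miniBatchSize);

% Fully connected operation: the bias is added to every column.
fcOutput = fcWeights*lstmOutput + fcBias;
```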

Define Model Encoder Function

Create the function modelEncoder, listed in the Encoder Model Function section of the example, that computes the output of the encoder model. The modelEncoder function takes as input the model parameters, sequences of word indices, and the sequence lengths, and returns the corresponding latent feature vector.

Prepare Mini-Batch of Data

To train the model using a custom training loop, you must iterate over mini-batches of data and convert each mini-batch into the format required for the encoder model and the model loss functions. This section of the example illustrates the steps needed for preparing a mini-batch of data inside the custom training loop.

Prepare an example mini-batch of data. Select a mini-batch of 32 documents from documents. This represents the mini-batch of data used in an iteration of a custom training loop.

miniBatchSize = 32;
idx = 1:miniBatchSize;
documentsBatch = documents(idx);

Convert the documents to sequences using the doc2sequence function and specify to right-pad the sequences with the word index corresponding to the padding token.

X = doc2sequence(enc,documentsBatch, ...
    PaddingDirection="right", ...
    PaddingValue=paddingIdx);

The output of the doc2sequence function is a cell array, where each element is a row vector of word indices. Because the encoder model function requires numeric input, concatenate the rows of the data using the cat function and specify to concatenate along the first dimension. The output has size miniBatchSize-by-sequenceLength, where sequenceLength is the length of the longest sequence in the mini-batch.

X = cat(1,X{:});
size(X)

ans = 1×2

    32    14

Convert the data to a dlarray with format "BTC" (batch, time, channel). The software automatically rearranges the dimensions to have format "CBT" (channel, batch, time), so the output has size 1-by-miniBatchSize-by-sequenceLength.

X = dlarray(X,"BTC");
size(X)

ans = 1×3

    1    32    14
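
The size change can be reproduced on a plain numeric array. In the following sketch, the matrix Xnumeric is a hypothetical batch-by-time array with an implicit singleton channel dimension, and permute moves that channel dimension to the front, followed by batch and time.

```matlab
% Hypothetical batch-by-time matrix of word indices (32 documents, 14 steps).
Xnumeric = ones(32,14);

% Reorder (batch, time, channel) to (channel, batch, time).
Xpermuted = permute(Xnumeric,[3 1 2]);
szPermuted = size(Xpermuted);
```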

For masking, calculate the unpadded sequence lengths of the input data using the doclength function with the mini-batch of documents as input.

sequenceLengths = doclength(documentsBatch);
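
If only the padded numeric data is available, the same lengths can be recovered by counting the entries that differ from the padding index. The following sketch assumes that the padding index never occurs as a real word index, which holds here because the padding token is a separate vocabulary entry. The data and the padding index 10 are made up.

```matlab
% Hypothetical right-padded mini-batch and padding index.
paddingIdx = 10;
Xpadded = [1 5 2 10 10; 1 7 8 9 2; 1 2 10 10 10];

% Count the non-padding entries in each row.
unpaddedLengths = sum(Xpadded ~= paddingIdx,2);
```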

This code snippet shows an example of preparing a mini-batch in a custom training loop.

iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch

        iteration = iteration + 1;

        % Read mini-batch.
        idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
        documentsBatch = documents(idx);

        % Convert to sequences.
        X = doc2sequence(enc,documentsBatch, ...
            PaddingDirection="right", ...
            PaddingValue=paddingIdx);
        X = cat(1,X{:});

        % Convert to dlarray.
        X = dlarray(X,"BTC");

        % Calculate sequence lengths.
        sequenceLengths = doclength(documentsBatch);

        % Evaluate model gradients.
        % ...

        % Update learnable parameters.
        % ...
    end
end
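
The loop above assumes that numEpochs and numIterationsPerEpoch are already defined. One common choice, which is an assumption here and not part of the example, is to discard the final partial mini-batch. The following sketch shows the resulting index ranges for a hypothetical data set of 100 observations.

```matlab
numObservations = 100;   % hypothetical data set size
miniBatchSize = 32;

% Discard the final partial mini-batch.
numIterationsPerEpoch = floor(numObservations/miniBatchSize);

% Record the first and last observation index of each mini-batch.
firstIdx = zeros(1,numIterationsPerEpoch);
lastIdx = zeros(1,numIterationsPerEpoch);
for i = 1:numIterationsPerEpoch
    idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
    firstIdx(i) = idx(1);
    lastIdx(i) = idx(end);
end
```

With these values, the last four observations are not used in the epoch.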

Use Model Function in Model Loss Function

When training a deep learning model with a custom training loop, you must calculate the loss and the gradients of the loss with respect to the learnable parameters. This calculation depends on the output of a forward pass of the model function.

To perform a forward pass of the encoder, use the modelEncoder function directly with the parameters, data, and sequence lengths as input. The output is a latentDimension-by-miniBatchSize matrix.

Z = modelEncoder(parameters,X,sequenceLengths);
size(Z)

ans = 1×2

    50    32

This code snippet shows an example of using the model encoder function inside a model loss function.

function [loss,gradients] = modelLoss(parameters,X,sequenceLengths)
    
    Z = modelEncoder(parameters,X,sequenceLengths);

    % Calculate loss.
    % ...

    % Calculate gradients.
    % ...

end

This code snippet shows an example of evaluating the model gradients in a custom training loop.

iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;

        % Prepare mini-batch.
        % ...

        % Evaluate model gradients.
        [loss,gradients] = dlfeval(@modelLoss, parameters, X, sequenceLengths);

        % Update learnable parameters.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg,trailingAvgSq,iteration);
    end
end
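
The loop above assumes that trailingAvg and trailingAvgSq exist before the first iteration. With adamupdate, they are typically initialized as empty arrays. To make the update step less opaque, the following base MATLAB sketch performs one step of the standard Adam rule with bias correction on a single scalar parameter. The hyperparameter values are the commonly used Adam defaults and, like the parameter and gradient values, are assumptions for illustration.

```matlab
% Assumed hyperparameters (commonly used Adam defaults).
learnRate = 0.001;
gradDecay = 0.9;
sqGradDecay = 0.999;
epsilon = 1e-8;

% Hypothetical scalar parameter and gradient at the first iteration.
theta = 0.5;
grad = 2;
avgGrad = 0;
avgSqGrad = 0;
iteration = 1;

% Update the moving averages of the gradient and the squared gradient.
avgGrad = gradDecay*avgGrad + (1 - gradDecay)*grad;
avgSqGrad = sqGradDecay*avgSqGrad + (1 - sqGradDecay)*grad^2;

% Correct the bias of the moving averages, then update the parameter.
avgGradHat = avgGrad/(1 - gradDecay^iteration);
avgSqGradHat = avgSqGrad/(1 - sqGradDecay^iteration);
thetaNew = theta - learnRate*avgGradHat/(sqrt(avgSqGradHat) + epsilon);
```

In the first iteration, the bias-corrected averages equal the gradient and its square, so the parameter moves by approximately learnRate against the sign of the gradient.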

Encoder Model Function

The modelEncoder function takes as input the model parameters, sequences of word indices, and the sequence lengths, and returns the corresponding latent feature vector.

Because the input data contains padded sequences of different lengths, the padding can have adverse effects on loss calculations. For the LSTM operation, instead of returning the output of the last time step of the sequence (which likely corresponds to the LSTM state after processing many padding values), return the output at the actual last time step of each sequence, as given by the sequenceLengths input.

function Z = modelEncoder(parameters,X,sequenceLengths)

% Embedding.
weights = parameters.emb.Weights;
Z = embed(X,weights);

% LSTM.
inputWeights = parameters.lstmEncoder.InputWeights;
recurrentWeights = parameters.lstmEncoder.RecurrentWeights;
bias = parameters.lstmEncoder.Bias;

numHiddenUnits = size(recurrentWeights,2);
hiddenState = zeros(numHiddenUnits,1,"like",X);
cellState = zeros(numHiddenUnits,1,"like",X);

Z1 = lstm(Z,hiddenState,cellState,inputWeights,recurrentWeights,bias);

% Output mode "last" with masking.
miniBatchSize = size(Z1,2);
Z = zeros(numHiddenUnits,miniBatchSize,"like",Z1);
Z = dlarray(Z,"CB");

for n = 1:miniBatchSize
    t = sequenceLengths(n);
    Z(:,n) = Z1(:,n,t);
end

% Fully connect.
weights = parameters.fcEncoder.Weights;
bias = parameters.fcEncoder.Bias;
Z = fullyconnect(Z,weights,bias);

end
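
The masking loop in modelEncoder can be tried in isolation on a plain numeric array. In the following sketch, lstmOutput is a hypothetical channel-by-batch-by-time array standing in for the LSTM output, filled with consecutive integers so that the selected entries are easy to verify.

```matlab
numHiddenUnits = 2;
miniBatchSize = 3;
sequenceLength = 4;

% Hypothetical LSTM output of size channel-by-batch-by-time.
lstmOutput = reshape(1:numHiddenUnits*miniBatchSize*sequenceLength, ...
    numHiddenUnits,miniBatchSize,sequenceLength);

% Unpadded length of each sequence in the mini-batch.
sequenceLengths = [4 2 3];

% For each observation, keep the output at its last unpadded time step.
lastOutput = zeros(numHiddenUnits,miniBatchSize);
for n = 1:miniBatchSize
    t = sequenceLengths(n);
    lastOutput(:,n) = lstmOutput(:,n,t);
end
```

Only the first observation uses the final time step. The other two columns come from time steps 2 and 3, before the padding begins.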

Preprocessing Function

The function preprocessText performs these steps:

Prepends and appends the specified start and stop tokens to each input string, respectively.
Tokenizes the text using tokenizedDocument.

function documents = preprocessText(textData,startToken,stopToken)

% Add start and stop tokens.
textData = startToken + textData + stopToken;

% Tokenize the text.
documents = tokenizedDocument(textData,'CustomTokens',[startToken stopToken]);

end
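
The first step of preprocessText relies on the plus operator for string arrays, which joins a scalar string to every element of an array. The following sketch shows the result for two made-up lines of text. The tokens are joined without surrounding spaces, which is why the function passes them to tokenizedDocument as custom tokens.

```matlab
% Hypothetical lines of text.
sampleLines = ["From fairest creatures"; "That thereby"];

% Add the start and stop tokens to every element.
wrapped = "<start>" + sampleLines + "<stop>";
```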