
Pride and Prejudice and MATLAB

This example shows how to train a deep learning LSTM network to generate text using character embeddings.

To train a deep learning network for text generation, train a sequence-to-sequence LSTM network to predict the next character in a sequence of characters. To train the network to predict the next character, specify the responses to be the input sequences shifted by one time step.

To use character embeddings, convert each training observation to a sequence of integers, where the integers index into a vocabulary of characters. Include a word embedding layer in the network, which learns an embedding of the characters and maps the integers to vectors.
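
For example, here is a small illustration of this preprocessing (an illustration only; the full preprocessing code appears later in this example). The placeholder "E" stands in for the end-of-text character introduced below.

characters = 'cat';
X = double(characters)                                % predictors: 99 97 116
Y = categorical(cellstr([characters(2:end) 'E']')')   % responses: a t E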

Load Training Data

Read the HTML code from The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen and parse it using webread and htmlTree.

url = "https://www.gutenberg.org/files/1342/1342-h/1342-h.htm";
code = webread(url);
tree = htmlTree(code);

Extract the paragraphs by finding the p elements. Specify to ignore paragraph elements with class "toc" using the CSS selector ':not(.toc)'.

paragraphs = findElement(tree,'p:not(.toc)');

Extract the text data from the paragraphs using extractHTMLText and remove the empty strings.

textData = extractHTMLText(paragraphs);
textData(textData == "") = [];

Remove strings shorter than 20 characters.

idx = strlength(textData) < 20;
textData(idx) = [];

Visualize the text data in a word cloud.

figure
wordcloud(textData);
title("Pride and Prejudice")

Convert Text Data to Sequences

Convert the text data to sequences of character indices for the predictors and categorical sequences for the responses.

The categorical function treats newline and whitespace entries as undefined. To create categorical elements for these characters, replace them with the special characters "¶" (pilcrow, "\x00B6") and "·" (middle dot, "\x00B7"), respectively. To prevent ambiguity, choose special characters that do not appear in the text. The pilcrow and middle dot characters do not appear in the training data, so they can be used for this purpose.

newlineCharacter = compose("\x00B6");
whitespaceCharacter = compose("\x00B7");
textData = replace(textData,[newline " "],[newlineCharacter whitespaceCharacter]);
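
To see why the substitution is needed, you can check that categorical treats whitespace-only entries as undefined (a quick check; it is not part of the preprocessing):

categorical(["a" " " "b"])   % returns: a <undefined> b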

Loop over the text data and create a sequence of character indices representing the characters of each observation and a categorical sequence of characters for the responses. To denote the end of each observation, include the special character "␃" (end of text, "\x2403").

endOfTextCharacter = compose("\x2403");
numDocuments = numel(textData);
XTrain = cell(1,numDocuments);
YTrain = cell(1,numDocuments);
for i = 1:numDocuments
    characters = textData{i};
    X = double(characters);
    
    % Create vector of categorical responses with end of text character.
    charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter];
    Y = categorical(charactersShifted);
    
    XTrain{i} = X;
    YTrain{i} = Y;
end
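
Optionally, check that the predictors and responses line up. For the first observation, the first response should equal the second character of the text (a sanity check only; the variable names follow this example):

characters = textData{1};
isequal(string(YTrain{1}(1)),string(characters(2)))   % returns logical 1 (true)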

During training, by default, the software splits the training data into mini-batches and pads the sequences so that they have the same length. Too much padding can have a negative impact on the network performance.

To prevent the training process from adding too much padding, you can sort the training data by sequence length, and choose a mini-batch size so that sequences in a mini-batch have a similar length.

Get the sequence lengths for each observation.

numObservations = numel(XTrain);
sequenceLengths = zeros(1,numObservations);
for i = 1:numObservations
    sequence = XTrain{i};
    sequenceLengths(i) = size(sequence,2);
end

Sort the data by sequence length.

[~,idx] = sort(sequenceLengths);
XTrain = XTrain(idx);
YTrain = YTrain(idx);
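
To see how the lengths vary, and how much padding the sorting avoids within each mini-batch, you can plot the sorted sequence lengths in a bar chart. This visualization is optional:

figure
bar(sequenceLengths(idx))
xlabel("Sequence")
ylabel("Length")
title("Sorted Sequence Lengths")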

Create and Train LSTM Network

Define the LSTM architecture. Specify a sequence-to-sequence LSTM classification network with 400 hidden units. Set the input size to be the feature dimension of the training data. For sequences of character indices, the feature dimension is 1. Specify a word embedding layer with dimension 200 and specify the number of words (which correspond to characters) to be the highest character value in the input data. Set the output size of the fully connected layer to be the number of categories in the responses. To help prevent overfitting, include a dropout layer after the LSTM layer.

The word embedding layer learns an embedding of characters and maps each character to a 200-dimensional vector.

inputSize = size(XTrain{1},1);
numClasses = numel(categories([YTrain{:}]));
numCharacters = max([textData{:}]);

layers = [
    sequenceInputLayer(inputSize)
    wordEmbeddingLayer(200,numCharacters)
    lstmLayer(400,'OutputMode','sequence')
    dropoutLayer(0.2)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

Specify the training options. Specify to train with a mini-batch size of 32 and initial learn rate 0.01. To prevent the gradients from exploding, set the gradient threshold to 1. To ensure the data remains sorted, set 'Shuffle' to 'never'. To monitor the training progress, set the 'Plots' option to 'training-progress'. To suppress verbose output, set 'Verbose' to false.

options = trainingOptions('adam', ...
    'MiniBatchSize',32, ...
    'InitialLearnRate',0.01, ...
    'GradientThreshold',1, ...
    'Shuffle','never', ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the network.

net = trainNetwork(XTrain,YTrain,layers,options);

Generate New Text

Generate the first character of the text by sampling a character from a probability distribution according to the first characters of the text in the training data. Generate the remaining characters by using the trained LSTM network to predict the next character from the current sequence of generated text. Keep generating characters one by one until the network predicts the end-of-text character.

Sample the first character according to the distribution of the first characters in the training data.

initialCharacters = extractBefore(textData,2);
firstCharacter = datasample(initialCharacters,1);
generatedText = firstCharacter;

Convert the first character to a numeric index.

X = double(char(firstCharacter));

For the remaining predictions, sample the next character according to the prediction scores of the network. The prediction scores represent the probability distribution of the next character. Sample the characters from the vocabulary given by the class names of the classification layer of the network.

vocabulary = string(net.Layers(end).ClassNames);

Make predictions character by character using predictAndUpdateState. For each prediction, input the index of the previous character. Stop predicting when the network predicts the end-of-text character or when the generated text is 500 characters long. Predictions on the GPU are usually faster only for large collections of data, long sequences, or large networks; for single time step predictions like these, the CPU is usually faster. To use the CPU for prediction, set the 'ExecutionEnvironment' option of predictAndUpdateState to 'cpu'.

maxLength = 500;
while strlength(generatedText) < maxLength
    % Predict the next character scores.
    [net,characterScores] = predictAndUpdateState(net,X,'ExecutionEnvironment','cpu');
    
    % Sample the next character.
    newCharacter = datasample(vocabulary,1,'Weights',characterScores);
    
    % Stop predicting at the end of text.
    if newCharacter == endOfTextCharacter
        break
    end
    
    % Add the character to the generated text.
    generatedText = generatedText + newCharacter;
    
    % Get the numeric index of the character.
    X = double(char(newCharacter));
end

Reconstruct the generated text by replacing the special characters with their corresponding whitespace and newline characters.

generatedText = replace(generatedText,[newlineCharacter whitespaceCharacter],[newline " "])
generatedText = 
"“I wish Mr. Darcy, upon latter of my sort sincerely fixed in the regard to relanth. We were to join on the Lucases. They are married with him way Sir Wickham, for the possibility which this two od since to know him one to do now thing, and the opportunity terms as they, and when I read; nor Lizzy, who thoughts of the scent; for a look for times, I never went to the advantage of the case; had forcibling himself. They pility and lively believe she was to treat off in situation because, I am exceal"

To generate multiple pieces of text, reset the network state between generations using resetState.

net = resetState(net);
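
For example, here is a minimal sketch that generates several independent samples by resetting the state before each one. It reuses the variables and sampling loop from the steps above; only the surrounding loop is new:

numSamples = 3;
generatedSamples = strings(numSamples,1);
for k = 1:numSamples
    net = resetState(net);
    
    % Sample a first character and generate text as before.
    generatedText = datasample(initialCharacters,1);
    X = double(char(generatedText));
    while strlength(generatedText) < maxLength
        [net,characterScores] = predictAndUpdateState(net,X,'ExecutionEnvironment','cpu');
        newCharacter = datasample(vocabulary,1,'Weights',characterScores);
        if newCharacter == endOfTextCharacter
            break
        end
        generatedText = generatedText + newCharacter;
        X = double(char(newCharacter));
    end
    
    % Restore the whitespace and newline characters.
    generatedSamples(k) = replace(generatedText, ...
        [newlineCharacter whitespaceCharacter],[newline " "]);
end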
