Embed a mini-batch of text data.

Create an array of tokenized documents.
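The text data itself is not shown in this example. As a sketch, here are two hypothetical sentences chosen so that their token indices line up with the sequence output shown later:

```matlab
% Hypothetical text data (an assumption; the original strings are not shown).
textData = [
    "a dog barks loudly at the mail carrier every morning"
    "two cats chase quietly one dog around their yard each morning"];

% Tokenize the text.
documents = tokenizedDocument(textData);
```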

To encode text data as sequences of numeric indices, create a `wordEncoding` object.
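Assuming the tokenized documents from the previous step are stored in `documents`, this step might look like:

```matlab
% Map each unique word in the documents to a numeric index.
enc = wordEncoding(documents);
```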

Initialize the embedding weights. Specify an embedding dimension of 100 and a vocabulary size equal to the number of words in the word encoding plus one. The extra vector holds the embedding for padding and out-of-vocabulary tokens.
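A sketch of the initialization, assuming the word encoding from the previous step is stored in `enc`:

```matlab
embeddingDimension = 100;
vocabularySize = enc.NumWords;

% The extra column holds the vector for padding and out-of-vocabulary tokens.
weights = rand(embeddingDimension, vocabularySize + 1);
```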

Convert the tokenized documents to sequences of word indices using the `doc2sequence` function. By default, the `doc2sequence` function discards out-of-vocabulary tokens in the input data. To map out-of-vocabulary tokens to the last vector of the embedding weights instead, set the `'UnknownWord'` option to `'nan'`. Also by default, the `doc2sequence` function left-pads the input sequences with zeros so that they have the same length.
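Assuming the variables from the previous steps, the conversion is a single call (leaving off the semicolon displays the result shown below):

```matlab
% Map out-of-vocabulary tokens to NaN instead of discarding them.
sequences = doc2sequence(enc, documents, 'UnknownWord', 'nan')
```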

```
sequences = 2×1 cell array
    {[ 0  1  2  3  4  5  6  7  8  9 10]}
    {[11 12 13 14 15  2 16 17 18 19 10]}
```

The output is a cell array, where each element corresponds to an observation. Each element is a row vector whose entries represent the individual tokens in the corresponding observation, including the padding values.

Convert the cell array to a numeric array by vertically concatenating the rows.
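One way to do this, assuming the cell array from the previous step is stored in `sequences`:

```matlab
% Vertically concatenate the row vectors into a 2-by-11 matrix.
X = cat(1, sequences{:})
```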

```
X = 2×11

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10
```

Convert the numeric indices to a `dlarray` object. Because the rows and columns of `X` correspond to observations and time steps, respectively, specify the format `'BT'`.
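Assuming the numeric array from the previous step is stored in `X`:

```matlab
% 'B' (batch) labels the rows, 'T' (time) labels the columns.
dlX = dlarray(X, 'BT')
```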

```
dlX =

  2(B) x 11(T) dlarray

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10
```

Embed the numeric indices using the `embed` function. The `embed` function maps the padding tokens (tokens with index 0) and any other out-of-vocabulary tokens to the same out-of-vocabulary embedding vector.
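Assuming the indices and weights from the previous steps, the call is a single line. A possible implementation of `embed` is sketched below it; this is an assumption for illustration, and the actual function may differ.

```matlab
dlY = embed(dlX, weights);

function dlZ = embed(dlX, weights)
% Sketch only: gather embedding vectors by indexing into the weights.
X = extractdata(dlX);                    % N-by-T matrix of indices
[N, T] = size(X);
vocabularySize = size(weights, 2) - 1;

% Padding (0) and out-of-vocabulary (NaN) indices share the last vector.
X(isnan(X) | X == 0) = vocabularySize + 1;

% Look up each index and reshape to embeddingDimension-by-N-by-T.
Z = reshape(weights(:, X(:)), [], N, T);
dlZ = dlarray(Z, 'CBT');
end
```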

In this case, the output is an `embeddingDimension`-by-`N`-by-`S` array with format `'CBT'`, where `N` and `S` are the number of observations and the number of time steps, respectively. The vector `dlY(:,n,t)` corresponds to the embedding vector of time step `t` of observation `n`.