# selfAttentionLayer

## Description

A self-attention layer computes single-head or multihead self-attention of its input.

The layer:

1. Computes the queries, keys, and values from the input.
2. Computes the scaled dot-product attention across heads using the queries, keys, and values.
3. Merges the results from the heads.
4. Performs a linear transformation on the merged result.

## Creation

### Syntax

`layer = selfAttentionLayer(numHeads,numKeyChannels)`

`layer = selfAttentionLayer(numHeads,numKeyChannels,Name=Value)`

### Description

`layer = selfAttentionLayer(numHeads,numKeyChannels)` creates a self-attention layer and sets the `NumHeads` and `NumKeyChannels` properties.

`layer = selfAttentionLayer(numHeads,numKeyChannels,Name=Value)` sets the optional `NumValueChannels`, `OutputSize`, `HasPaddingMaskInput`, `AttentionMask`, `DropoutProbability`, `HasScoresOutput`, Parameters and Initialization, Learning Rate and Regularization, and `Name` properties.
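
As an illustration of the second syntax, the sketch below sets several optional properties at creation. The specific values (128 value channels, 64 output channels, the name `"attn"`) are arbitrary choices for this example, not recommendations.

```matlab
% Create a layer with 8 heads and 256 key/query channels, then set
% optional properties with name-value arguments.
% All numeric values and the layer name are illustrative assumptions.
layer = selfAttentionLayer(8,256, ...
    NumValueChannels=128, ...   % must be divisible by NumHeads (128/8 = 16)
    OutputSize=64, ...
    AttentionMask="causal", ...
    DropoutProbability=0.1, ...
    Name="attn");
```

Properties that are not set this way, such as `HasPaddingMaskInput`, keep their default values.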

## Properties

### Self-Attention

`NumHeads`: Number of attention heads

positive integer

This property is read-only.

Number of attention heads, specified as a positive integer that evenly divides `NumKeyChannels`.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`NumKeyChannels`: Number of channels for keys and queries

positive integer

This property is read-only.

Number of channels for the keys and queries, specified as a positive integer that is divisible by `NumHeads`.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

`NumValueChannels`: Number of channels for values

`"auto"` (default) | positive integer

Number of channels for the values, specified as one of these values:

- `"auto"`: Use `NumKeyChannels`.
- Positive integer: Use the specified number of channels. This value must be divisible by `NumHeads`.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

`OutputSize`: Number of channels of layer output

`"auto"` (default) | positive integer

Number of channels of the layer output, specified as one of these values:

- `"auto"`: Use the number of channels in the layer input.
- Positive integer: Use the specified number of channels.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

`HasPaddingMaskInput`: Flag indicating whether layer has mask input

`0` (`false`) (default) | `1` (`true`)

Flag indicating whether the layer has an input that represents the padding mask, specified as `0` (`false`) or `1` (`true`).

If the `HasPaddingMaskInput` property is `0` (`false`), then the layer has one input with the name `"in"`, which corresponds to the input data. In this case, the layer treats all elements as data.

If the `HasPaddingMaskInput` property is `1` (`true`), then the layer has two inputs with the names `"in"` and `"mask"`, which correspond to the input data and the mask, respectively. In this case, the padding mask is an array of ones and zeros. The layer uses an element of the input when the corresponding mask element is one and ignores it when the corresponding mask element is zero.

The format of the padding mask must match that of the input. The size of the `"S"` (spatial), `"T"` (time), and `"B"` (batch) dimensions of the padding mask must match the size of the corresponding dimensions in the input.

The padding mask can have any number of channels. The software uses only the values in the first channel to indicate padding values.

`AttentionMask`: Mask preventing attention to elements in key-value pairs

`"none"` (default) | `"causal"`

Mask preventing attention to elements in key-value pairs, specified as one of these values:

- `"none"`: Do not prevent attention to elements based on their positions. If `HasPaddingMaskInput` is `1` (`true`), then the layer prevents attention to padding elements only.
- `"causal"`: Prevent elements in position `M` from attending to elements in position `N`, where `N` is greater than `M`. Use this option for auto-regressive models.
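
To make the two kinds of masking concrete, the following sketch builds a causal mask and combines it with a padding mask using plain MATLAB arrays. It is an illustration of the idea only, not a description of the layer's internal implementation. The queries-by-keys layout and the example padding pattern are assumptions.

```matlab
% Illustrative mask construction for a sequence of T = 5 positions.
% Rows index the attending position M, columns index the attended
% position N. This layout is an assumption of the sketch.
T = 5;
causalMask = tril(true(T));          % allow N <= M only

% Hypothetical padding mask: the last two positions are padding.
paddingMask = [1 1 1 0 0] == 1;      % true = data, false = padding

% An entry is attendable only if both masks allow it.
combinedMask = causalMask & paddingMask;
disp(combinedMask)
```

In `combinedMask`, the columns for the two padded positions are false in every row, and entries above the diagonal are false because of the causal constraint.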

`DropoutProbability`: Dropout probability for attention scores

`0` (default) | scalar in the range [0, 1)

Probability of dropping out attention scores, specified as a scalar in the range [0, 1).

During training, the software randomly sets values in the attention scores to zero using the specified probability. These dropouts can encourage the model to learn more robust and generalizable representations by preventing it from relying too heavily on specific dependencies.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`HasScoresOutput`: Flag indicating whether layer has scores output

`0` (`false`) (default) | `1` (`true`)

Flag indicating whether the layer has an output that represents the scores (also known as the attention weights), specified as `0` (`false`) or `1` (`true`).

If the `HasScoresOutput` property is `0` (`false`), then the layer has one output with the name `"out"`, which corresponds to the output data.

If the `HasScoresOutput` property is `1` (`true`), then the layer has two outputs with the names `"out"` and `"scores"`, which correspond to the output data and the attention scores, respectively.

`InputSize`: Number of input channels

`"auto"` (default) | positive integer

This property is read-only.

Number of input channels, specified as one of these values:

- `"auto"`: Automatically determine the number of input channels when you initialize the network.
- Positive integer: Configure the layer for the specified number of input channels. `InputSize` and the number of channels in the layer input data must match.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

### Parameters and Initialization

`WeightsInitializer`: Function to initialize weights

`"glorot"` (default) | `"he"` | `"narrow-normal"` | `"zeros"` | `"ones"` | function handle

Function to initialize the query, key, value, and output weights, specified as one of these values:

- `"glorot"`: Initialize the weights with the Glorot initializer (also known as the Xavier initializer) [2]. The Glorot initializer independently samples from a uniform distribution with zero mean and a variance of `2/(numIn + numOut)`.
- `"he"`: Initialize the weights with the He initializer [3]. The He initializer samples from a normal distribution with zero mean and a variance of `2/numIn`.
- `"narrow-normal"`: Initialize the weights by independently sampling from a normal distribution with zero mean and a standard deviation of 0.01.
- `"zeros"`: Initialize the weights with zeros.
- `"ones"`: Initialize the weights with ones.
- Function handle: Initialize the weights with a custom function. If you specify a function handle, then the function must have the form `weights = func(sz)`, where `sz` is the size of the weights. For an example, see Specify Custom Weight Initialization Function.

For the `"glorot"` and `"he"` initializers, the values of `numIn` and `numOut` depend on the weight matrix:

Weight | `numIn` | `numOut` |
---|---|---|
Query | `InputSize` | `NumKeyChannels` |
Key | `InputSize` | `NumKeyChannels` |
Value | `InputSize` | `NumValueChannels` |
Output | `NumValueChannels` | `OutputSize` |

The layer only initializes the weights when the corresponding weights property is empty.

**Data Types:** `char` | `string` | `function_handle`
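
The function handle option can be sketched as follows. The initializer here, `glorotLike`, is a hypothetical name for a hand-written function that mimics the Glorot variance of `2/(numIn + numOut)` by sampling uniformly from `[-b, b]` with `b = sqrt(6/(numIn + numOut))`. It assumes that `sz` is ordered as `[numOut numIn]`, which matches the documented shapes of the weight properties but is otherwise an assumption of this example.

```matlab
% Hypothetical custom initializer of the form weights = func(sz).
% A uniform distribution on [-b, b] has variance b^2/3, so
% b = sqrt(6/(numIn + numOut)) gives variance 2/(numIn + numOut).
% Assumption: sz = [numOut numIn].
glorotLike = @(sz) (2*rand(sz) - 1) * sqrt(6/(sz(1) + sz(2)));

% Pass the handle when creating the layer.
layer = selfAttentionLayer(4,12,WeightsInitializer=glorotLike);

% Check the initializer on its own for a 12-by-12 weight matrix.
W = glorotLike([12 12]);
```

Every element of `W` lies within `sqrt(6/24) = 0.5` of zero, which is the bound implied by the formula for this size.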

`BiasInitializer`: Function to initialize biases

`"zeros"` (default) | `"narrow-normal"` | `"ones"` | function handle

Function to initialize the query, key, value, and output biases, specified as one of these values:

- `"zeros"`: Initialize the biases with zeros.
- `"ones"`: Initialize the biases with ones.
- `"narrow-normal"`: Initialize the biases by independently sampling from a normal distribution with zero mean and a standard deviation of 0.01.
- Function handle: Initialize the biases with a custom function. If you specify a function handle, then the function must have the form `bias = func(sz)`, where `sz` is the size of the biases.

The layer only initializes the biases when the corresponding bias property is empty.

**Data Types:** `char` | `string` | `function_handle`

`QueryWeights`: Query weights

`[]` (default) | matrix

Query weights, specified as a `NumKeyChannels`-by-`numInputChannels` matrix or `[]`, where `numInputChannels` is the number of channels in the layer input.

**Data Types:** `single` | `double`

`KeyWeights`: Key weights

`[]` (default) | matrix

Key weights, specified as a `NumKeyChannels`-by-`numInputChannels` matrix or `[]`, where `numInputChannels` is the number of channels in the layer input.

**Data Types:** `single` | `double`

`ValueWeights`: Value weights

`[]` (default) | matrix

Value weights, specified as a `NumValueChannels`-by-`numInputChannels` matrix or `[]`, where `numInputChannels` is the number of channels in the layer input.

**Data Types:** `single` | `double`

`OutputWeights`: Output weights

`[]` (default) | matrix

Output weights, specified as an `OutputSize`-by-`NumValueChannels` matrix or `[]`.

**Data Types:** `single` | `double`

`QueryBias`: Query biases

`[]` (default) | vector

Query biases, specified as a `NumKeyChannels`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`

`KeyBias`: Key biases

`[]` (default) | vector

Key biases, specified as a `NumKeyChannels`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`

`ValueBias`: Value biases

`[]` (default) | vector

Value biases, specified as a `NumValueChannels`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`

`OutputBias`: Output biases

`[]` (default) | vector

Output biases, specified as an `OutputSize`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`

### Learning Rate and Regularization

`WeightLearnRateFactor`: Learning rate factor for weights

`1` (default) | nonnegative scalar

Learning rate factor for the query, key, value, and output weights, specified as a nonnegative scalar.

The software multiplies this factor by the global learning rate to determine the learning rate for the weights in this layer. For example, if `WeightLearnRateFactor` is `2`, then the learning rate for the weights in this layer is twice the current global learning rate. The software determines the global learning rate based on the settings you specify using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`BiasLearnRateFactor`: Learning rate factor for biases

`1` (default) | nonnegative scalar

Learning rate factor for the query, key, value, and output biases, specified as a nonnegative scalar.

The software multiplies this factor by the global learning rate to determine the learning rate for the biases in this layer. For example, if `BiasLearnRateFactor` is `2`, then the learning rate for the biases in the layer is twice the current global learning rate. The software determines the global learning rate based on the settings you specify using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`WeightL2Factor`: $$L_2$$ regularization factor for weights

`1` (default) | nonnegative scalar

$$L_2$$ regularization factor for the query, key, value, and output weights, specified as a nonnegative scalar.

The software multiplies this factor by the global $$L_2$$ regularization factor to determine the $$L_2$$ regularization for the weights in this layer. For example, if `WeightL2Factor` is `2`, then the $$L_2$$ regularization for the weights in this layer is twice the global $$L_2$$ regularization factor. You can specify the global $$L_2$$ regularization factor using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`BiasL2Factor`: $$L_2$$ regularization factor for biases

`0` (default) | nonnegative scalar

$$L_2$$ regularization factor for the query, key, value, and output biases, specified as a nonnegative scalar.

The software multiplies this factor by the global $$L_2$$ regularization factor to determine the $$L_2$$ regularization for the biases in this layer. For example, if `BiasL2Factor` is `2`, then the $$L_2$$ regularization for the biases in this layer is twice the global $$L_2$$ regularization factor. The software determines the global $$L_2$$ regularization factor based on the settings you specify using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
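
The arithmetic behind these factors can be sketched as below. The global learning rate and global regularization factor are hypothetical values standing in for whatever is configured through `trainingOptions`, and the per-layer factors are arbitrary choices for illustration.

```matlab
% Set per-layer factors at creation (values are illustrative).
layer = selfAttentionLayer(4,12, ...
    WeightLearnRateFactor=2, ...
    BiasLearnRateFactor=0, ...
    WeightL2Factor=0.5);

% Hypothetical global settings, as would be configured via trainingOptions.
globalLearnRate = 1e-3;
globalL2 = 1e-4;

% Effective per-layer values are the factor times the global value.
weightLearnRate = layer.WeightLearnRateFactor * globalLearnRate;
biasLearnRate = layer.BiasLearnRateFactor * globalLearnRate;
weightL2 = layer.WeightL2Factor * globalL2;
```

With these numbers, the weights learn at twice the global rate, the bias learning rate is zero, and the weight regularization is half the global factor.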

### Layer

`Name`: Layer name

`""` (default) | character vector | string scalar

`NumInputs`: Number of inputs

`1` | `2`

Number of inputs to the layer, returned as `1` or `2`.

If the `HasPaddingMaskInput` property is `0` (`false`), then the layer has one input with the name `"in"`, which corresponds to the input data. In this case, the layer treats all elements as data.

If the `HasPaddingMaskInput` property is `1` (`true`), then the layer has two inputs with the names `"in"` and `"mask"`, which correspond to the input data and the mask, respectively. For the requirements on the padding mask, see the `HasPaddingMaskInput` property.

**Data Types:** `double`

`InputNames`: Input names

`"in"` | `["in" "mask"]`

Input names of the layer, returned as a cell array of character vectors.

If the `HasPaddingMaskInput` property is `0` (`false`), then the layer has one input with the name `"in"`, which corresponds to the input data. In this case, the layer treats all elements as data.

If the `HasPaddingMaskInput` property is `1` (`true`), then the layer has two inputs with the names `"in"` and `"mask"`, which correspond to the input data and the mask, respectively. For the requirements on the padding mask, see the `HasPaddingMaskInput` property.

The `SelfAttentionLayer` object stores this property as a cell array of character vectors.

`NumOutputs`: Number of outputs

`1` (default) | `2`

This property is read-only.

Number of outputs of the layer.

If the `HasScoresOutput` property is `0` (`false`), then the layer has one output with the name `"out"`, which corresponds to the output data.

If the `HasScoresOutput` property is `1` (`true`), then the layer has two outputs with the names `"out"` and `"scores"`, which correspond to the output data and the attention scores, respectively.

**Data Types:** `double`

`OutputNames`: Output names

`"out"` (default) | `["out" "scores"]`

This property is read-only.

Output names of the layer.

If the `HasScoresOutput` property is `0` (`false`), then the layer has one output with the name `"out"`, which corresponds to the output data.

If the `HasScoresOutput` property is `1` (`true`), then the layer has two outputs with the names `"out"` and `"scores"`, which correspond to the output data and the attention scores, respectively.

The `SelfAttentionLayer` object stores this property as a cell array of character vectors.
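
As a quick check of how the two flags relate to the port properties, the following sketch enables both the padding mask input and the scores output and then queries the resulting counts and names. The head and channel counts are arbitrary.

```matlab
% Enable the optional mask input and scores output.
layer = selfAttentionLayer(4,12, ...
    HasPaddingMaskInput=true, ...
    HasScoresOutput=true);

% Query the port counts and names.
numIn = layer.NumInputs;
inNames = layer.InputNames;
numOut = layer.NumOutputs;
outNames = layer.OutputNames;
```

Per the property descriptions, `inNames` should contain `'in'` and `'mask'`, and `outNames` should contain `'out'` and `'scores'`.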

## Examples

### Create Self-Attention Layer

Create a self-attention layer with eight heads and 256 key and query channels.

    layer = selfAttentionLayer(8,256)

Output:

    layer = 
      SelfAttentionLayer with properties:

                       Name: ''
              AttentionMask: 'none'
        HasPaddingMaskInput: 0
            HasScoresOutput: 0

       Hyperparameters
                  InputSize: 'auto'
                   NumHeads: 8
             NumKeyChannels: 256
           NumValueChannels: 'auto'
                 OutputSize: 'auto'
         DropoutProbability: 0

       Learnable Parameters
               QueryWeights: []
                 KeyWeights: []
               ValueWeights: []
              OutputWeights: []
                  QueryBias: []
                    KeyBias: []
                  ValueBias: []
                 OutputBias: []

    Use properties method to see a list of all properties.

Include a self-attention layer in a layer array.

    layers = [
        sequenceInputLayer(12)
        selfAttentionLayer(4,12)
        layerNormalizationLayer
        fullyConnectedLayer(9)
        softmaxLayer];
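
To see the layer operate on data, a shortened version of such an array can be wrapped in a `dlnetwork` object and evaluated with `predict`. This is a minimal sketch with random input. The batch size of 2 and sequence length of 7 are arbitrary assumptions.

```matlab
% Minimal network: sequence input followed by self-attention.
layers = [
    sequenceInputLayer(12)
    selfAttentionLayer(4,12)];
net = dlnetwork(layers);

% Random input: 12 channels, 2 observations, 7 time steps.
X = dlarray(rand(12,2,7),"CBT");
Y = predict(net,X);

outputSize = size(Y);
outputFormat = dims(Y);
```

Because `OutputSize` defaults to `"auto"`, the output is expected to have the same number of channels as the input and the same `"CBT"` format.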

## Algorithms

### Dot-Product Attention

The attention operation focuses on parts of the input using weighted multiplication operations.

The single-head dot-product attention operation is given by

$$\text{attention}(Q,K,V)=\text{dropout}\left(\text{softmax}\left(\text{mask}\left(\lambda Q{K}^{\top},M\right)\right),p\right)V,$$

where:

- *Q* denotes the queries.
- *K* denotes the keys.
- *V* denotes the values.
- $$\lambda $$ denotes the scaling factor.
- *M* is a mask array of ones and zeros.
- *p* is the dropout probability.

The mask operation includes or excludes values of the scaled matrix product by setting them to $$-\infty $$ where the corresponding mask element is zero. The mask is the union of the padding and attention masks. The softmax function normalizes the value of the input data across the channel dimension such that it sums to one. The dropout operation sets elements to zero with probability *p*.
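
The formula above can be sketched with plain MATLAB matrices. This is an illustration under stated assumptions, not the layer's implementation: rows of `Q`, `K`, and `V` index sequence positions, the softmax is taken over each row of the score matrix, the scaling factor is set to `1/sqrt(numKeyChannels)` as in [1], the mask is an example causal mask, and dropout is omitted (that is, *p* is zero).

```matlab
% Single-head dot-product attention without dropout (p = 0).
rng(0);
numPositions = 4; numKeyChannels = 3; numValueChannels = 2;
Q = randn(numPositions,numKeyChannels);
K = randn(numPositions,numKeyChannels);
V = randn(numPositions,numValueChannels);

lambda = 1/sqrt(numKeyChannels);   % assumed scaling factor
M = tril(true(numPositions));      % example mask: true = keep, false = exclude

S = lambda*(Q*K');                 % scaled scores, one row per query position
S(~M) = -Inf;                      % mask: excluded entries become -Inf

% Row-wise softmax; subtracting the row maximum avoids overflow.
A = exp(S - max(S,[],2));
A = A ./ sum(A,2);

Y = A*V;                           % attention output, positions-by-value channels
```

Each row of `A` sums to one, and masked entries are exactly zero because `exp(-Inf)` is zero.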

### Multihead Self-Attention

The multihead self-attention operation for the input *X* is given by

$$\text{multiheadSelfAttention}(X,h,{W}^{Q},{W}^{K},{W}^{V},{W}^{O})=\text{concatenate}({\text{head}}_{1},\dots ,{\text{head}}_{h}){W}^{O},$$

where:

- *h* is the number of heads.
- $$W^{Q}$$ is a learnable projection matrix for the queries.
- $$W^{K}$$ is a learnable projection matrix for the keys.
- $$W^{V}$$ is a learnable projection matrix for the values.
- $$W^{O}$$ is a learnable projection matrix for the output.

Each weight matrix is composed of concatenated weight matrices $$W_{i}$$ for each head. Each $${\text{head}}_{i}$$ denotes the output of the head operation given by

$${\text{head}}_{i}=\text{selfAttention}\left(X{W}_{i}^{Q},X{W}_{i}^{K},X{W}_{i}^{V}\right).$$
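
A plain-MATLAB sketch of the multihead computation follows. It mirrors the orientation of the formulas (rows of `X` index positions and the projections multiply from the right), which differs from the shapes in which the layer stores its weight properties. The sizes, the `1/sqrt(dk)` scaling, and the absence of masks, biases, and dropout are simplifying assumptions.

```matlab
% Multihead self-attention with h heads, no mask, no bias, no dropout.
rng(0);
T = 5; c = 8;            % sequence length and input channels
h = 2; dk = 4; dv = 4;   % heads and per-head key/value channels
X = randn(T,c);

% Projection matrices; columns are grouped per head.
WQ = randn(c,h*dk); WK = randn(c,h*dk);
WV = randn(c,h*dv); WO = randn(h*dv,c);

heads = cell(1,h);
for i = 1:h
    kIdx = (i-1)*dk + (1:dk);       % columns of W_i^Q and W_i^K
    vIdx = (i-1)*dv + (1:dv);       % columns of W_i^V
    Q = X*WQ(:,kIdx); K = X*WK(:,kIdx); V = X*WV(:,vIdx);
    S = (Q*K')/sqrt(dk);            % assumed scaling factor
    A = exp(S - max(S,[],2));       % row-wise softmax
    A = A ./ sum(A,2);
    heads{i} = A*V;
end

Y = [heads{:}]*WO;                  % concatenate heads, then output projection
```

The result `Y` has one row per position and `c` columns, matching the case in which the output size equals the number of input channels.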

### Layer Input and Output Formats

Layers in a layer array or layer graph pass data to subsequent layers as formatted `dlarray` objects. The format of a `dlarray` object is a string of characters, in which each character describes the corresponding dimension of the data. The formats consist of one or more of these characters:

- `"S"`: Spatial
- `"C"`: Channel
- `"B"`: Batch
- `"T"`: Time
- `"U"`: Unspecified

For example, 2-D image data that is represented as a 4-D array, where the first two dimensions correspond to the spatial dimensions of the images, the third dimension corresponds to the channels of the images, and the fourth dimension corresponds to the batch dimension, can be described as having the format `"SSCB"` (spatial, spatial, channel, batch).

You can interact with these `dlarray` objects in automatic differentiation workflows, such as those for developing a custom layer, using a `functionLayer` object, or using the `forward` and `predict` functions with `dlnetwork` objects.

This table shows the supported input formats of `SelfAttentionLayer` objects and the corresponding output format. If the software passes the output of the layer to a custom layer that does not inherit from the `nnet.layer.Formattable` class, or a `FunctionLayer` object with the `Formattable` property set to `0` (`false`), then the layer receives an unformatted `dlarray` object with dimensions ordered according to the formats in this table. The formats listed here are only a subset. The layer may support additional formats such as formats with additional `"S"` (spatial) or `"U"` (unspecified) dimensions.

Input Format | Output Format |
---|---|
`"CB"` (channel, batch) | `"CB"` (channel, batch) |
`"SCB"` (spatial, channel, batch) | `"SCB"` (spatial, channel, batch) |
`"CBT"` (channel, batch, time) | `"CBT"` (channel, batch, time) |
`"SC"` (spatial, channel) | `"SC"` (spatial, channel) |
`"CT"` (channel, time) | `"CT"` (channel, time) |
`"SB"` (spatial, batch) | `"SCB"` (spatial, channel, batch) |
`"BT"` (batch, time) | `"CBT"` (channel, batch, time) |
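
As a small illustration of formatted data, the sketch below creates a `dlarray` from an array stored in time-by-channel-by-batch order. The sizes are arbitrary. A formatted `dlarray` reorders its dimensions into a standard order, so the labels and sizes reported afterward differ from the order used at creation.

```matlab
% Data stored as 7 time steps by 12 channels by 2 observations.
X = dlarray(rand(7,12,2),"TCB");

% Query the format and size after dlarray reorders the dimensions.
fmt = dims(X);
sz = size(X);
```

Here `fmt` is `'CBT'` and `sz` is `[12 2 7]`, which corresponds to the `"CBT"` row of the table.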

## References

[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In *Advances in Neural Information Processing Systems*, Vol. 30. Curran Associates, Inc., 2017. https://papers.nips.cc/paper/7181-attention-is-all-you-need.

[2] Glorot, Xavier, and Yoshua Bengio. "Understanding the Difficulty of Training Deep Feedforward Neural Networks." In *Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics*, 249–256. Sardinia, Italy: AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

[3] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." In *2015 IEEE International Conference on Computer Vision (ICCV)*, 1026–34. Santiago, Chile: IEEE, 2015. https://doi.org/10.1109/ICCV.2015.123

## Extended Capabilities

### C/C++ Code Generation

Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

Code generation is not supported when `HasScoresOutput` is set to `true`.

### GPU Code Generation

Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.

Refer to the usage notes and limitations in the C/C++ Code Generation section. The same limitations apply to GPU code generation.

## Version History

**Introduced in R2023a**
