Top-k Sparse Attention Layer

A top-k sparse attention layer implemented with Deep Learning Toolbox, based on the custom deep learning layer template.

The parameterized top-K sparse attention mechanism is an efficient approximation of standard scaled dot-product attention. For each query, it retains only the K largest similarity scores across all keys and sets the rest to −∞. After softmax, the masked positions receive a weight of exactly zero, so each query attends to at most K keys, reducing both memory and computational complexity from O(N²) to O(NK) for a sequence of length N. The hyperparameter K directly controls the sparsity-accuracy trade-off.
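
The masking step described above can be illustrated with a small, language-independent sketch. The Python below is not the submission's MATLAB code: the function name `topk_sparse_attention` and the list-of-lists data layout are assumptions made for illustration. It also computes every score densely before masking, so it shows the sparsity pattern rather than the complexity savings.

```python
import math

def topk_sparse_attention(Q, K, V, k):
    """Illustrative top-k sparse attention (hypothetical helper).

    Q: n_q x d, K: n_k x d, V: n_k x d_v, all as lists of lists.
    Returns (outputs, weights); each weight row has at most k nonzeros.
    """
    if not 1 <= k <= len(K):
        raise ValueError("k must satisfy 1 <= k <= number of keys")
    scale = 1.0 / math.sqrt(len(Q[0]))
    outputs, weights = [], []
    for q in Q:
        # Scaled dot-product scores of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in K]
        # Indices of the k largest scores.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        keep = set(order[:k])
        # Every other score is set to -inf, so its softmax weight is exactly 0.
        masked = [s if j in keep else -math.inf for j, s in enumerate(scores)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# One query, three keys, keep the two largest scores.
out, w = topk_sparse_attention(
    [[1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[1.0], [2.0], [3.0]],
    k=2,
)
```

In this example the third key has the lowest score, so its weight is exactly zero and the remaining two weights sum to one. Setting k equal to the number of keys recovers ordinary softmax attention.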
In the implementation, sparsification relies on a hard threshold mask derived from the top-K selection. The mask is treated as a constant when gradients are computed, so gradients flow only through the selected scores, a formulation that follows the straight-through estimator approach. The layer can be plugged directly into a dlnetwork and is well suited to tasks involving long sequences where standard attention is prohibitively expensive, such as efficient Transformers or resource-constrained time-series forecasting.
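
To make the gradient behavior concrete, the following Python sketch freezes the top-K mask and estimates gradients with central finite differences. It is only an illustration under stated assumptions: the helper names (`topk_indices`, `masked_softmax`, `grad_wrt_scores`) are hypothetical, and finite differences stand in for whatever differentiation the MATLAB layer actually uses.

```python
import math

def topk_indices(scores, k):
    """Indices of the k largest scores (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])

def masked_softmax(scores, keep):
    """Softmax over the kept indices only; all other weights are exactly 0."""
    m = max(scores[j] for j in keep)
    exps = [math.exp(s - m) if j in keep else 0.0 for j, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_scores(scores, values, k, eps=1e-6):
    """Central finite differences of y = sum_j w_j * v_j with a frozen mask."""
    keep = topk_indices(scores, k)  # selected once, then held constant
    grads = []
    for i in range(len(scores)):
        up = list(scores)
        up[i] += eps
        down = list(scores)
        down[i] -= eps
        y_up = sum(w * v for w, v in zip(masked_softmax(up, keep), values))
        y_down = sum(w * v for w, v in zip(masked_softmax(down, keep), values))
        grads.append((y_up - y_down) / (2 * eps))
    return grads

# Four scores for a single query, scalar values, keep the two largest scores.
grads = grad_wrt_scores([2.0, 1.0, -1.0, 0.5], [1.0, 2.0, 3.0, 4.0], k=2)
```

Here the scores at indices 2 and 3 fall outside the top two, so their gradient entries are exactly zero, while the two selected scores receive nonzero gradients of equal magnitude and opposite sign.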

Cite As

Chuguang Pan (2026). Top-k Sparse Attention Layer (https://www.mathworks.com/matlabcentral/fileexchange/184003-top-k-sparse-attention-layer), MATLAB Central File Exchange. Retrieved .

General Information

MATLAB Release Compatibility

  • Compatible with R2025a to R2026a

Platform Compatibility

  • Windows
  • macOS
  • Linux

Version

  • 1.0.0