Signal Processing View of Transformers
Attention as Adaptive Filtering
Self-attention can be viewed as a form of adaptive filtering: given input tokens, attention computes data-dependent filter coefficients (attention weights) that determine how information from different positions is combined [@dong2021attention, @cordonnier2020relationship]. From this perspective:
- Query-key dot products compute similarity, analogous to matched filtering -- the query acts as a template that is correlated against all keys to find the best matches, exactly as in radar signal processing where a transmitted pulse is correlated against the received signal to detect targets.
- Softmax normalization creates a probability distribution over positions, analogous to a normalized filter response. The temperature scaling (division by sqrt(d_k)) controls the sharpness of this distribution, analogous to the bandwidth of a bandpass filter: smaller temperature produces sharper, more selective attention (narrow-band filtering), while larger temperature produces smoother, more distributed attention (wide-band filtering).
- Value aggregation is a weighted sum, analogous to applying the computed filter -- the output is a linear combination of value vectors weighted by the attention pattern, identical to the output of an FIR filter with data-dependent coefficients.
- Multi-head attention computes multiple independent attention patterns, analogous to a filter bank that decomposes the input into multiple frequency bands. Each head can specialize in different "frequency" ranges -- empirically, different heads in trained Transformers attend to different distance ranges, with some heads focusing on local patterns (high-frequency, fine-grained) and others on global patterns (low-frequency, coarse-grained) (Voita et al., 2019).
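Read this way, a single attention head is only a few lines of NumPy. The sketch below is a minimal, unbatched illustration (no masking, no learned projections), not a production implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention, read as an adaptive FIR filter:
    the softmax weights are data-dependent filter coefficients applied to V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # matched-filter correlations
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # normalized filter response
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)  # one filtered output vector per position, shape (6, 4)
```

Because each row of the weight matrix is a convex combination, every output coordinate stays inside the range of the corresponding value coordinates, exactly as for a normalized FIR filter.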
This interpretation connects attention to a rich literature on adaptive signal processing, including Wiener filters (optimal linear filters for stationary signals), Kalman filters (optimal recursive filters for non-stationary signals), and beamforming (spatial filtering for array signal processing) (Haykin, 2002). The key distinction is that attention is a non-causal, global adaptive filter -- it can attend to future positions (in encoder self-attention) and has no fixed filter structure, making it more flexible but also more expensive than classical adaptive filters.
Positional Encoding as Frequency Representation
Sinusoidal positional encodings (Vaswani et al., 2017) represent position as a combination of sinusoids at different frequencies, directly analogous to a Fourier basis representation. The choice of frequency range determines the length scales the model can distinguish. Rotary Position Embedding (RoPE) (Su et al., 2024) extends this by applying rotation matrices to queries and keys, encoding relative position through the interference pattern of complex exponentials -- a technique with close parallels to frequency-domain signal processing. The Long Range Arena benchmark (Tay et al., 2021) has been instrumental in evaluating how different architectures handle long-range dependencies, revealing fundamental differences between attention-based, convolution-based, and recurrence-based approaches from a signal processing perspective.
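The relative-position property of RoPE can be verified numerically. The sketch below follows the rotation idea of Su et al. but uses our own pairing convention and helper names: it rotates consecutive dimension pairs of a query and key and checks that their dot product depends only on the positional offset:

```python
import numpy as np

def rope(x, pos, freqs):
    """Rotary position embedding sketch: rotate consecutive dimension pairs
    of x by an angle pos * freq, one frequency per pair."""
    x1, x2 = x[0::2], x[1::2]
    c, s = np.cos(pos * freqs), np.sin(pos * freqs)
    out = np.empty_like(x)
    out[0::2] = x1 * c - x2 * s
    out[1::2] = x1 * s + x2 * c
    return out

d = 8
freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))  # geometric frequency ladder
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, d))

# The query-key dot product depends only on the relative offset m - n:
a = rope(q, 5, freqs) @ rope(k, 2, freqs)    # positions (5, 2), offset 3
b = rope(q, 10, freqs) @ rope(k, 7, freqs)   # positions (10, 7), offset 3
print(np.allclose(a, b))  # True
```

This is the interference-pattern argument in miniature: composing the two rotations leaves only the difference of the angles, so absolute position cancels out of every query-key score.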
Token Mixing as Convolution
Many efficient Transformer alternatives replace attention with fixed or learned convolution-like operations. From a signal processing perspective, these methods trade the adaptivity of attention (data-dependent filters) for the efficiency of convolution (shared, shift-invariant filters) [@leethorp2022fnet, @gu2024mamba]:
- FNet (Lee-Thorp et al., 2022): Replaces attention with the DFT, a fixed, parameter-free linear mixing transform. By applying a 2D FFT (along both the sequence and hidden dimensions), FNet achieves 92% of BERT's accuracy on GLUE while training 80% faster. The DFT is a maximally efficient global mixing operation -- it has O(n log n) complexity and mixes all positions with equal weight (a flat frequency response).
- gMLP (Liu et al., 2021): Uses spatial gating units, a form of element-wise multiplication in a projected space. The gating mechanism can be viewed as a learned, channel-dependent filter that is applied across the spatial dimension, providing a middle ground between the fixed DFT of FNet and the fully adaptive attention of Transformers.
- Hyena (Poli et al., 2023): Uses data-controlled long convolutions, bridging adaptive attention and fixed convolution. Hyena parameterizes implicit convolution kernels using a small neural network, enabling filters that are data-dependent (like attention) but structured as convolutions (like classical filters). On language modeling, Hyena matches the quality of attention-based Transformers at sequence lengths up to 8K while being significantly faster at longer sequences due to its O(n log n) FFT-based implementation.
- State Space Models (S4, Mamba): The S4 architecture (Gu et al., 2022) and its successor Mamba (Gu & Dao, 2024) can be viewed through a signal processing lens as implementing a bank of IIR (infinite impulse response) filters. The continuous-time state space model dx/dt = Ax + Bu, y = Cx + Du is a classical linear time-invariant (LTI) system, and S4's key contribution is the HiPPO initialization of the A matrix, which produces filters that approximate optimal polynomial projections of the input history. After discretization, the resulting filters have long-range memory with polynomial decay (unlike the exponential decay of vanilla RNNs), enabling S4 to capture dependencies over thousands of time steps. Mamba makes the system parameters input-dependent (A, B, C become functions of the input), transforming the LTI system into a linear time-varying (LTV) system -- the state space analog of moving from fixed convolution to adaptive attention.
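The IIR-filter reading of state space models can be made concrete with a diagonal continuous-time system (as in diagonal SSM variants), discretized with zero-order hold and run as a linear recurrence. This is a minimal sketch of the filtering view, not the actual S4/Mamba parameterization:

```python
import numpy as np

def ssm_scan(a, b, c, u, dt=0.1):
    """Discretize the diagonal continuous SSM dx/dt = a*x + b*u, y = c.x
    with zero-order hold, then run it as a bank of first-order IIR filters."""
    ad = np.exp(a * dt)          # discrete state transition (per channel)
    bd = (ad - 1.0) / a * b      # ZOH-discretized input matrix
    h = np.zeros_like(a)
    ys = []
    for u_t in u:                # linear recurrence: O(n) in sequence length
        h = ad * h + bd * u_t
        ys.append(c @ h)
    return np.array(ys)

a = -np.array([0.5, 1.0, 2.0, 4.0])  # stable poles; slower poles = longer memory
b = np.ones(4)
c = np.ones(4) / 4
u = np.r_[1.0, np.zeros(19)]         # impulse input
y = ssm_scan(a, b, c, u)             # impulse response: a slowly decaying memory
```

The impulse response decays smoothly over many steps, illustrating the long-memory behavior; Mamba's selectivity amounts to letting `a`, `b`, and `c` depend on the input at each step.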
This progression from fixed convolution to data-dependent filtering exactly mirrors the classical signal processing hierarchy from FIR filters to adaptive filters. The tradeoff is fundamental: fixed filters are efficient (O(n log n) via FFT) and well-understood theoretically, while adaptive filters are more expressive but more expensive (O(n^2) for full attention). The research trajectory from FNet through Hyena to Mamba represents a systematic exploration of this tradeoff, seeking architectures that are as expressive as attention but as efficient as convolution.
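The efficiency half of this tradeoff rests on the convolution theorem: applying a fixed global filter costs O(n log n) via the FFT rather than O(n^2) directly. A quick numerical check (using circular convolution for simplicity):

```python
import numpy as np

n = 256
rng = np.random.default_rng(1)
x = rng.normal(size=n)   # input sequence
k = rng.normal(size=n)   # a fixed length-n filter

# Direct circular convolution: O(n^2)
direct = np.array([k @ x[(i - np.arange(n)) % n] for i in range(n)])

# Same circular convolution via the FFT: O(n log n)
fast = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

print(np.allclose(direct, fast))  # True
```

Adaptive attention forfeits exactly this shortcut: because its "filter coefficients" change with the data at every position, there is no shared kernel to diagonalize in the frequency domain.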
Randomized Attention Approximations
The connection between attention and random features has spawned a family of efficient attention mechanisms that use randomized algorithms from Section 5.4:
Performer (Choromanski et al., 2021) replaces the softmax attention kernel with FAVOR+ (Fast Attention Via positive Orthogonal Random features). The key insight is that the unnormalized softmax attention matrix is built from a kernel, exp(q^T k / sqrt(d)), and this kernel can be approximated using random features (via the random features framework of Rahimi and Recht). FAVOR+ uses positive orthogonal random features to achieve O(n) attention computation (linear in sequence length) with provable approximation guarantees. Performer matches standard Transformer quality on many tasks while being 2-4x faster for long sequences.
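The random-feature estimator behind this can be checked numerically. The sketch below uses plain i.i.d. Gaussian features for the exponential kernel, omitting Performer's orthogonalization and the 1/sqrt(d) scaling, so it illustrates the estimator rather than FAVOR+ itself:

```python
import numpy as np

def phi(x, W):
    """Positive random features for the exponential kernel (a FAVOR+-style
    sketch): exp(q.k) = E_w[phi(q) phi(k)] for rows w of W drawn N(0, I)."""
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * np.sum(x * x, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 20000                    # more random features -> lower variance
q = 0.3 * rng.normal(size=(1, d))  # small norms keep the estimator well-behaved
k = 0.3 * rng.normal(size=(1, d))
W = rng.normal(size=(m, d))        # i.i.d. Gaussian features (not orthogonalized)

exact = np.exp(q @ k.T)            # the exponential kernel attention is built from
approx = phi(q, W) @ phi(k, W).T   # random-feature estimate of the same quantity
```

Because `phi` is strictly positive, the estimated attention weights never change sign, which is the property that distinguishes FAVOR+ from plain random Fourier features and keeps the softmax approximation stable.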
Reformer (Kitaev et al., 2020) uses locality-sensitive hashing (LSH) (from Section 5.5) to identify the most relevant key-value pairs for each query, reducing attention from O(n^2) to O(n log n). The idea is that for each query, only a small number of keys have significant attention weight (attention is typically sparse), and LSH can efficiently identify these high-attention keys without computing all dot products. Reformer also introduces reversible residual connections (from Section 5.3 of this chapter, connecting to memory-efficient computation) to reduce activation memory from O(nL) to O(n), where L is the number of layers.
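The bucketing idea can be illustrated with the simplest LSH family, random-hyperplane (SimHash) hashing. Reformer's actual scheme differs in details (it hashes queries and keys jointly, using random rotations), so this is only a sketch:

```python
import numpy as np

def lsh_bucket(X, planes):
    """SimHash-style LSH: the sign pattern of projections onto random
    hyperplanes packs into an integer bucket id. Vectors with high cosine
    similarity collide (share a bucket) with high probability."""
    bits = (X @ planes.T > 0).astype(int)
    return bits @ (1 << np.arange(planes.shape[0]))

rng = np.random.default_rng(0)
planes = rng.normal(size=(4, 16))              # 4 hyperplanes -> 16 buckets
key = rng.normal(size=(1, 16))
near = key + 1e-4 * rng.normal(size=(1, 16))   # a near-duplicate key
far = rng.normal(size=(1, 16))                 # an unrelated key
ids = lsh_bucket(np.vstack([key, near, far]), planes)
# With high probability ids[0] == ids[1], while `far` often lands elsewhere;
# LSH attention restricts each query to keys in its own bucket.
```

Each query then computes dot products only against the handful of keys sharing its bucket, which is where the O(n^2) to O(n log n) reduction comes from.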
Random Feature Attention (RFA) (Peng et al., 2021) uses random features to approximate causal attention, enabling linear-time autoregressive inference. Unlike Performer (which uses a fixed random feature map), RFA learns the random feature distribution, improving approximation quality for the specific attention patterns that arise in practice. Linear Attention (Katharopoulos et al., 2020) provides the theoretical foundation for these approaches, showing that attention can be decomposed as phi(Q) * (phi(K)^T * V) for any feature map phi, enabling O(n) computation when the feature dimension is small. This decomposition is the bridge between the random features literature and efficient attention: choosing phi to be random Fourier features recovers an approximation to softmax attention.
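The O(n) claim is simply associativity of matrix multiplication. A sketch using the elu+1 feature map of Katharopoulos et al., with the normalization term omitted for brevity:

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, the positive feature map used by Katharopoulos et al. (2020)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
phi_q, phi_k = feature_map(Q), feature_map(K)

# Quadratic order of operations: (phi(Q) phi(K)^T) V -- costs O(n^2 d)
quadratic = (phi_q @ phi_k.T) @ V
# Linear order of operations: phi(Q) (phi(K)^T V) -- costs O(n d^2)
linear = phi_q @ (phi_k.T @ V)

print(np.allclose(quadratic, linear))  # True: matmul is associative
```

The d x d matrix `phi(K)^T V` acts as a running summary of all key-value pairs, which is also what makes constant-memory autoregressive inference possible: it can be updated one token at a time.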
These methods represent a direct application of the randomized algorithms toolkit (random projections, random features, LSH) to the central computational bottleneck of modern AI (attention), demonstrating the practical value of the theoretical foundations covered in this chapter.
References
- Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, et al. (2021). Rethinking Attention with Performers. ICLR.
- Albert Gu, Karan Goel, Christopher Re (2022). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR.
- Albert Gu, Tri Dao (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv.
- Simon Haykin (2002). Adaptive Filter Theory. Prentice Hall.
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, Francois Fleuret (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML.
- Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya (2020). Reformer: The Efficient Transformer. ICLR.
- James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL.
- Hanxiao Liu, Zihang Dai, David So, Quoc V. Le (2021). Pay Attention to MLPs. NeurIPS.
- Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong (2021). Random Feature Attention. ICLR.
- Michael Poli, Stefano Massaroli, Eric Nguyen, et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML.
- Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing.
- Yi Tay, Mostafa Dehghani, Samira Abnar, et al. (2021). Long Range Arena: A Benchmark for Efficient Transformers. ICLR.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). Attention Is All You Need. NeurIPS.
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ACL.