Sub-Quadratic Architectures
The quest for sub-quadratic alternatives to the Transformer has produced a diverse family of architectures that trade the full pairwise attention computation for more efficient sequence mixing mechanisms. While state space models (Section 3.5) represent one major branch of this effort, several other approaches have demonstrated competitive performance with fundamentally different architectural principles.
Hyena
Poli et al. (2023) introduced Hyena, which replaces attention with a hierarchy of data-controlled long convolutions and element-wise gating. The key insight is that attention's effectiveness comes from two properties: (1) global receptive field (every token can attend to every other token) and (2) data-dependence (the mixing pattern depends on the input). Standard convolutions have property (1) if the kernel is long enough but lack property (2) -- the same filter is applied regardless of input. Hyena achieves both properties by using data-dependent gating to modulate the output of long convolutions, effectively creating input-dependent filters without the quadratic cost of computing all pairwise interactions. Hyena achieves sub-quadratic O(n log n) complexity (via FFT-based convolution) while maintaining the in-context learning capability that pure convolutional models lack. On language modeling benchmarks, Hyena matches attention-based models at sequence lengths up to 8K tokens while being significantly faster for longer sequences.
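The two ingredients -- an O(n log n) causal convolution computed via FFT, and element-wise gates that make the operator data-dependent -- can be illustrated with a toy NumPy sketch. This is a simplified order-2 Hyena-style operator, not the paper's implementation; the function and parameter names (`fft_long_conv`, `Wv`, `Wx1`, `Wx2`) are illustrative, and the implicit filter parameterization of real Hyena is omitted (kernels are passed in directly).

```python
import numpy as np

def fft_long_conv(u, k):
    """Causal long convolution via FFT: O(n log n) instead of O(n^2).

    Zero-padding to 2n makes the circular convolution computed by the
    FFT equal to the linear (causal) convolution on the first n outputs.
    """
    n = u.shape[0]
    fft_size = 2 * n
    y = np.fft.irfft(np.fft.rfft(u, fft_size) * np.fft.rfft(k, fft_size), fft_size)
    return y[:n]

def hyena_order2_sketch(u, k1, k2, Wv, Wx1, Wx2):
    """Order-2 Hyena-style operator (illustrative): project the input,
    then alternate long convolutions with element-wise, data-dependent
    gating.  u: (n, d) sequence; k1, k2: (n, d) per-channel kernels."""
    v = u @ Wv            # value projection
    x1 = u @ Wx1          # gate projections -- these depend on the input,
    x2 = u @ Wx2          # which is what makes the overall filter data-controlled
    # Convolve each channel along the sequence axis, then gate element-wise.
    h = np.stack([fft_long_conv(v[:, c], k1[:, c]) for c in range(v.shape[1])], axis=1)
    h = x1 * h
    h = np.stack([fft_long_conv(h[:, c], k2[:, c]) for c in range(h.shape[1])], axis=1)
    return x2 * h
```

Note that without the gates `x1`, `x2`, this would collapse to a fixed linear filter; the multiplicative interactions are what recover attention-like input dependence.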
RWKV
Peng et al. (2023) proposed RWKV (Receptance Weighted Key Value), which combines the parallelizable training of Transformers with the efficient O(1)-per-token inference of RNNs. RWKV uses a linear attention variant called WKV (weighted key-value) that can be computed either as a parallel scan (during training, leveraging GPU parallelism) or as a recurrence (during inference, for constant-time per-token generation). The WKV mechanism uses exponential decay to weight past tokens, providing a learnable forgetting mechanism similar to gated RNNs but formulated to allow parallel computation. RWKV models up to 14B parameters have been trained and released as open-source, demonstrating that linear-complexity architectures can scale to sizes competitive with Transformer-based models like LLaMA (Touvron et al., 2023). RWKV-v5 and v6 introduced data-dependent time mixing that further closes the gap with attention-based models.
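The recurrent form of WKV can be sketched in a few lines: the state is just a pair of running (numerator, denominator) accumulators per channel, decayed by a learnable rate `w`, with a separate bonus `u` for the current token. This follows the RWKV-v4-style formulation; it omits the numerical stabilization (tracking a running maximum exponent) used in real implementations, so it is a conceptual sketch rather than production code.

```python
import numpy as np

def wkv_recurrent(k, v, w, u):
    """RWKV-v4-style WKV in its recurrent form: O(1) state per channel.

    k, v: (n, d) key/value sequences; w: (d,) positive decay rates;
    u: (d,) bonus applied only to the current token.
    Unstabilized sketch -- real implementations track a running max exponent.
    """
    n, d = k.shape
    num = np.zeros(d)      # running sum of exp(k_i) * v_i, exponentially decayed
    den = np.zeros(d)      # running sum of exp(k_i), same decay
    out = np.zeros((n, d))
    for t in range(n):
        # Output mixes the decayed past with a bonus-weighted current token.
        cur = np.exp(u + k[t])
        out[t] = (num + cur * v[t]) / (den + cur)
        # Decay the state, then absorb the current token (without the bonus).
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return out
```

The same quantity can be computed in parallel over the sequence during training, which is exactly the dual-form property the paragraph describes.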
RetNet
Sun et al. (2023) proposed RetNet (Retentive Network), which introduces a dual form of computation called multi-scale retention. Retention can be expressed in three equivalent forms: (1) a parallel representation for training (analogous to attention), (2) a recurrent representation for efficient inference (O(1) per token), and (3) a chunk-wise representation that balances parallelism and memory. The multi-scale aspect uses different decay rates across heads, allowing different heads to attend to different temporal scales -- some heads focus on local context while others maintain longer-range dependencies. RetNet achieves competitive performance with Transformers on language modeling while offering O(n) training complexity and O(1) per-token inference, making it particularly attractive for deployment scenarios that prioritize inference efficiency.
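The equivalence of the parallel and recurrent forms is the heart of retention, and it is easy to verify numerically. The sketch below implements both for a single head with decay rate gamma (it drops the rotation/xPos-style position encoding and normalization of the actual paper, keeping only the decay-weighted inner products):

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form (training): O(n^2), like attention but with a
    causal decay mask D[t, i] = gamma^(t - i) in place of softmax."""
    n = Q.shape[0]
    idx = np.arange(n)
    D = np.tril(gamma ** (idx[:, None] - idx[None, :]).astype(float))
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form (inference): O(1) per token.  The state S_t is a
    d x d_v matrix: S_t = gamma * S_{t-1} + k_t^T v_t, read out as q_t S_t."""
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))
    out = np.zeros((Q.shape[0], dv))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out
```

Unrolling the recurrence gives S_t = sum_i gamma^(t-i) k_i^T v_i, so q_t S_t reproduces exactly the decay-masked parallel computation; the chunk-wise form interpolates between the two by carrying S across chunks.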
BASED
Arora et al. (2024) proposed BASED, which combines linear attention with sliding-window attention in a hybrid design. The key insight is that linear attention handles global context efficiently (O(n) complexity for summarizing the full sequence into a fixed-size state) while local sliding-window attention captures fine-grained local patterns (O(n * w) for window size w). By allocating a small number of attention heads to local sliding-window attention and the majority to global linear attention, BASED achieves strong language modeling performance with sub-quadratic overall complexity. This hybrid approach reflects a general principle: the information processing requirements of language are heterogeneous, with some patterns requiring precise local matching (syntax, coreference) and others requiring broad contextual awareness (topic, discourse).
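The two component mechanisms can be sketched side by side. Note the different cost profiles: the linear-attention state is a fixed-size d x d_v matrix regardless of sequence length, while the sliding window does exact softmax attention over only the last w tokens. This sketch uses a generic softplus feature map for illustration; BASED itself uses a Taylor approximation of the exponential, which is not reproduced here.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Causal linear attention: O(1)-size state per step (a d x d_v matrix
    plus a d-vector), O(n) total in sequence length.  Feature map phi is
    an illustrative positive map (softplus), not BASED's Taylor map."""
    phi = lambda x: np.log1p(np.exp(x))
    n, dv = V.shape
    S = np.zeros((Q.shape[1], dv))   # running sum of phi(k_i) v_i^T
    z = np.zeros(Q.shape[1])         # running sum of phi(k_i), for normalization
    out = np.zeros((n, dv))
    for t in range(n):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        out[t] = phi(Q[t]) @ S / (phi(Q[t]) @ z + 1e-8)
    return out

def sliding_window_attention(Q, K, V, w):
    """Exact causal softmax attention restricted to the last w tokens: O(n*w)."""
    n, dv = V.shape
    out = np.zeros((n, dv))
    for t in range(n):
        s = max(0, t - w + 1)
        scores = K[s:t + 1] @ Q[t] / np.sqrt(Q.shape[1])
        p = np.exp(scores - scores.max())
        p /= p.sum()
        out[t] = p @ V[s:t + 1]
    return out
```

A BASED-style layer would run a few heads through the second function with small w and the rest through the first, concatenating the outputs -- cheap global summarization plus precise local matching.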
Griffin and RecurrentGemma
De et al. (2024) proposed Griffin, a hybrid architecture that combines gated linear recurrences with local attention. Griffin uses a recurrent block based on the Real-Gated Linear Recurrent Unit (RG-LRU) for global context aggregation and local multi-query attention for fine-grained token interactions. Google released RecurrentGemma models based on the Griffin architecture, demonstrating that hybrid recurrent-attention architectures can achieve comparable performance to pure Transformer models (Gemma) at the same scale while offering significantly faster inference for long sequences.
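The RG-LRU is a diagonal linear recurrence whose decay is gated by the input, which is what lets it selectively retain or forget information. The sketch below follows the structure described by De et al. (2024) -- a recurrence gate, an input gate, and an input-dependent per-channel decay a_t = a^(c * r_t) with a = sigmoid(Lambda) -- but the weight names (`Wr`, `Wi`) and the omission of biases and the surrounding block structure are simplifications of ours.

```python
import numpy as np

def rg_lru(x, Wr, Wi, Lambda, c=8.0):
    """Real-Gated Linear Recurrent Unit (simplified sketch).

    x: (n, d) input sequence; Wr, Wi: (d, d) gate weights;
    Lambda: (d,) learnable log-decay parameters.
    h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t),
    where a_t in (0, 1) depends on the input via the recurrence gate.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    n, d = x.shape
    h = np.zeros(d)
    out = np.zeros((n, d))
    for t in range(n):
        r = sigmoid(x[t] @ Wr)            # recurrence gate in (0, 1)
        i = sigmoid(x[t] @ Wi)            # input gate in (0, 1)
        a = sigmoid(Lambda) ** (c * r)    # data-dependent decay per channel
        # sqrt(1 - a^2) scales the input so the state stays bounded
        # regardless of how slowly a given channel decays.
        h = a * h + np.sqrt(1.0 - a ** 2) * (i * x[t])
        out[t] = h
    return out
```

Because the recurrence is diagonal (element-wise), it admits efficient parallel-scan training, while inference is a plain O(1)-per-token update -- the same training/inference duality seen in RWKV and RetNet.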
xLSTM
Beck et al. (2024) revisited the LSTM architecture with xLSTM (Extended LSTM), introducing exponential gating and a novel memory mixing mechanism that addresses the classical LSTM's limited memory capacity. The xLSTM family includes two variants: sLSTM (scalar LSTM with exponential gating and new memory mixing) and mLSTM (matrix LSTM with a matrix-valued memory state and covariance update rule). The mLSTM variant is particularly notable: by replacing the vector-valued cell state with a matrix, it bridges the gap between recurrent models and attention -- the matrix memory can be seen as a compressed version of the key-value cache in attention. xLSTM demonstrates that classical recurrent architectures, when modernized with insights from Transformers and SSMs, can be competitive with both on language modeling benchmarks up to 1.3B parameters.
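The covariance update rule that gives mLSTM its matrix memory can be sketched directly. In this simplified version the input and forget gates are supplied as scalars per step rather than computed from the input with exponential gating and stabilization as in the actual paper; the normalizer state `norm` follows the paper's idea of bounding the readout.

```python
import numpy as np

def mlstm_sketch(q, k, v, i_gate, f_gate):
    """mLSTM-style matrix memory (simplified sketch).

    q, k, v: (n, d) sequences; i_gate, f_gate: (n,) scalar gates per step.
    Covariance update: C_t = f_t * C_{t-1} + i_t * v_t k_t^T,
    read out as C_t q_t, normalized by a gated running sum of keys.
    """
    n_steps, d = q.shape
    C = np.zeros((d, d))       # matrix memory: compressed key-value store
    norm = np.zeros(d)         # normalizer state, updated with the same gates
    out = np.zeros((n_steps, d))
    for t in range(n_steps):
        C = f_gate[t] * C + i_gate[t] * np.outer(v[t], k[t])
        norm = f_gate[t] * norm + i_gate[t] * k[t]
        # Lower-bound the normalizer so the readout stays stable.
        out[t] = (C @ q[t]) / max(abs(norm @ q[t]), 1.0)
    return out
```

With f_t = 1 and i_t = 1, C_t is just the running sum of v_i k_i^T -- i.e. unnormalized linear attention -- which makes concrete the claim that the matrix memory is a compressed key-value cache; the gates add the LSTM-style ability to forget.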
The Convergent Design Landscape
A striking pattern emerges across these architectures: despite their different origins (convolutions, RNNs, linear attention, gated recurrences), they are converging toward similar design principles. All successful sub-quadratic architectures incorporate: (1) data-dependent gating or selection mechanisms (Mamba's selective scan, Hyena's data-controlled convolutions, xLSTM's exponential gating), (2) some form of multi-scale processing (different heads or channels operating at different temporal resolutions), and (3) hardware-aware implementation strategies (parallel scan algorithms, FlashAttention-inspired memory management). The most successful recent models are hybrids that combine sub-quadratic global processing with local attention (BASED, Griffin, Jamba (Lieber et al., 2024)), suggesting that a pure replacement for attention may be less important than finding the right combination of complementary mechanisms.
References
- Simran Arora, Sabri Eyuboglu, Michael Zhang, et al. (2024). Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff. ICML.
- Maximilian Beck, Korbinian Poppel, Markus Spanring, et al. (2024). xLSTM: Extended Long Short-Term Memory. NeurIPS.
- Soham De, Samuel L. Smith, Anushan Fernando, et al. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv.
- Opher Lieber, Barak Lenz, Hofit Bata, et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv.
- Bo Peng, Eric Alcaide, Quentin Anthony, et al. (2023). RWKV: Reinventing RNNs for the Transformer Era. EMNLP Findings.
- Michael Poli, Stefano Massaroli, Eric Nguyen, et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML.
- Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei (2023). Retentive Network: A Successor to Transformer for Large Language Models. arXiv.
- Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv.