
Taxonomy of Approaches


Efficient architecture methods span a broad spectrum of techniques, from fundamental architectural innovations to engineering optimizations. We organize them into the following families, which are not mutually exclusive -- many state-of-the-art systems combine techniques from multiple families:

  1. Efficient attention mechanisms: Reducing the O(n^2) cost of self-attention through sparse patterns (BigBird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020)), linear approximations (Performer (Choromanski et al., 2021), Linear Attention (Katharopoulos et al., 2020)), IO-aware implementations (FlashAttention (Dao et al., 2022; Dao, 2023)), and architectural modifications (Multi-Query Attention (Shazeer, 2019), Grouped-Query Attention (Ainslie et al., 2023), Differential Attention (Ye et al., 2024)). These methods preserve the Transformer architecture while making its core operation more efficient.

  2. State space models (SSMs): Fundamentally different sequence models based on continuous-time dynamical systems (S4 (Gu et al., 2022), Mamba (Gu & Dao, 2024), Mamba-2 (Dao & Gu, 2024)) that achieve linear complexity in sequence length. SSMs represent the most radical departure from the Transformer paradigm, offering both computational and memory advantages for long sequences. Hybrid architectures that combine SSMs with attention (Jamba (Lieber et al., 2024), Zamba (Glorioso et al., 2024)) are emerging as a practical synthesis.

  3. Sub-quadratic architectures: Other alternatives to standard attention, including data-controlled convolutions (Hyena (Poli et al., 2023)), linear RNN variants (RWKV (Peng et al., 2023), xLSTM (Beck et al., 2024), Griffin (De et al., 2024)), retention mechanisms (RetNet (Sun et al., 2023)), and hybrid designs (BASED (Arora et al., 2024)). These architectures explore the design space between pure attention and pure recurrence, often achieving competitive perplexity on language modeling while enabling O(n) inference.

  4. Mixture of Experts (MoE): Conditional computation models that activate only a subset of parameters for each input, enabling much larger total model capacity without proportional compute increases (Switch Transformer (Fedus et al., 2022), Mixtral (Jiang et al., 2024), DeepSeek-MoE (Dai et al., 2024), GShard (Lepikhin et al., 2021)). MoE represents the principle of conditional computation -- allocating compute where it is needed.

  5. Parameter-efficient fine-tuning (PEFT): Methods that adapt large pre-trained models by modifying only a small fraction of parameters (LoRA (Hu et al., 2022), QLoRA (Dettmers et al., 2023), adapters (Houlsby et al., 2019), prompt tuning (Lester et al., 2021), DoRA (Liu et al., 2024), GaLore (Zhao et al., 2024)). PEFT reduces the cost of customization and enables new deployment patterns (multi-tenant serving with S-LoRA (Sheng et al., 2024), continual learning).

  6. Neural Architecture Search (NAS): Automated discovery of efficient architectures through search algorithms (DARTS (Liu et al., 2019), EfficientNet (Tan & Le, 2019), Once-for-All (Cai et al., 2020), MnasNet (Tan et al., 2019)). NAS replaces human design intuition with systematic optimization, potentially discovering architectures that human engineers would not consider.

  7. Model compression: Post-training techniques that reduce model size and compute cost, including knowledge distillation (Hinton et al., 2015) (DistilBERT (Sanh et al., 2019)), pruning (SparseGPT (Frantar & Alistarh, 2023), Wanda (Sun et al., 2024), Lottery Ticket Hypothesis (Frankle & Carbin, 2019)), and quantization (GPTQ (Frantar et al., 2023), AWQ (Lin et al., 2024), LLM.int8() (Dettmers et al., 2022), BitNet b1.58 (Ma et al., 2024)). Compression techniques are particularly important for deployment on resource-constrained hardware.

  8. Efficient training: Techniques for reducing the cost of training, including mixed-precision arithmetic (Micikevicius et al., 2018), gradient checkpointing (Chen et al., 2016), distributed training strategies (ZeRO (Rajbhandari et al., 2020), FSDP (Zhao et al., 2023)), tensor parallelism (Megatron-LM (Shoeybi et al., 2020)), and pipeline parallelism (GPipe (Huang et al., 2019)). These techniques do not change the model architecture but change how it is trained.

  9. Inference optimization: Techniques for faster and cheaper model serving, including speculative decoding (Leviathan et al., 2023; Chen et al., 2023), KV-cache management (PagedAttention (Kwon et al., 2023)), continuous batching, and serving system design (vLLM (Kwon et al., 2023), SGLang (Zheng et al., 2024)). These techniques are critical for production deployment.
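The attention-head trade-off in family 1 can be made concrete with a back-of-the-envelope calculation. The sketch below (hypothetical function name and model dimensions, fp16 cache) shows why Grouped-Query and Multi-Query Attention shrink the KV cache relative to standard multi-head attention:

```python
# Back-of-the-envelope KV-cache sizes for MHA vs GQA vs MQA.
# Function name and model dimensions are illustrative, not from any library.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, each [seq_len, n_kv_heads, head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(32, 32, 128, 4096)  # MHA: one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, 4096)   # GQA: 8 KV heads, each shared by 4 query heads
mqa = kv_cache_bytes(32, 1, 128, 4096)   # MQA: a single KV head shared by all

print(mha // gqa, mha // mqa)  # 4 32
```

The attention computation itself is unchanged in cost; what shrinks is the per-token memory traffic and cache footprint, which dominates decoding throughput at long context.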
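The reordering behind linear attention (also family 1) is just matrix associativity: with a feature map phi, (phi(Q) phi(K)^T) V can be regrouped as phi(Q) (phi(K)^T V), replacing the n x n intermediate with a d x d one. A pure-Python sketch, with phi taken as the identity purely to show the regrouping (real linear attention uses a positive feature map such as elu(x) + 1, plus normalization):

```python
# Linear-attention regrouping: (QK^T)V == Q(K^T V) by associativity.
# Tiny pure-Python matmul; all matrices are illustrative.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

Q = [[1.0, 2.0], [0.5, 1.0], [2.0, 0.0]]  # n=3 queries, d=2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0], [2.0], [3.0]]

left = matmul(matmul(Q, transpose(K)), V)   # (QK^T)V: builds an n x n matrix
right = matmul(Q, matmul(transpose(K), V))  # Q(K^T V): builds only a d x d matrix

print(left == right)  # True: same output, O(n d^2) instead of O(n^2 d)
```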
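The state-space recurrence underlying family 2 can be sketched in a few lines. This is a scalar, time-invariant toy with illustrative constants; S4 uses structured state matrices and Mamba makes the parameters input-dependent, but the O(n)-time, O(1)-state shape of inference is the same:

```python
# Minimal discrete-time state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
# Scalar toy with illustrative constants (real SSMs use matrices/vectors here).

def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    h, ys = 0.0, []
    for x in xs:           # one pass over the sequence: O(n) time, O(1) state
        h = a * h + b * x  # state update (input-dependent a, b in Mamba)
        ys.append(c * h)   # readout
    return ys

print(ssm_scan([1.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25]: exponentially decaying memory
```

During training these recurrences can also be evaluated as a convolution or a parallel scan, which is what makes SSMs competitive with attention on accelerators.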
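Family 4's conditional computation can be illustrated with a toy top-2 router: one token, a softmax gate over expert logits, and a weighted sum of only the selected experts' outputs. Everything here (experts, logits, values) is illustrative; real MoE layers add load-balancing losses and batched token dispatch:

```python
import math

# Toy top-k MoE routing: only k of the experts run for this input.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_logits, k=2):
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    renorm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    # Conditional computation: the other len(experts) - k experts never execute.
    return sum(probs[i] / renorm * experts[i](x) for i in top)

experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
y = moe_forward(3.0, experts, gate_logits=[2.0, 1.0, -1.0, 0.0], k=2)
print(round(y, 3))
```

With k fixed, compute per token stays constant while total parameter count grows with the number of experts, which is exactly the capacity-without-compute trade the text describes.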
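Family 5's core trick, as in LoRA, is a low-rank additive update to a frozen weight: instead of training W (d_in x d_out), train B (d_in x r) and A (r x d_out) so the adapted layer computes y = xW + (xB)A. A pure-Python sketch with illustrative shapes and values (in practice B or A is zero-initialized so training starts from the pre-trained model):

```python
# LoRA-style low-rank adaptation: y = xW + (xB)A with W frozen.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

d_in, d_out, r = 4, 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d_out)] for i in range(d_in)]  # frozen
B = [[1.0], [0.0], [0.0], [0.0]]  # trainable, d_in x r
A = [[0.0, 0.1, 0.0, 0.0]]        # trainable, r x d_out

x = [[1.0, 2.0, 3.0, 4.0]]
base = matmul(x, W)
delta = matmul(matmul(x, B), A)   # rank-r correction
y = [[b + d for b, d in zip(base[0], delta[0])]]

full = d_in * d_out               # 16 params to fine-tune the weight directly
lora = r * (d_in + d_out)         # 8 trainable params via the low-rank pair
print(y, full, lora)
```

At realistic sizes (d in the thousands, r in the tens) the trainable fraction drops to well under a percent, and the update BA can be merged into W after training at no inference cost.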
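A minimal sketch of the basic operation underlying weight quantization in family 7: a symmetric per-tensor int8 round-trip. Real methods such as GPTQ and AWQ add calibration data and error compensation on top of this; the weights and helper below are illustrative:

```python
# Symmetric int8 quantization round-trip (per-tensor scale, no calibration).

def quantize(ws):
    scale = max(abs(w) for w in ws) / 127.0  # map the largest weight to +/-127
    q = [max(-127, min(127, round(w / scale))) for w in ws]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.4, -1.27, 0.03, 0.9]
q, s = quantize(w)
w_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, err < s)  # each weight now fits in one byte; error bounded by the scale
```

This is why quantization trades precision for speed and memory: storage drops 4x versus fp32, while the reconstruction error per weight is at most about half the scale.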
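Family 9's speculative decoding can be sketched with greedy toy models, where verification reduces to exact match; the actual algorithm of Leviathan et al. uses rejection sampling so the output distribution matches the target model exactly. Both "models" below are hypothetical stand-ins for a cheap draft model and an expensive target model:

```python
# Simplified speculative decoding: draft k tokens cheaply, verify with the target.

def draft_model(prefix):   # cheap model: guesses the next token
    return (prefix[-1] + 1) % 10

def target_model(prefix):  # expensive model: disagrees after a 3
    return (prefix[-1] + 1) % 10 if prefix[-1] != 3 else 7

def speculative_step(prefix, k=4):
    # 1) draft k tokens autoregressively with the cheap model
    draft = list(prefix)
    for _ in range(k):
        draft.append(draft_model(draft))
    proposed = draft[len(prefix):]
    # 2) verify them with the target model (one batched forward pass in practice)
    accepted = []
    for tok in proposed:
        t = target_model(prefix + accepted)
        if t == tok:
            accepted.append(tok)  # draft agreed: a target-quality token for free
        else:
            accepted.append(t)    # first disagreement: take the target's token, stop
            break
    return prefix + accepted

print(speculative_step([1]))  # draft proposes 2,3,4,5; target accepts 2,3 then corrects 4 -> 7
```

When the draft model agrees often, each expensive verification pass yields several tokens instead of one, which is where the speedup comes from.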

These families are connected by several cross-cutting principles:

  • Low-rank structure: Many efficiency gains exploit the fact that the computations and representations in neural networks have low effective rank. LoRA exploits low-rank weight updates; linear attention exploits low-rank attention matrices; pruning exploits sparsity, a related form of structural redundancy.
  • Conditional computation: Rather than applying all computation to all inputs, conditional computation allocates compute based on input difficulty or content. MoE is the most explicit form, but early exit, mixture-of-depths, and adaptive precision are related ideas.
  • Hardware-algorithm co-design: The most practical efficiency gains often come from designing algorithms that are aware of hardware constraints (memory hierarchy, parallelism, data movement). FlashAttention is the paradigmatic example.
  • Trading one resource for another: Many techniques trade off between different efficiency dimensions. Gradient checkpointing trades compute for memory; MoE trades parameters for compute; quantization trades precision for speed and memory.
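The resource trade in the last bullet can be quantified with a toy accounting of gradient checkpointing: keep one activation every `stride` layers and recompute the rest during the backward pass. The function and counts are illustrative; real implementations checkpoint at module boundaries and recompute within each segment:

```python
import math

# Toy accounting for gradient checkpointing: memory (stored activations)
# traded against compute (forward passes redone during backward).

def checkpoint_costs(n_layers, stride):
    stored = math.ceil(n_layers / stride)  # activations kept in memory
    recomputed = n_layers - stored         # layer forwards redone in backward
    return stored, recomputed

print(checkpoint_costs(48, 1), checkpoint_costs(48, 8))  # (48, 0) (6, 42)
```

Choosing stride near sqrt(n_layers) gives the classic result: memory drops from O(n) to O(sqrt(n)) at the cost of roughly one extra forward pass.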

References