Taxonomy of Approaches

Efficient architecture methods span a broad spectrum of techniques, from fundamental architectural innovations to engineering optimizations. We organize them along two distinct axes. The first axis classifies methods by model architecture: the design of the sequence-mixing and parameter-allocation primitives themselves (families 1 through 4). The second axis classifies methods by their position in the model lifecycle: how a model is adapted, designed, compressed, trained, or served, largely independent of the underlying architecture (families 5 through 9). These families are not mutually exclusive, and many state-of-the-art systems combine techniques from both axes.

Axis 1: Architectural families

  1. Efficient attention mechanisms: Methods that reduce the O(n^2) cost of self-attention through sparse patterns (BigBird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020)), linear approximations (Performer (Choromanski et al., 2021), Linear Attention (Katharopoulos et al., 2020)), IO-aware implementations (FlashAttention (Dao et al., 2022), FlashAttention-2 (Dao, 2023)), and architectural modifications (Multi-Query Attention (Shazeer, 2019), Grouped-Query Attention (Ainslie et al., 2023), Differential Attention (Ye et al., 2024)). These methods preserve the Transformer architecture while making its core operation more efficient.

  2. State space models (SSMs): Sequence models derived from a continuous-time state-space parameterization, discretized into a structured linear recurrence (S4 (Gu et al., 2022), Mamba (Gu & Dao, 2024), Mamba-2 (Dao & Gu, 2024)) that achieves linear complexity in sequence length. We use the continuous-time state-space derivation as the demarcation criterion that separates this family from family 3: a model belongs here when its recurrence is obtained by discretizing a continuous dynamical system. SSMs represent the most radical departure from the Transformer paradigm, offering both computational and memory advantages for long sequences. Hybrid architectures that combine SSMs with attention (Jamba (Lieber et al., 2024), Zamba (Glorioso et al., 2024)) are emerging as a practical synthesis.

  3. Sub-quadratic architectures: Other alternatives to standard attention whose efficient form does not arise from a continuous-time state-space derivation, instead built from discrete gated recurrence or long convolution. This family includes data-controlled convolutions (Hyena (Poli et al., 2023)), gated linear RNN variants (RWKV (Peng et al., 2023), xLSTM (Beck et al., 2024), Griffin (De et al., 2024)), retention mechanisms (RetNet (Sun et al., 2023)), and hybrid designs (BASED (Arora et al., 2024)). These architectures explore the design space between pure attention and pure recurrence, often achieving competitive perplexity on language modeling while enabling O(n) inference. However, recurrent and sub-quadratic models can struggle with tasks requiring exact copying or in-context retrieval, where standard attention excels. The boundary with family 2 is one of derivation rather than empirical behavior, and several of these models are sometimes described as linear-attention or SSM-adjacent in the literature.

  4. Mixture of Experts (MoE): Models that activate only a subset of parameters for each input, enabling much larger total model capacity without a proportional increase in compute (Switch Transformer (Fedus et al., 2022), Mixtral (Jiang et al., 2024), DeepSeekMoE (Dai et al., 2024), GShard (Lepikhin et al., 2021)). MoE is the most explicit form of conditional computation: compute is allocated where it is needed.
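As a rough illustration of the conditional computation in family 4, the sketch below routes a single token vector through its top-k experts. The experts here are hypothetical stand-ins (simple scalings rather than feed-forward networks), the router is a plain linear map, and the weights are normalized over the selected experts only; other systems normalize before selection, so treat this as one possible variant rather than the method of any cited paper.

```python
import math

def top_k_route(logits, k):
    """Pick the k largest router logits and softmax-normalize over them only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def make_expert(scale):
    """Hypothetical stand-in for an expert feed-forward network."""
    return lambda x: [scale * v for v in x]

def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector x through its top-k experts and mix the outputs."""
    # One router logit per expert: a dot product between a router row and x.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routes

# Illustrative values only: four experts, a two-dimensional token.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
token = [1.0, 2.0]

output, routes = moe_forward(token, router_weights, experts, k=2)
print("selected experts and weights:", routes)
print("output:", output)
```

With k = 2 of 4 experts, only half of the expert parameters are touched for this token, even though all four contribute to total capacity.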

Axis 2: Lifecycle and deployment optimizations

  5. Parameter-efficient fine-tuning (PEFT): Methods that adapt large pre-trained models by modifying only a small fraction of parameters (LoRA (Hu et al., 2022), QLoRA (Dettmers et al., 2023), adapters (Houlsby et al., 2019), prompt tuning (Lester et al., 2021), DoRA (Liu et al., 2024), GaLore (Zhao et al., 2024)). PEFT reduces the cost of customization and enables new deployment patterns (multi-tenant serving with S-LoRA (Sheng et al., 2024), continual learning).

  6. Neural Architecture Search (NAS): Automated discovery of efficient architectures through search algorithms (DARTS (Liu et al., 2019), EfficientNet (Tan & Le, 2019), Once-for-All (Cai et al., 2020), MnasNet (Tan et al., 2019)). NAS replaces human design intuition with systematic optimization, potentially discovering architectures that human engineers would not consider.

  7. Model compression: Techniques that reduce model size and compute cost, most often applied after training, including knowledge distillation (Hinton et al., 2015; e.g., DistilBERT (Sanh et al., 2019)), pruning (SparseGPT (Frantar & Alistarh, 2023), Wanda (Sun et al., 2024), Lottery Ticket Hypothesis (Frankle & Carbin, 2019)), and quantization (GPTQ (Frantar et al., 2023), AWQ (Lin et al., 2024), LLM.int8() (Dettmers et al., 2022), BitNet b1.58 (Ma et al., 2024)). Compression techniques are particularly important for deployment on resource-constrained hardware.

  8. Efficient training: Techniques for reducing the cost of training, including mixed-precision arithmetic (Micikevicius et al., 2018), gradient checkpointing (Chen et al., 2016), distributed training strategies (ZeRO (Rajbhandari et al., 2020), FSDP (Zhao et al., 2023)), tensor parallelism (Megatron-LM (Shoeybi et al., 2020)), and pipeline parallelism (GPipe (Huang et al., 2019)). These techniques do not change the model architecture but change how it is trained.

  9. Inference optimization: Techniques for faster and cheaper model serving, including speculative decoding (Leviathan et al., 2023; Chen et al., 2023), KV-cache management (PagedAttention (Kwon et al., 2023)), continuous batching (Orca (Yu et al., 2022)), and serving system design (vLLM (Kwon et al., 2023), SGLang (Zheng et al., 2024)). These techniques are critical for production deployment.
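To make the "small fraction of parameters" claim for PEFT concrete, the following sketch counts trainable parameters for a LoRA-style low-rank update and applies it in a forward pass. The layer sizes, rank, and scaling value are arbitrary choices for illustration, and the helper names are hypothetical; the zero initialization of B and the alpha / r scaling follow the usual description of LoRA.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # B is d_out x r, A is r x d_in
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Illustrative sizes only (assumptions, not taken from any particular model).
d_in, d_out, r = 64, 48, 4
rng = random.Random(0)

W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]     # trainable
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

full_params = d_in * d_out        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by the low-rank adapter

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha=8.0)

print("full fine-tuning parameters:", full_params)
print("LoRA trainable parameters:", lora_params)
print("fraction trained:", lora_params / full_params)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning begins from the pre-trained behavior.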

The table below summarizes the nine families: the axis each belongs to, the primary resource it targets, the central tradeoff it makes, and a representative method.

| Family | Axis | Primary target | Central tradeoff | Representative |
| --- | --- | --- | --- | --- |
| Efficient attention | Architecture | Attention compute/memory (O(n^2)) | Exactness or generality vs. speed | FlashAttention |
| State space models | Architecture | Sequence-length scaling | Architectural novelty vs. ecosystem maturity | Mamba |
| Sub-quadratic architectures | Architecture | Inference cost (O(n)) | Exact retrieval/copying vs. cheaper recurrence | RWKV |
| Mixture of Experts | Architecture | Active compute per token | Total parameters vs. active compute | Mixtral |
| PEFT | Lifecycle | Trainable parameters | Adaptation cost vs. full-model expressiveness | LoRA |
| Neural Architecture Search | Lifecycle | Design effort | Search cost vs. discovered efficiency | DARTS |
| Model compression | Lifecycle | Model size and precision | Accuracy vs. footprint | GPTQ |
| Efficient training | Lifecycle | Training time and memory | Memory or communication vs. compute | ZeRO |
| Inference optimization | Lifecycle | Serving latency and throughput | Memory/scheduling complexity vs. throughput | PagedAttention |
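The gap between O(n^2) and O(n) in sequence length can be seen in a small example. The sketch below is a non-causal, kernelized (linear) attention in the spirit of Katharopoulos et al. (2020), using the feature map elu(x) + 1; it is not softmax attention, and the function names and sizes are assumptions made for illustration. Both functions compute the same output, but the second never forms the n x n score matrix.

```python
import math
import random

def phi(x):
    """Feature map elu(x) + 1, which keeps every feature strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    """Scores every query against every key: O(n^2) in sequence length."""
    out = []
    for q in Q:
        scores = [dot(phi(q), phi(k)) for k in K]
        norm = sum(scores)
        out.append([sum(s * v[c] for s, v in zip(scores, V)) / norm
                    for c in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same result via associativity: summarize keys and values once, O(n) overall."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d                       # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for c in range(dv):
                S[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / norm
                    for c in range(dv)])
    return out

# Illustrative sizes only.
rng = random.Random(0)
n, d, dv = 6, 4, 3
Q = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
K = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
V = [[rng.gauss(0.0, 1.0) for _ in range(dv)] for _ in range(n)]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print("max difference between the two forms:", max_diff)
```

The summary state (S, z) has a fixed size that does not depend on n, which is what allows recurrent, constant-memory inference in this style of model.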

These families are connected by several cross-cutting principles:

  • Low-rank structure: Many efficiency gains exploit the fact that the computations and representations in neural networks have low effective rank. LoRA exploits low-rank weight updates; linear attention exploits low-rank approximations of the attention matrix; pruning exploits sparsity, a related but distinct form of redundancy.
  • Conditional computation: Rather than applying all computation to all inputs, conditional computation allocates compute based on input difficulty or content. MoE is the most explicit form, but early exit, mixture-of-depths, and adaptive precision are related ideas.
  • Hardware-algorithm co-design: The most practical efficiency gains often come from designing algorithms that are aware of hardware constraints (memory hierarchy, parallelism, data movement). FlashAttention is the paradigmatic example.
  • Trading one resource for another: Many techniques trade off between different efficiency dimensions. Gradient checkpointing spends extra compute to save memory; MoE spends extra parameters to reduce per-token compute; quantization gives up numerical precision to gain speed and memory.
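The precision-for-memory trade can be sketched with a simple symmetric absmax quantizer. This is a deliberately minimal scheme with one scale per tensor, not GPTQ, AWQ, or LLM.int8(); the weight values and function names are made up for illustration.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    if largest == 0.0:
        return [0] * len(weights), 1.0  # all-zero tensor: any scale works
    scale = largest / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]

# Illustrative weights only.
weights = [0.82, -1.27, 0.05, 0.40, -0.63]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("integer codes:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

Storing each weight in 8 bits instead of 32 cuts memory by a factor of four, at the cost of a round-trip error of at most half the scale per weight.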

References