Taxonomy of Approaches
Efficient architecture methods span a broad spectrum of techniques, from fundamental architectural innovations to engineering optimizations. We organize them into the following families, which are not mutually exclusive -- many state-of-the-art systems combine techniques from multiple families:
- Efficient attention mechanisms: Reducing the O(n^2) cost of self-attention through sparse patterns (BigBird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020)), linear approximations (Performer (Choromanski et al., 2021), Linear Attention (Katharopoulos et al., 2020)), IO-aware implementations (FlashAttention (Dao et al., 2022; Dao, 2023)), and architectural modifications (Multi-Query Attention (Shazeer, 2019), Grouped-Query Attention (Ainslie et al., 2023), Differential Attention (Ye et al., 2024)). These methods preserve the Transformer architecture while making its core operation more efficient (grouped-query attention is sketched after this list).
- State space models (SSMs): Fundamentally different sequence models based on continuous-time dynamical systems (S4 (Gu et al., 2022), Mamba (Gu & Dao, 2024), Mamba-2 (Dao & Gu, 2024)) that achieve linear complexity in sequence length (a minimal recurrence is sketched after this list). SSMs represent the most radical departure from the Transformer paradigm, offering both computational and memory advantages for long sequences. Hybrid architectures that combine SSMs with attention (Jamba (Lieber et al., 2024), Zamba (Glorioso et al., 2024)) are emerging as a practical synthesis.
- Sub-quadratic architectures: Other alternatives to standard attention, including data-controlled convolutions (Hyena (Poli et al., 2023)), linear RNN variants (RWKV (Peng et al., 2023), xLSTM (Beck et al., 2024), Griffin (De et al., 2024)), retention mechanisms (RetNet (Sun et al., 2023)), and hybrid designs (BASED (Arora et al., 2024)). These architectures explore the design space between pure attention and pure recurrence, often achieving competitive perplexity on language modeling while enabling O(n) inference.
- Mixture of Experts (MoE): Conditional computation models that activate only a subset of parameters for each input, enabling much larger total model capacity without proportional compute increases (Switch Transformer (Fedus et al., 2022), Mixtral (Jiang et al., 2024), DeepSeek-MoE (Dai et al., 2024), GShard (Lepikhin et al., 2021)). MoE is the most explicit realization of conditional computation -- allocating compute where it is needed (top-k expert routing is sketched after this list).
- Parameter-efficient fine-tuning (PEFT): Methods that adapt large pre-trained models by modifying only a small fraction of parameters (LoRA (Hu et al., 2022), QLoRA (Dettmers et al., 2023), adapters (Houlsby et al., 2019), prompt tuning (Lester et al., 2021), DoRA (Liu et al., 2024), GaLore (Zhao et al., 2024)). PEFT reduces the cost of customization and enables new deployment patterns, such as multi-tenant serving with S-LoRA (Sheng et al., 2024) and continual learning (a LoRA forward pass is sketched after this list).
- Neural Architecture Search (NAS): Automated discovery of efficient architectures through search algorithms (DARTS (Liu et al., 2019), EfficientNet (Tan & Le, 2019), Once-for-All (Cai et al., 2020), MnasNet (Tan et al., 2019)). NAS replaces human design intuition with systematic optimization, potentially discovering architectures that human engineers would not consider.
- Model compression: Techniques that reduce the size and inference cost of trained models, including knowledge distillation (Hinton et al., 2015; DistilBERT, Sanh et al., 2019), pruning (SparseGPT (Frantar & Alistarh, 2023), Wanda (Sun et al., 2024), Lottery Ticket Hypothesis (Frankle & Carbin, 2019)), and quantization (GPTQ (Frantar et al., 2023), AWQ (Lin et al., 2024), LLM.int8() (Dettmers et al., 2022), BitNet b1.58 (Ma et al., 2024)). Compression techniques are particularly important for deployment on resource-constrained hardware (basic int8 weight quantization is sketched after this list).
- Efficient training: Techniques for reducing the cost of training, including mixed-precision arithmetic (Micikevicius et al., 2018), gradient checkpointing (Chen et al., 2016), distributed training strategies (ZeRO (Rajbhandari et al., 2020), FSDP (Zhao et al., 2023)), tensor parallelism (Megatron-LM (Shoeybi et al., 2020)), and pipeline parallelism (GPipe (Huang et al., 2019)). These techniques do not change the model architecture but change how it is trained (gradient checkpointing is sketched after this list).
- Inference optimization: Techniques for faster and cheaper model serving, including speculative decoding (Leviathan et al., 2023; Chen et al., 2023), KV-cache management (PagedAttention (Kwon et al., 2023)), continuous batching, and serving system design (vLLM (Kwon et al., 2023), SGLang (Zheng et al., 2024)). These techniques are critical for production deployment (a toy speculative decoding loop is sketched after this list).
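
To make the grouped-query attention idea concrete, here is a minimal NumPy sketch (function and variable names are illustrative, not drawn from any particular library): query heads are split into groups that share a single key/value head, so the KV cache shrinks by the ratio of query heads to KV heads.

```python
# Grouped-query attention sketch: 8 query heads share 2 cached KV heads.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    # q: (seq, n_q_heads, d_head); k, v: (seq, n_kv_heads, d_head)
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so the query heads in its group can attend to it.
    k = np.repeat(k, group, axis=1)            # (seq, n_q_heads, d_head)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    weights = softmax(scores, axis=-1)         # (n_q_heads, seq, seq)
    return np.einsum("hqk,khd->qhd", weights, v)

seq, d_head = 16, 32
q = np.random.randn(seq, 8, d_head)   # 8 query heads
k = np.random.randn(seq, 2, d_head)   # only 2 KV heads need to be cached
v = np.random.randn(seq, 2, d_head)
print(grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2).shape)  # (16, 8, 32)
```

Multi-query attention is the special case n_kv_heads = 1; standard multi-head attention is n_kv_heads = n_q_heads.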
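
The linear-time claim for SSMs can be seen in a minimal sketch of a discretized, time-invariant state space recurrence (in the spirit of S4; Mamba's input-dependent, selective parameters are omitted). All names and dimensions below are illustrative.

```python
# Linear-time SSM scan: one O(1) state update per token, no n x n attention matrix.
import numpy as np

def ssm_scan(x, A, B, C):
    # x: (seq, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]   # linear state update
        ys.append(C @ h)       # readout from the hidden state
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_in, d_out, seq = 16, 4, 4, 32
A = 0.9 * np.eye(d_state)                    # stable, decaying dynamics
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
print(ssm_scan(rng.normal(size=(seq, d_in)), A, B, C).shape)  # (32, 4)
```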
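
Below is a minimal sketch of top-k expert routing, the core of conditional computation in MoE layers. The router, the toy one-layer experts, and the dimensions are stand-ins, not any specific system's implementation.

```python
# Top-k MoE routing sketch: each token runs through only top_k of num_experts experts.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, router_w, expert_ws, top_k=2):
    # x: (tokens, d); router_w: (d, num_experts); expert_ws: list of (d, d) matrices
    logits = x @ router_w                                # (tokens, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]        # chosen experts per token
    gates = softmax(np.take_along_axis(logits, top, axis=-1), axis=-1)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):                          # explicit loop for clarity
        for e, g in zip(top[i], gates[i]):
            out[i] += g * np.tanh(x[i] @ expert_ws[e])   # toy expert: one nonlinear layer
    return out

rng = np.random.default_rng(0)
tokens, d, num_experts = 8, 16, 4
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, num_experts))
expert_ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(num_experts)]
print(moe_layer(x, router_w, expert_ws).shape)  # (8, 16)
```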
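
For PEFT, here is a minimal sketch of a LoRA-style forward pass under the usual formulation Wx + (alpha/r) BAx, with B zero-initialized so training starts from the pretrained behavior. Variable names are illustrative.

```python
# LoRA sketch: frozen W plus a trainable rank-r update B @ A.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8

W = rng.normal(size=(d_out, d_in)) * 0.02   # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: update starts at 0
alpha = 16.0                                # scaling hyperparameter

def lora_forward(x):
    # x: (batch, d_in). Adds the low-rank path (alpha / r) * x A^T B^T to the frozen path.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
print(lora_forward(x).shape)                                 # (4, 512)
print(A.size + B.size, "trainable vs", W.size, "frozen parameters")
```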
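
For compression, this is a minimal sketch of round-to-nearest, per-output-channel int8 weight quantization; methods such as GPTQ and AWQ add error compensation and activation-aware scaling on top of this basic quantize/dequantize step. Names are illustrative.

```python
# Per-channel int8 weight quantization sketch: ~4x smaller weights at some accuracy cost.
import numpy as np

def quantize_int8(W):
    # W: (d_out, d_in). One scale per output channel (row).
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)
print("int8 bytes:", q.nbytes, "fp32 bytes:", W.nbytes)
print("mean abs error:", float(np.abs(W - W_hat).mean()))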
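
Gradient checkpointing is the clearest example of trading compute for memory; a minimal sketch, assuming PyTorch is available, is shown below. torch.utils.checkpoint.checkpoint discards a block's intermediate activations in the forward pass and recomputes them during backward; the model and shapes are illustrative.

```python
# Gradient checkpointing sketch: activations inside each block are recomputed on backward.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList([nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(8)])

def forward(x, use_checkpointing=True):
    for block in blocks:
        if use_checkpointing:
            # Do not store this block's activations; recompute them during backward.
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x

x = torch.randn(4, 512, requires_grad=True)
forward(x).sum().backward()   # gradients flow through the recomputed activations
```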
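
Finally, a toy sketch of greedy speculative decoding with stand-in "models" (plain callables mapping a token sequence to next-token scores); everything here is illustrative. A cheap draft model proposes k tokens, the target model verifies them (in a real system all k+1 positions are scored in a single forward pass), and the longest agreeing prefix is accepted, so several tokens can be emitted per expensive target step.

```python
# Greedy speculative decoding sketch over a 10-token toy vocabulary.
import numpy as np

def greedy_speculative_step(prefix, draft_next, target_next, k=4):
    # Draft proposes k tokens autoregressively (cheap calls).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = int(np.argmax(draft_next(ctx)))
        proposal.append(tok)
        ctx.append(tok)
    # Target verifies the proposal position by position.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        target_tok = int(np.argmax(target_next(ctx)))
        if target_tok != tok:
            accepted.append(target_tok)   # take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(int(np.argmax(target_next(ctx))))  # bonus token: all k accepted
    return accepted

# Stand-in models: both prefer (last token + 1) % 10, but the draft is wrong after token 7.
target_next = lambda ctx: np.eye(10)[(ctx[-1] + 1) % 10]
draft_next = lambda ctx: np.eye(10)[(ctx[-1] + 2) % 10 if ctx[-1] == 7 else (ctx[-1] + 1) % 10]

print(greedy_speculative_step([0], draft_next, target_next, k=4))  # [1, 2, 3, 4, 5]
```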
These families are connected by several cross-cutting principles:
- Low-rank structure: Many efficiency gains exploit the fact that the computations and representations in neural networks have low effective rank. LoRA exploits low-rank weight updates; linear attention exploits low-rank attention matrices; pruning exploits sparsity, a related but distinct form of parameter redundancy.
- Conditional computation: Rather than applying all computation to all inputs, conditional computation allocates compute based on input difficulty or content. MoE is the most explicit form, but early exit, mixture-of-depths, and adaptive precision are related ideas.
- Hardware-algorithm co-design: The most practical efficiency gains often come from designing algorithms that are aware of hardware constraints (memory hierarchy, parallelism, data movement). FlashAttention is the paradigmatic example.
- Trading one resource for another: Many techniques trade off between different efficiency dimensions. Gradient checkpointing trades compute for memory; MoE trades parameters for compute; quantization trades precision for speed and memory.
References
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP.
- Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré (2024). Simple linear attention language models balance the recall-throughput tradeoff. ICML.
- Maximilian Beck, Korbinian Poppel, Markus Spanring (2024). xLSTM: Extended Long Short-Term Memory. NeurIPS.
- Iz Beltagy, Matthew E. Peters, Arman Cohan (2020). Longformer: The Long-Document Transformer. arXiv.
- Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han (2020). Once-for-All: Train One Network and Specialize it for Efficient Deployment. ICLR.
- Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin (2016). Training Deep Nets with Sublinear Memory Cost. arXiv.
- Krzysztof Choromanski, Valerii Likhosherstov, David Dohan (2021). Rethinking Attention with Performers. ICLR.
- Damai Dai, Chengqi Deng, Chenggang Zhao (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv.
- Tri Dao, Albert Gu (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML.
- Soham De, Samuel L. Smith, Anushan Fernando (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv.
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS.
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS.
- William Fedus, Barret Zoph, Noam Shazeer (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR.
- Jonathan Frankle, Michael Carbin (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR.
- Elias Frantar, Dan Alistarh (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. ICML.
- Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. ICLR.
- Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Beren Millidge (2024). Zamba: A Compact 7B SSM Hybrid Model. arXiv.
- Albert Gu, Karan Goel, Christopher Ré (2022). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR.
- Albert Gu, Tri Dao (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv.
- Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly (2019). Parameter-Efficient Transfer Learning for NLP. ICML.
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
- Yanping Huang, Youlong Cheng, Ankur Bapna (2019). GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. NeurIPS.
- Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux (2024). Mixtral of Experts. arXiv.
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, Francois Fleuret (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML.
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP.
- Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR.
- Brian Lester, Rami Al-Rfou, Noah Constant (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP.
- Opher Lieber, Barak Lenz, Hofit Bata (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv.
- Ji Lin, Jiaming Tang, Haotian Tang (2024). AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. MLSys.
- Hanxiao Liu, Karen Simonyan, Yiming Yang (2019). DARTS: Differentiable Architecture Search. ICLR.
- Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML.
- Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv.
- Paulius Micikevicius, Sharan Narang, Jonah Alben (2018). Mixed Precision Training. ICLR.
- Bo Peng, Eric Alcaide, Quentin Anthony (2023). RWKV: Reinventing RNNs for the Transformer Era. EMNLP Findings.
- Michael Poli, Stefano Massaroli, Eric Nguyen (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML.
- Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC.
- Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf (2019). DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. EMC2 Workshop at NeurIPS.
- Noam Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv.
- Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica (2024). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. MLSys.
- Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro (2020). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv.
- Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei (2023). Retentive Network: A Successor to Transformer for Large Language Models. arXiv.
- Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter (2024). A Simple and Effective Pruning Approach for Large Language Models. ICLR.
- Mingxing Tan, Quoc V. Le (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML.
- Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, Quoc V. Le (2019). MnasNet: Platform-Aware Neural Architecture Search for Mobile. CVPR.
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei (2024). Differential Transformer. arXiv.
- Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed (2020). Big Bird: Transformers for Longer Sequences. NeurIPS.
- Yanli Zhao, Andrew Gu, Rohan Varma (2023). PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. VLDB.
- Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian (2024). GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. ICML.
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Siyuan Zhuang, Ying Sheng, Ion Stoica (2024). SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS.