Introduction & Motivation
The scaling paradigm that has driven the deep learning revolution -- larger models, more data, more compute -- is approaching fundamental limits. Training frontier language models now costs tens of millions of dollars, consumes megawatt-hours of electricity, and requires months on thousands of accelerators [@kaplan2020scaling, @hoffmann2022training]. GPT-4 is estimated to have cost over $1 billion (Epoch AI, 2023). Inference costs scale with user demand, and for widely deployed models the aggregate inference cost can dwarf the training cost within months of deployment. The environmental impact of this trajectory is increasingly scrutinized: Strubell et al. (2019) estimated that training a single large NLP model produces carbon emissions comparable to five cars over their lifetimes, and the situation has only intensified as model sizes have grown by orders of magnitude since.
Efficiency in AI architecture design has thus become a first-class objective, no longer a secondary concern to be addressed after accuracy is maximized. The goal is to achieve the same or better performance with fewer FLOPs, less memory, lower latency, or reduced energy consumption. This is not merely an engineering challenge -- it is a fundamental research question about the nature of computation in neural networks. The observation that most neural network parameters are redundant (networks can be pruned by 90%+ with minimal accuracy loss [@han2015learning, @frankle2019lottery]) suggests that our architectures are profoundly inefficient in how they use parameters and compute. Understanding and eliminating this inefficiency is both practically urgent and scientifically important.
The field has been transformed by several converging forces:
- The compute wall. Moore's Law has slowed dramatically, with transistor density improvements decelerating from the historical 2x every 18 months to roughly 2x every 3-4 years. Meanwhile, training compute for notable models has grown at roughly 4x per year, doubling about every six months (Sevilla et al., 2022). This divergence means that architectural efficiency gains are increasingly the primary lever for continued capability growth.
- The memory wall. Memory bandwidth, not compute, is often the bottleneck in modern AI systems. Transformer inference is memory-bandwidth-bound during autoregressive generation, where each token requires reading the full model weights and the growing KV-cache from memory. This has motivated architectural innovations (multi-query attention, KV-cache compression) and systems innovations (PagedAttention, speculative decoding) that specifically target memory efficiency.
- The deployment imperative. As AI systems move from research labs to production deployment -- powering search engines, coding assistants, autonomous vehicles, and medical diagnostics -- the cost of inference per query becomes a binding economic constraint. A model that is 2x more efficient at inference can serve twice as many users at the same cost, or the same users at half the cost. This has driven intense focus on inference optimization techniques.
- The democratization goal. The concentration of frontier AI capability in a handful of well-funded organizations has motivated work on making capable models accessible to broader communities. Quantization techniques like GPTQ and GGML have enabled running billion-parameter models on consumer hardware. Parameter-efficient fine-tuning methods like LoRA and QLoRA have enabled customizing large models on single GPUs. These democratization-driven innovations have become some of the most impactful efficiency advances.
- The long-context frontier. Applications requiring very long contexts (entire codebases, full books, long videos) push against the quadratic cost of standard attention. The need for million-token context windows has driven architectural innovations from sub-quadratic attention mechanisms to state space models to hybrid architectures.
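The memory wall and the quadratic-attention cost above can be made concrete with back-of-envelope arithmetic. The sketch below uses hypothetical but representative dimensions for a 7B-class decoder (32 layers, 32 KV heads, head dimension 128, fp16 weights); the function names and numbers are illustrative, not those of any specific model.

```python
# Back-of-envelope estimates for the memory wall and quadratic attention.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Each layer caches one key and one value vector per KV head per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def attention_flops(seq_len, d_model):
    # QK^T scores plus attention-weighted values: ~2*n^2*d multiply-adds each.
    return 4 * seq_len**2 * d_model

# Hypothetical 7B-class decoder at a 4k-token context, fp16 cache.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache at 4k tokens: {cache / 2**30:.1f} GiB")  # -> 2.0 GiB

# Quadratic scaling: doubling the context quadruples attention compute.
ratio = attention_flops(8192, 4096) / attention_flops(4096, 4096)
print(ratio)  # -> 4.0
```

Every generated token must stream this cache (plus the full weights) from memory, which is why multi-query attention -- sharing one KV head across all query heads -- shrinks the cache by the head count, and why sub-quadratic alternatives become attractive at million-token contexts.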
This chapter surveys the landscape of efficient architecture design, organized by technique family. We cover efficient attention mechanisms and sub-quadratic alternatives to the Transformer, state space models, Mixture-of-Experts, parameter-efficient fine-tuning, neural architecture search, model compression, training efficiency, and inference optimization. Throughout, we emphasize not just the techniques themselves but the underlying principles -- IO-awareness, conditional computation, low-rank structure, and hardware-algorithm co-design -- that connect seemingly disparate approaches into a coherent framework for efficient AI.
References
- Epoch AI (2023). Trends in Machine Learning Hardware. Epoch AI Research.
- Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Pablo Villalobos (2022). Compute Trends Across Three Eras of Machine Learning. arXiv.
- Emma Strubell, Ananya Ganesh, Andrew McCallum (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.