Connections to Other Chapters
Continual Learning (Chapter 1): Efficient architecture design and continual learning are deeply intertwined. MoE architectures naturally support continual learning by routing new tasks to new or existing experts while freezing old ones, providing both computational efficiency and forgetting prevention through conditional computation. Parameter-efficient fine-tuning methods (LoRA, adapters) serve dual purposes: they reduce the cost of adaptation and naturally mitigate forgetting by limiting the number of parameters modified per task. O-LoRA (Wang, 2023) explicitly connects PEFT to continual learning by constraining new LoRA modules to be orthogonal to previous ones. Architecture-based continual learning methods (PackNet, HAT, SupSup) from Chapter 1 are essentially efficient parameter isolation techniques that exploit network sparsity. The model merging paradigm (task arithmetic, TIES-Merging) can be viewed as an efficient alternative to continual learning that avoids sequential training entirely.
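The orthogonality constraint behind O-LoRA can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: adapters are plain lists of row vectors, and the function name `orthogonality_penalty` is invented for this example. The idea it demonstrates is the one stated above: a new task's LoRA subspace is penalized for overlapping with the frozen subspaces of earlier tasks.

```python
# Hedged sketch of the O-LoRA idea: penalize overlap between the
# low-rank "A" matrix of a new task's LoRA module and those of
# earlier, frozen tasks. Names and matrix shapes are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonality_penalty(A_new, A_prev_list):
    """Sum of squared inner products between each row of the new
    LoRA matrix and every row of previously learned LoRA matrices.
    Driving this term to zero during training pushes the new
    update into a subspace orthogonal to the old ones."""
    penalty = 0.0
    for A_old in A_prev_list:
        for u in A_new:
            for v in A_old:
                penalty += dot(u, v) ** 2
    return penalty

# Two orthogonal rank-1 adapters incur zero penalty...
A_task1 = [[1.0, 0.0, 0.0, 0.0]]
A_task2 = [[0.0, 1.0, 0.0, 0.0]]
print(orthogonality_penalty(A_task2, [A_task1]))  # 0.0

# ...while an overlapping adapter is penalized.
A_task3 = [[1.0, 1.0, 0.0, 0.0]]
print(orthogonality_penalty(A_task3, [A_task1]))  # 1.0
```

In practice this penalty would be added to the task loss, so gradient descent trades off task performance against interference with earlier adapters.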
World Models (Chapter 2): World models benefit directly from efficient architectures across multiple dimensions. SSMs (Mamba) can model long-horizon dynamics with linear cost, enabling world models to simulate longer rollouts without the quadratic cost of attention over long temporal sequences. Efficient attention mechanisms (FlashAttention, sparse attention) enable video-based world models to process longer temporal contexts, and MoE can scale world models across diverse domains (different physics engines, different environments) without proportional compute increases. The latency requirements of model-based planning (where the world model must be queried many times during search) make inference optimization particularly critical for world model applications. State space models' connection to control theory (Chapter 2's mathematical foundation) provides a natural bridge between efficient sequence modeling and world model dynamics.
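The linear-cost rollout claim can be made concrete with a minimal state-space recurrence. This is a deliberately scalar toy, assuming fixed parameters `a`, `b`, `c`; real SSMs such as Mamba use vector states and input-dependent (selective) parameters, but share the same structural property: the state is a fixed-size summary updated once per step, so a T-step rollout costs O(T) rather than the O(T^2) of attention over the full history.

```python
# Minimal sketch of a linear state-space rollout:
#   h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
# One O(1) state update per step gives O(T) total cost, which is
# why SSM-based world models can simulate long horizons cheaply.
# Scalar state and fixed (a, b, c) are simplifying assumptions.

def ssm_rollout(xs, a=0.9, b=0.1, c=1.0, h0=0.0):
    h, ys = h0, []
    for x in xs:            # fixed-size state: no growing context
        h = a * h + b * x
        ys.append(c * h)
    return ys

ys = ssm_rollout([1.0] * 5)
print(ys)  # geometric approach toward the steady state b / (1 - a)
```

For planning workloads that query the model many times per decision, this per-step cost difference compounds with the number of rollouts.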
Agentic Search (Chapter 4): Efficient inference is critical for agentic systems that perform many model queries during search, planning, and retrieval. An agentic system that queries an LLM hundreds of times per user request amplifies the cost of each query by the fan-out factor, making inference optimization (speculative decoding, KV-cache management, quantization) directly impact the economic viability of agentic applications. The tree search methods in Chapter 4 (MCTS-LLM, beam search over reasoning paths) require efficient batch inference for parallel candidate evaluation. RAG systems benefit from efficient embedding models (for retrieval) and efficient generation models (for synthesis), connecting retrieval efficiency to architecture efficiency. Multi-turn agentic conversations can reuse KV-cache across turns (RadixAttention in SGLang), connecting serving system design to agentic workflow efficiency.
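The cross-turn cache reuse mentioned above can be sketched with a toy prefix lookup. This is a simplification in the spirit of RadixAttention, not SGLang's actual API: a dict of cached token prefixes stands in for the real radix tree and per-token KV tensors, and `longest_cached_prefix` is an invented name.

```python
# Hedged sketch of KV-cache prefix reuse across conversation turns:
# successive turns share a prompt prefix, so cached keys/values for
# that prefix need not be recomputed; only the suffix requires a
# fresh forward pass. The dict-of-prefixes "cache" is a stand-in
# for a real radix-tree over token sequences.

def longest_cached_prefix(tokens, cache):
    """Return (reused, remaining): the longest cached prefix of
    `tokens`, and the suffix still needing computation."""
    best = 0
    for cached in cache:
        n = len(cached)
        if n > best and tuple(tokens[:n]) == cached:
            best = n
    return tokens[:best], tokens[best:]

# Turn 1 populated the cache with its full token sequence.
cache = {tuple("system prompt + turn 1".split()): "kv-tensors"}

# Turn 2 extends the same conversation: only 3 new tokens to process.
turn2 = "system prompt + turn 1 and turn 2".split()
reused, remaining = longest_cached_prefix(turn2, cache)
print(len(reused), remaining)
```

The fan-out argument applies here too: with hundreds of queries per request, every reused prefix token is a token of prefill compute saved on each query.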
Randomized Algorithms (Chapter 5): Many efficient architecture techniques have deep roots in randomized algorithms. Random feature approximations for attention kernels (Performers, RFA) directly apply the random Fourier feature framework from the kernel methods literature. Locality-sensitive hashing in Reformer connects sparse attention to the hashing techniques surveyed in Chapter 5. Quantization can be viewed as a form of stochastic rounding with provable approximation guarantees. Sketching algorithms enable efficient gradient compression for distributed training. Random projections underlie the dimensionality reduction in SSM state initialization (HiPPO). The streaming algorithms perspective from Chapter 5 -- processing data in a single pass with bounded memory -- provides the theoretical framework for online inference optimization (KV-cache eviction, streaming attention). The Johnson-Lindenstrauss lemma provides the theoretical foundation for why low-rank approximations (LoRA) and random projections preserve the essential structure of high-dimensional computations.
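The Johnson-Lindenstrauss claim admits a short numerical illustration. The sketch below uses only the standard library; the dimensions (d = 1000 down to k = 200) and function names are illustrative choices, not prescribed by the lemma. It checks that a random Gaussian projection approximately preserves the distance between two random vectors, which is the property low-rank and random-feature methods rely on.

```python
import math
import random

# Hedged sketch of the Johnson-Lindenstrauss idea: a random
# Gaussian projection from d dimensions to k << d preserves
# pairwise distances up to a small relative error with high
# probability. Entries are drawn N(0, 1/k) so squared norms are
# preserved in expectation.

random.seed(0)
d, k = 1000, 200
R = [[random.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)]
     for _ in range(k)]

def project(x):
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in R]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

x = [random.gauss(0.0, 1.0) for _ in range(d)]
y = [random.gauss(0.0, 1.0) for _ in range(d)]
diff = [a - b for a, b in zip(x, y)]
proj_diff = [a - b for a, b in zip(project(x), project(y))]

ratio = norm(proj_diff) / norm(diff)
print(round(ratio, 3))  # close to 1.0: distance approximately preserved
```

The same phenomenon is why a rank-r update (as in LoRA) or a small set of random features can stand in for a much larger computation without destroying its geometry.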