Connections to Other Chapters
Continual Learning (Chapter 1): Efficient architecture design and continual learning are deeply intertwined. MoE architectures can support continual learning by routing new tasks to new or existing experts while freezing old ones, though this approach faces challenges including router collapse, expert under-utilization, capacity growth over tasks, and interference when experts are shared across tasks (Huang, 2024). Parameter-efficient fine-tuning methods (LoRA, adapters) serve dual purposes: they reduce the cost of adaptation and naturally mitigate forgetting by limiting the number of parameters modified per task. O-LoRA (Wang, 2023) explicitly connects PEFT to continual learning by constraining new LoRA modules to be orthogonal to previous ones. Architecture-based continual learning methods (PackNet, HAT, SupSup) from Chapter 1 are essentially efficient parameter isolation techniques that exploit network sparsity. The model merging paradigm (task arithmetic, TIES-Merging) can be viewed as an efficient alternative to continual learning that avoids sequential training entirely.
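The orthogonality idea behind O-LoRA can be illustrated with a small penalty term. The sketch below is a minimal illustration, not the formulation from the paper: it assumes each LoRA module is summarized by the rows of its down-projection matrix (one row per rank direction), and the function name `orthogonality_penalty` is hypothetical. The penalty is the sum of squared inner products between every old direction and every new direction, which is zero exactly when the two subspaces are orthogonal.

```python
def orthogonality_penalty(a_prev, a_new):
    """Sum of squared dot products between rows of a_prev and rows of a_new.

    a_prev, a_new: lists of equal-length lists of floats, one row per
    LoRA rank direction (an assumed layout, chosen for illustration).
    """
    total = 0.0
    for u in a_prev:
        for v in a_new:
            dot = sum(x * y for x, y in zip(u, v))
            total += dot * dot
    return total


# Directions learned for an earlier task (frozen).
old_task = [[1.0, 0.0, 0.0]]

# A new direction orthogonal to the old one incurs no penalty.
print(orthogonality_penalty(old_task, [[0.0, 1.0, 0.0]]))

# A new direction that overlaps the old one is penalized.
print(orthogonality_penalty(old_task, [[1.0, 1.0, 0.0]]))
```

In training, a term like this would be added to the task loss with a weighting coefficient, so that updates for the new task are pushed away from the subspace used by earlier tasks.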
World Models (Chapter 2): World models benefit directly from efficient architectures along several dimensions. SSMs (Mamba) can model long-horizon dynamics at a cost linear in sequence length, enabling world models to simulate longer rollouts without the quadratic cost of attention over long temporal sequences. Efficient attention mechanisms (FlashAttention, sparse attention) enable video-based world models to process longer temporal contexts, and MoE can scale world models across diverse domains (different physics engines, different environments) without proportional compute increases. The latency requirements of model-based planning, where the world model must be queried many times during search, make inference optimization particularly critical for world model applications. The connection between state space models and control theory (Chapter 2's mathematical foundation) provides a natural bridge between efficient sequence modeling and world model dynamics.
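The linear-cost claim for SSMs follows from their recurrent form: each step updates a fixed-size state, so a rollout of T steps costs O(T). The sketch below assumes a time-invariant SSM with a diagonal state matrix, h_t = a * h_{t-1} + b * x_t and y_t = c · h_t, which is a simplification; Mamba makes these parameters input-dependent. The function name `diagonal_ssm_scan` is hypothetical.

```python
def diagonal_ssm_scan(a, b, c, xs):
    """Run a diagonal linear state space model over a scalar input sequence.

    a, b, c: lists of floats of equal length (state dimension).
    xs: list of scalar inputs.
    Returns one scalar output per input; work per step is O(state dimension),
    so total work grows linearly with len(xs).
    """
    h = [0.0] * len(a)  # fixed-size hidden state
    ys = []
    for x in xs:
        # Elementwise state update: h_t = a * h_{t-1} + b * x_t
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # Readout: y_t = c . h_t
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# An impulse input decays geometrically through a one-dimensional state.
print(diagonal_ssm_scan([0.5], [1.0], [1.0], [1.0, 0.0, 0.0]))
```

The state `h` never grows with the sequence, in contrast to an attention KV-cache, which is what makes long rollouts cheap in memory as well as compute.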
Agentic Search (Chapter 4): Efficient inference is critical for agentic systems that perform many model queries during search, planning, and retrieval. An agentic system that queries an LLM hundreds of times per user request multiplies the per-query cost by its fan-out factor, so inference optimization (speculative decoding, KV-cache management, quantization) directly affects the economic viability of agentic applications. The tree search methods in Chapter 4 (MCTS-LLM, beam search over reasoning paths) require efficient batch inference for parallel candidate evaluation. RAG systems benefit from efficient embedding models (for retrieval) and efficient generation models (for synthesis), connecting retrieval efficiency to architecture efficiency. Multi-turn agentic conversations can reuse the KV-cache across turns (RadixAttention in SGLang (Zheng et al., 2024)), connecting serving system design to agentic workflow efficiency.
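The benefit of cross-turn KV-cache reuse can be seen in a toy prefix cache. This is a sketch under simplifying assumptions, not SGLang's API or data structure: RadixAttention organizes cached prefixes in a radix tree, whereas the hypothetical `PrefixCache` below just scans stored token sequences for the longest shared prefix. It only counts how many tokens would still need prefill computation.

```python
class PrefixCache:
    """Toy cache that remembers token sequences whose KV entries were computed."""

    def __init__(self):
        self._sequences = []

    def match_len(self, tokens):
        """Length of the longest prefix of `tokens` shared with any cached sequence."""
        best = 0
        for seq in self._sequences:
            n = 0
            for cached_tok, new_tok in zip(seq, tokens):
                if cached_tok != new_tok:
                    break
                n += 1
            best = max(best, n)
        return best

    def insert(self, tokens):
        self._sequences.append(list(tokens))


def tokens_to_prefill(cache, tokens):
    """Number of tokens whose KV entries must be computed for this request."""
    reused = cache.match_len(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


cache = PrefixCache()
turn_1 = [1, 2, 3, 4]          # system prompt plus first user message
turn_2 = turn_1 + [5, 6]       # same history plus a follow-up
print(tokens_to_prefill(cache, turn_1))  # nothing cached yet
print(tokens_to_prefill(cache, turn_2))  # only the new suffix is computed
```

For an agent that issues many calls sharing a long system prompt or conversation history, the reusable prefix dominates the request, so the prefill savings scale with the fan-out factor.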
Randomized Algorithms (Chapter 5): Many efficient architecture techniques have deep roots in randomized algorithms. Random feature approximations for attention kernels (Performers, RFA) directly apply the random Fourier feature framework from the kernel methods literature. Locality-sensitive hashing in Reformer connects sparse attention to the hashing techniques surveyed in Chapter 5. Stochastic rounding variants of quantization can provide provable approximation guarantees by making the quantized value an unbiased estimator of the original (Gupta et al., 2015), though most production quantization methods (GPTQ, AWQ) use deterministic rounding. Sketching algorithms enable efficient gradient compression for distributed training. SSM state initialization (HiPPO (Gu et al., 2020)), by contrast, uses deterministic structured projections onto orthogonal polynomial bases, offering a contrasting design point to the randomized methods above. The streaming algorithms perspective from Chapter 5 (processing data in a single pass with bounded memory) provides the theoretical framework for online inference optimization (KV-cache eviction, streaming attention). The Johnson-Lindenstrauss lemma explains why random projections preserve the essential structure of high-dimensional computations, though low-rank methods like LoRA rely on the low intrinsic dimensionality of adaptation (Aghajanyan et al., 2021) rather than on random projection guarantees.
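The unbiasedness property of stochastic rounding can be checked empirically. The sketch below rounds to the integer grid for simplicity (a real quantizer would first scale values to its own grid), and the helper names `stochastic_round` and `mean_of_rounded` are hypothetical. A value is rounded up with probability equal to its fractional part, so the expected rounded value equals the original.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x down or up to a neighboring integer, up with probability frac(x)."""
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)


def mean_of_rounded(x, n, seed=0):
    """Average of n independent stochastic roundings of x."""
    rng = random.Random(seed)
    return sum(stochastic_round(x, rng) for _ in range(n)) / n


x = 0.3
print("deterministic:", round(x))                     # always the same grid point
print("stochastic mean:", mean_of_rounded(x, 20000))  # close to x on average
```

Deterministic rounding maps 0.3 to 0 every time, so the error never averages out; stochastic rounding trades higher per-sample variance for zero bias, which is the property the guarantee relies on.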
References
- Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL.
- Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re (2020). HiPPO: Recurrent Memory with Optimal Polynomial Projections. NeurIPS.
- Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan (2015). Deep Learning with Limited Numerical Precision. ICML (PMLR v37).
- Cheng Huang (2024). LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin. ACL.
- Zhilin Wang (2023). Orthogonal Subspace Learning for Language Model Continual Learning. EMNLP Findings.
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng (2024). SGLang: Efficient Execution of Structured Language Model Programs. arXiv.