Taxonomy of Approaches
Agentic search methods span a wide spectrum of complexity and autonomy. We organize them by their level of agency and the sophistication of their search strategies:
- Tool-Augmented Retrieval: LLMs that can invoke search tools as part of their generation process (WebGPT (Nakano, 2021), Toolformer (Schick et al., 2023), ReAct (Yao et al., 2023), TALM (Parisi et al., 2022), Gorilla (Patil et al., 2023)). These represent the simplest form of agentic search: the model decides when to search and formulates queries, but typically performs only one or two search rounds. The key innovation is giving the LLM agency over when and how to search, rather than always retrieving. ReAct's interleaving of reasoning traces ("Thought") with actions ("Action") has become the standard design pattern for LLM agents, enabling interpretable and systematic search strategies. ToolBench (Qin et al., 2023) provides a broad evaluation suite covering 16,000+ real-world APIs.
- Retrieval-Augmented Generation (RAG): Systems that retrieve relevant documents and condition the LLM's generation on this retrieved context (REALM (Guu et al., 2020), RAG (Lewis et al., 2020), FiD (Izacard & Grave, 2021), RETRO (Borgeaud et al., 2022), Atlas (Izacard et al., 2023), Self-RAG (Asai et al., 2024), GraphRAG (Edge et al., 2024), CRAG (Yan et al., 2024)). RAG systems range from simple single-retrieval pipelines to sophisticated architectures with adaptive retrieval, re-ranking, and self-reflection. The fundamental insight is that retrieval can substitute for parametric scale -- RETRO with 7B parameters matches a 175B model without retrieval, and Atlas with 11B matches 540B PaLM. Modern RAG increasingly uses modular architectures (Gao et al., 2024) where each component (retrieval, reranking, filtering, generation, verification) can be independently optimized.
- Multi-Hop Reasoning Search: Systems that perform iterative retrieval interleaved with reasoning, combining evidence across multiple search steps (IRCoT (Trivedi et al., 2023), FLARE (Jiang et al., 2023), DSP/DSPy (Khattab et al., 2023; 2024), ITER-RETGEN (Shao et al., 2023), Adaptive-RAG (Jeong et al., 2024)). These methods are specifically designed for questions that cannot be answered from a single retrieval step. The compositionality gap (Press et al., 2023) -- the gap between single-hop and multi-hop reasoning performance -- motivates this entire family. DSPy has been particularly influential in systematizing the optimization of multi-hop pipelines, treating prompt engineering as a programming problem with automatic optimization.
- Agentic Web Search: Autonomous agents that browse the web and interact with web interfaces (WebGPT (Nakano, 2021), WebArena (Zhou et al., 2024), VisualWebArena (Koh et al., 2024), Mind2Web (Deng et al., 2024), BrowserGym (Drouin, 2024), WebVoyager (He, 2024)). These agents operate in the full complexity of the web, navigating pages, filling forms, and extracting information from diverse sources. The gap between human performance (~75-90% on these benchmarks) and agent performance (~15-30%) highlights the remaining challenge, though progress has been rapid with frontier models. Computer-use agents (Anthropic, 2024) extend this paradigm to general GUI interaction.
- Search with Planning (MCTS + LLMs): Systems that use tree search and planning algorithms for systematic exploration of solution and reasoning spaces (Tree-of-Thought (Yao et al., 2023), RAP (Hao et al., 2023), LATS (Zhou et al., 2024), AlphaProof (DeepMind, 2024), Reflexion (Shinn et al., 2023), Self-Refine (Madaan et al., 2023)). These methods apply the principled exploration-exploitation framework from game-playing AI to search and reasoning problems. Best-of-N sampling with process reward models (Lightman et al., 2023) provides a simpler but effective alternative to full tree search, improving MATH accuracy from ~50% to ~78% with PRM-guided selection. The connection to test-time compute scaling (Snell et al., 2024) suggests that search at inference time can be as valuable as scaling pre-training compute.
- Search in Code and Mathematics: Specialized search systems for domains with verifiable outcomes, including competitive programming (AlphaCode (Li et al., 2022), AlphaCode 2 (DeepMind, 2023)), software engineering (SWE-Agent (Yang et al., 2024), OpenHands (Wang et al., 2024)), and formal theorem proving (AlphaProof (DeepMind, 2024), DeepSeek-Prover (Xin, 2024), ReProver (Yang, 2023)). Verifiability enables more effective search through test-based filtering and formal verification. The pass@k metric directly quantifies the value of search: the gap between pass@1 and pass@100 measures how much search budget improves outcomes. SWE-bench resolution rates have progressed from ~3% to ~50%+ in under two years, demonstrating rapid capability gains.
- Self-Improving Search and Deep Research: Systems that conduct comprehensive, autonomous research through multi-turn search and synthesis (STORM (Shao et al., 2024), OpenAI Deep Research, Google Gemini Deep Research, Perplexity AI). These represent the most advanced form of agentic search, autonomously planning and executing complex research workflows that would take a human researcher hours or days. STORM simulates multi-perspective expert conversations to drive targeted searches, while deep research systems execute dozens to hundreds of searches with iterative refinement. AI-powered search engines like Perplexity and ChatGPT Search have brought this paradigm to millions of users.
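To make the first category concrete, ReAct's interleaving of "Thought" and "Action" steps can be sketched as a simple loop. Here `llm` and `search` are hypothetical callables standing in for a model API and a search backend, not any particular library:

```python
def react_loop(question, llm, search, max_steps=5):
    """Interleave reasoning traces with search actions until the
    model emits a final answer (a minimal ReAct-style sketch)."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model produces a reasoning step and a proposed action.
        step = llm(transcript + "Thought:")
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "finish":
            return step["answer"]
        # Otherwise execute the search action and record the observation.
        observation = search(step["query"])
        transcript += f"Action: search[{step['query']}]\n"
        transcript += f"Observation: {observation}\n"
    return None  # search budget exhausted without an answer
```

The agency described above lives in the `step["action"]` decision: the model, not the pipeline, chooses whether to search again or stop.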
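The multi-hop family's core pattern -- alternating retrieval and generation so each draft seeds the next query, in the spirit of ITER-RETGEN -- can be sketched as follows. `retrieve` and `generate` are hypothetical stand-ins, and the sketch omits the stopping heuristics real systems use:

```python
def iterative_rag(question, retrieve, generate, rounds=3):
    """Alternate retrieval and generation: each draft answer is folded
    into the next query, so later rounds can pull in evidence that the
    initial single-hop query missed."""
    draft = ""
    for _ in range(rounds):
        docs = retrieve((question + " " + draft).strip())
        draft = generate(question, docs)
    return draft
```

A single-retrieval pipeline is the special case `rounds=1`; the gap between the two is exactly the compositionality gap this family targets.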
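Best-of-N sampling with a process reward model, the lighter-weight alternative to tree search mentioned above, can be sketched like this. `generate` and `score_steps` are hypothetical stand-ins for a sampler and a PRM; aggregating by the weakest step is one common choice, since a single bad step should sink a candidate:

```python
def best_of_n(problem, generate, score_steps, n=16):
    """Sample n candidate solutions and return the one whose reasoning
    steps the process reward model rates highest, aggregating per-step
    scores by their minimum (a sketch, not a tuned selection rule)."""
    candidates = [generate(problem) for _ in range(n)]

    def aggregate(steps):
        # Each candidate is a list of reasoning steps; take the
        # weakest-step score so one flawed step penalizes the whole chain.
        return min(score_steps(steps))

    return max(candidates, key=aggregate)
```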
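The pass@k metric referenced under code and mathematics is usually computed with the standard unbiased estimator (popularized by the HumanEval/Codex evaluation methodology): given n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), evaluated in a numerically stable product form:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n total (c of them
    correct) passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: success certain
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

The pass@1-to-pass@100 gap the text describes is then just `pass_at_k(n, c, 100) - pass_at_k(n, c, 1)` for a fixed sample budget.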
These categories are not mutually exclusive: modern systems often combine elements from multiple categories. For example, a deep research system might use RAG as its retrieval backbone, multi-hop reasoning for evidence synthesis, MCTS for exploring alternative research directions, and tool-augmented retrieval for accessing diverse information sources. The trend is toward increasingly autonomous systems that combine all of these capabilities with minimal human intervention.
References
- Anthropic (2024). Introducing Computer Use. Anthropic Blog.
- Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR.
- Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann (2022). Improving Language Models by Retrieving from Trillions of Tokens. ICML.
- Google DeepMind (2023). AlphaCode 2 Technical Report. Google DeepMind.
- Google DeepMind (2024). AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems. Google DeepMind Blog.
- Xiang Deng, Yu Gu, Boyuan Zheng (2024). Mind2Web: Towards a Generalist Agent for the Web. NeurIPS.
- Alexandre Drouin (2024). The BrowserGym Ecosystem for Web Agent Research. arXiv.
- Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv.
- Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang (2024). Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks. arXiv.
- Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang (2020). Retrieval Augmented Language Model Pre-Training. ICML.
- Shibo Hao, Yi Gu, Haodi Ma (2023). Reasoning with Language Model is Planning with World Model. EMNLP.
- Hongliang He (2024). WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL.
- Gautier Izacard, Edouard Grave (2021). Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. EACL.
- Gautier Izacard, Patrick Lewis, Maria Lomeli (2023). Atlas: Few-shot Learning with Retrieval Augmented Language Models. JMLR.
- Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong C. Park (2024). Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. NAACL.
- Zhengbao Jiang, Frank F. Xu, Luyu Gao (2023). Active Retrieval Augmented Generation. EMNLP.
- Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. ACL.
- Patrick Lewis, Ethan Perez, Aleksandra Piktus (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- Yujia Li, David Choi, Junyoung Chung (2022). Competition-Level Code Generation with AlphaCode. Science.
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe (2023). Let's Verify Step by Step. ICLR.
- Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS.
- Reiichiro Nakano (2021). WebGPT: Browser-assisted Question-Answering with Human Feedback. arXiv.
- Aaron Parisi, Yao Zhao, Noah Fiedel (2022). TALM: Tool Augmented Language Models. arXiv.
- Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv.
- Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis (2023). Measuring and Narrowing the Compositionality Gap in Language Models. EMNLP Findings.
- Yujia Qin, Shengding Hu, Yankai Lin (2023). Tool Learning with Foundation Models. arXiv.
- Timo Schick, Jane Dwivedi-Yu, Roberto Dessi (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS.
- Zhihong Shao, Yeyun Gong, Yelong Shen (2023). Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. EMNLP Findings.
- Yijia Shao, Yucheng Jiang, Theodore A. Kanell (2024). Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models. NAACL.
- Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS.
- Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv.
- Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. ACL.
- Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Zilong Wang, Likun Liu, Junyang Lin, Graham Neubig (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv.
- Huajian Xin (2024). DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data. arXiv.
- Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, Zhen-Hua Ling (2024). Corrective Retrieval Augmented Generation. arXiv.
- Kaiyu Yang (2023). LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. NeurIPS.
- John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv.
- Shunyu Yao, Jeffrey Zhao, Dian Yu (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.
- Shunyu Yao, Dian Yu, Jeffrey Zhao (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.
- Shuyan Zhou, Frank F. Xu, Hao Zhu (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR.
- Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman (2024). Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models. ICML.