
Taxonomy of Approaches

Agentic search methods span a wide spectrum of complexity and autonomy. We organize them by their level of agency and the sophistication of their search strategies:

  1. Tool-Augmented Retrieval: LLMs that can invoke search tools as part of their generation process (WebGPT (Nakano et al., 2021), Toolformer (Schick et al., 2023), ReAct (Yao et al., 2023), TALM (Parisi et al., 2022), Gorilla (Patil et al., 2023)). These represent the simplest form of agentic search: the model decides when to search and formulates queries, but typically performs only one or two search rounds. The key innovation is giving the LLM agency over when and how to search, rather than always retrieving. ReAct's interleaving of reasoning traces ("Thought") with actions ("Action") has become the standard design pattern for LLM agents, enabling interpretable and systematic search strategies. ToolBench (Qin et al., 2023) provides one of the most comprehensive evaluations, covering 16,000+ real-world APIs.
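The Thought/Action/Observation loop that ReAct popularized can be sketched in a few lines. This is a minimal illustration, not ReAct's actual implementation: `llm_step` and `search_tool` are hard-coded stand-ins for a real LLM policy and a real search API.

```python
# Minimal sketch of a ReAct-style loop. `llm_step` and `search_tool` are
# placeholders; a real agent prompts an LLM and calls a live search backend.

def llm_step(transcript: str) -> str:
    """Placeholder policy: choose the next action given the transcript so far."""
    if "Observation:" in transcript:
        return "Finish[Paris]"           # enough evidence gathered
    return "Search[capital of France]"   # otherwise, issue a search

def search_tool(query: str) -> str:
    """Placeholder retrieval tool."""
    return "France's capital is Paris."

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        action = llm_step(transcript)           # "Thought" + chosen "Action"
        if action.startswith("Finish["):
            return action[len("Finish["):-1]    # extract the final answer
        query = action[len("Search["):-1]
        observation = search_tool(query)        # execute the tool call
        transcript += f"Action: {action}\nObservation: {observation}\n"
    return "no answer"

print(react("What is the capital of France?"))  # → Paris
```

The key structural point is that the model, not a fixed pipeline, decides each step whether to search again or answer.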

  2. Retrieval-Augmented Generation (RAG): Systems that retrieve relevant documents and condition the LLM's generation on this retrieved context (REALM (Guu et al., 2020), RAG (Lewis et al., 2020), FiD (Izacard & Grave, 2021), RETRO (Borgeaud et al., 2022), Atlas (Izacard et al., 2023), Self-RAG (Asai et al., 2024), GraphRAG (Edge et al., 2024), CRAG (Yan et al., 2024)). RAG systems range from simple single-retrieval pipelines to sophisticated architectures with adaptive retrieval, re-ranking, and self-reflection. The fundamental insight is that retrieval can substitute for parametric scale -- RETRO with 7B parameters matches a 175B model without retrieval, and Atlas with 11B matches 540B PaLM. Modern RAG increasingly uses modular architectures (Gao et al., 2024) where each component (retrieval, reranking, filtering, generation, verification) can be independently optimized.
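The basic retrieve-then-generate pattern can be sketched as follows. This is a toy illustration, not any cited system: a word-overlap scorer stands in for a dense retriever, the three-document corpus is invented, and the final LLM call is left as a comment.

```python
# Toy RAG pipeline: rank documents against the query, then condition
# generation on the top-k retrieved context.

CORPUS = [
    "RETRO augments a 7B model with retrieval from a large text database.",
    "Paris is the capital of France.",
    "Atlas is a retrieval-augmented model with 11B parameters.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    q = set(query.lower().strip("?").split())
    return sorted(CORPUS, key=lambda d: -len(q & set(d.lower().split())))[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Condition generation on the retrieved context."""
    return "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

ctx = retrieve("What is the capital of France?")
prompt = build_prompt("What is the capital of France?", ctx)
# an LLM call on `prompt` would produce the context-grounded answer
print(ctx[0])  # → Paris is the capital of France.
```

Each stage here (retrieval, prompt construction, generation) corresponds to a module that, in modern modular RAG, can be swapped or optimized independently.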

  3. Multi-Hop Reasoning Search: Systems that perform iterative retrieval interleaved with reasoning, combining evidence across multiple search steps (IRCoT (Trivedi et al., 2023), FLARE (Jiang et al., 2023), DSP/DSPy (Khattab et al., 2023; Khattab et al., 2024), ITER-RETGEN (Shao et al., 2023), Adaptive-RAG (Jeong et al., 2024)). These methods are specifically designed for questions that cannot be answered from a single retrieval step. The compositionality gap (Press et al., 2023) -- the gap between single-hop and multi-hop reasoning performance -- motivates this entire family. DSPy has been particularly influential in systematizing the optimization of multi-hop pipelines, treating prompt engineering as a programming problem with automatic optimization.

  4. Agentic Web Search: Autonomous agents that browse the web and interact with web interfaces (WebGPT (Nakano et al., 2021), WebArena (Zhou et al., 2024), VisualWebArena (Koh et al., 2024), Mind2Web (Deng et al., 2024), BrowserGym (Drouin et al., 2024), WebVoyager (He et al., 2024)). These agents operate in the full complexity of the web, navigating pages, filling forms, and extracting information from diverse sources. The gap between human performance (~75-90% on these benchmarks) and agent performance (~15-30%) highlights the remaining challenge, though progress has been rapid with frontier models. Computer-use agents (Anthropic, 2024) extend this paradigm to general GUI interaction.
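Web-agent benchmarks share a common observe-act episode structure: the agent receives a text rendering of the page (often an accessibility tree) and emits an action. The toy environment and policy below are invented for illustration; they mimic the interface shape, not any specific benchmark's API.

```python
# Toy observe-act episode: pages map to (observation, available actions);
# the policy is a stand-in for an LLM prompted with the observation.

PAGES = {
    "home":    ("[link] Contact", {"click Contact": "contact"}),
    "contact": ("[text] email: info@example.com", {}),
}

def policy(observation: str) -> str:
    """Placeholder agent: a real system prompts an LLM with the observation."""
    if "[link] Contact" in observation:
        return "click Contact"
    return "stop"

def run_episode(start: str = "home", max_steps: int = 5) -> str:
    page = start
    for _ in range(max_steps):
        observation, actions = PAGES[page]
        action = policy(observation)
        if action == "stop":
            return observation      # final observation holds the answer
        page = actions[action]      # navigation transitions to a new page
    return ""

print(run_episode())  # → [text] email: info@example.com
```

The hard part in practice is entirely inside `policy`: real pages expose thousands of elements, and grounding an action in the right one is where current agents fall short of humans.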

  5. Search with Planning (MCTS + LLMs): Systems that use tree search and planning algorithms for systematic exploration of solution and reasoning spaces (Tree-of-Thought (Yao et al., 2023), RAP (Hao et al., 2023), LATS (Zhou et al., 2024), AlphaProof (DeepMind, 2024), Reflexion (Shinn et al., 2023), Self-Refine (Madaan et al., 2023)). These methods apply the principled exploration-exploitation framework from game-playing AI to search and reasoning problems. Best-of-N sampling with process reward models (Lightman et al., 2023) provides a simpler but effective alternative to full tree search, improving MATH accuracy from ~50% to ~78% with PRM-guided selection. The connection to test-time compute scaling (Snell et al., 2024) suggests that search at inference time can be as valuable as scaling pre-training compute.
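Best-of-N selection with a process reward model is the simplest member of this family and fits in a few lines. The candidate strings and score table below are invented stand-ins: in a real system, candidates are LLM samples and `prm_score` is a trained PRM.

```python
# Best-of-N: generate N candidate solutions, score each with a reward model,
# and keep the highest-scoring one. Candidates and scores here are toy data.

def prm_score(solution: str) -> float:
    """Placeholder PRM: higher means better-judged reasoning."""
    scores = {"wrong: 41": 0.1, "sloppy: 42": 0.6, "clean: 42": 0.9}
    return scores[solution]

def best_of_n(candidates: list[str]) -> str:
    return max(candidates, key=prm_score)

candidates = ["wrong: 41", "sloppy: 42", "clean: 42", "sloppy: 42"]
print(best_of_n(candidates))  # → clean: 42
```

Full tree search replaces this one-shot scoring with scoring of partial solutions, so the reward model can prune branches before they are fully expanded.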

  6. Search in Code and Mathematics: Specialized search systems for domains with verifiable outcomes, including competitive programming (AlphaCode (Li et al., 2022), AlphaCode 2 (DeepMind, 2023)), software engineering (SWE-Agent (Yang et al., 2024), OpenHands (Wang et al., 2024)), and formal theorem proving (AlphaProof (DeepMind, 2024), DeepSeek-Prover (Xin et al., 2024), ReProver (Yang et al., 2023)). Verifiability enables more effective search through test-based filtering and formal verification. The pass@k metric directly quantifies the value of search: the gap between pass@1 and pass@100 measures how much search budget improves outcomes. SWE-bench resolution rates have progressed from ~3% to over 50% in under two years, demonstrating rapid capability gains.
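The standard unbiased pass@k estimator (Chen et al., 2021) makes the pass@1 vs. pass@100 gap concrete: given n samples of which c pass the tests, it estimates the probability that at least one of k randomly drawn samples passes. The sample counts below are illustrative.

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), computed with exact
# integer binomials to avoid overflow and cancellation issues.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (from n samples, c passing)
    passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples with 10 passing: a larger search budget k sharply raises the
# chance that some sampled solution is correct.
print(round(pass_at_k(200, 10, 1), 3))  # → 0.05
print(round(pass_at_k(200, 10, 100), 3))
```

With a 5% per-sample success rate, drawing 100 samples makes success near-certain, which is exactly why verifiable domains reward large search budgets.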

  7. Self-Improving Search and Deep Research: Systems that conduct comprehensive, autonomous research through multi-turn search and synthesis (STORM (Shao et al., 2024), OpenAI Deep Research, Google Gemini Deep Research, Perplexity AI). These represent the most advanced form of agentic search, autonomously planning and executing complex research workflows that would take a human researcher hours or days. STORM simulates multi-perspective expert conversations to drive targeted searches, while deep research systems execute dozens to hundreds of searches with iterative refinement. AI-powered search engines like Perplexity and ChatGPT Search have brought this paradigm to millions of users.
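The multi-perspective idea behind STORM can be sketched as a query-planning step: each simulated perspective contributes its own targeted queries, and their union drives retrieval. The perspectives and query templates below are invented placeholders; STORM itself generates both with LLM-simulated expert conversations.

```python
# Toy multi-perspective research planner: expand a topic into targeted
# queries, one batch per simulated perspective.

def queries_for(topic: str, perspective: str) -> list[str]:
    """Placeholder query generator for one perspective."""
    return [f"{topic} {perspective} overview",
            f"{topic} {perspective} open problems"]

def research_plan(topic: str, perspectives: list[str]) -> list[str]:
    plan: list[str] = []
    for p in perspectives:
        plan.extend(queries_for(topic, p))  # each perspective widens coverage
    return plan

plan = research_plan("agentic search", ["evaluation", "safety", "systems"])
print(len(plan))  # → 6
```

Deep research systems iterate this expansion: search results from one round seed the perspectives and queries of the next, which is how they reach dozens to hundreds of searches per task.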

These categories are not mutually exclusive: modern systems often combine elements from multiple categories. For example, a deep research system might use RAG as its retrieval backbone, multi-hop reasoning for evidence synthesis, MCTS for exploring alternative research directions, and tool-augmented retrieval for accessing diverse information sources. The trend is toward increasingly autonomous systems that combine all of these capabilities with minimal human intervention.
