Taxonomy of Approaches

Agentic search methods span a wide spectrum of complexity and autonomy. Our top level deliberately mixes two axes: a capability axis (categories 1, 2, 3, and 5 distinguish methods by level of agency and search algorithm) and a domain axis (categories 4 and 6 group methods by application area, namely the open web and verifiable code/mathematics). Category 7 spans both, describing autonomous research systems that recombine the earlier capabilities. We adopt this hybrid cut because the domain-specific families (web browsing, code, and mathematics) carry distinctive benchmarks and evaluation conventions that a purely capability-based taxonomy would obscure. The categories are:

1. Tool-Augmented Retrieval: LLMs that can invoke search tools as part of their generation process (WebGPT (Nakano et al., 2021), Toolformer (Schick et al., 2023), ReAct (Yao et al., 2023), TALM (Parisi et al., 2022), Gorilla (Patil et al., 2023)). These represent the simplest form of agentic search: the model decides when to search and formulates its own queries, but typically performs only one or two search rounds. The key innovation is giving the LLM agency over when and how to search, rather than retrieving on every input. ReAct's interleaving of reasoning traces ("Thought") with actions ("Action") has become the standard design pattern for LLM agents, enabling interpretable and systematic search strategies. ToolLLM (Qin et al., 2024) provides the most comprehensive evaluation in this family, covering 16,000+ real-world APIs. The limited search budget, however, means these systems struggle on questions requiring deep multi-hop reasoning or synthesis across many sources.
2. Retrieval-Augmented Generation (RAG): Systems that retrieve relevant documents and condition the LLM's generation on this retrieved context (REALM (Guu et al., 2020), RAG (Lewis et al., 2020), FiD (Izacard & Grave, 2021), RETRO (Borgeaud et al., 2022), Atlas (Izacard et al., 2023), Self-RAG (Asai et al., 2024), GraphRAG (Edge et al., 2024), CRAG (Yan et al., 2024)). RAG systems range from simple single-retrieval pipelines to sophisticated architectures with adaptive retrieval, re-ranking, and self-reflection. The fundamental insight is that retrieval can substitute for parametric scale: RETRO at roughly 7B parameters performs comparably to a 175B-parameter model without retrieval on language modeling, and Atlas at 11B outperforms the 540B PaLM on few-shot Natural Questions. Modern RAG increasingly uses modular architectures (Gao et al., 2024) in which each component (retrieval, reranking, filtering, generation, verification) can be optimized independently. A key limitation is that even the most sophisticated RAG systems depend critically on the quality and coverage of the underlying retrieval corpus; they cannot answer questions about information outside their indexed documents.
3. Multi-Hop Reasoning Search: Systems that perform iterative retrieval interleaved with reasoning, combining evidence across multiple search steps (IRCoT (Trivedi et al., 2023), FLARE (Jiang et al., 2023), DSP/DSPy (Khattab et al., 2023, 2024), ITER-RETGEN (Shao et al., 2023), Adaptive-RAG (Jeong et al., 2024)). These methods are specifically designed for questions that cannot be answered from a single retrieval step. The compositionality gap (Press et al., 2023), which measures how often a model answers every sub-question of a multi-hop question correctly yet still fails the composed question, motivates this entire family. DSPy has been particularly influential in systematizing the optimization of multi-hop pipelines, treating prompt engineering as a programming problem with automatic optimization. The main drawback is increased latency and cost from multiple retrieval rounds, and errors can compound when intermediate retrieval steps return irrelevant or misleading information.
4. Agentic Web Search: Autonomous agents that browse the web and interact with web interfaces (WebArena (Zhou et al., 2024), VisualWebArena (Koh et al., 2024), Mind2Web (Deng et al., 2024), BrowserGym (Drouin et al., 2024), WebVoyager (He et al., 2024)). WebGPT (Nakano et al., 2021) is a boundary case between this family and category 1: we place it under Tool-Augmented Retrieval because it exposes the web through a fixed set of search-and-navigate commands rather than full open-ended browsing, but it can equally be read as an early agentic web searcher. These agents operate in the full complexity of the web, navigating pages, filling forms, and extracting information from diverse sources. The gap between human performance (~75-90% on these benchmarks) and agent performance (~15-30%) highlights the remaining challenge, though progress has been rapid with frontier models. Computer-use agents (Anthropic, 2024) extend this paradigm to general GUI interaction. The open-ended nature of web browsing, however, makes these systems fragile to unexpected page layouts, dynamic content, and authentication barriers that block programmatic access.
5. Search with Planning (MCTS + LLMs): Systems that use tree search and planning algorithms for systematic exploration of solution and reasoning spaces (Tree-of-Thought (Yao et al., 2023), RAP (Hao et al., 2023), LATS (Zhou et al., 2024), AlphaProof (DeepMind, 2024), Reflexion (Shinn et al., 2023), Self-Refine (Madaan et al., 2023)). These methods apply the principled exploration-exploitation framework from game-playing AI to search and reasoning problems. Best-of-N sampling with process reward models (Lightman et al., 2023) provides a simpler but effective alternative to full tree search: starting from a single-sample (pass@1) generator baseline of roughly 50% on MATH, PRM-guided best-of-N selection reaches approximately 78% (78.2% at N=1860), compared with approximately 70% (69.6%) for majority voting (self-consistency) at the same sample budget. The connection to test-time compute scaling (Snell et al., 2024) suggests that search at inference time can be as valuable as scaling pre-training compute. The primary cost is computational: tree search methods can require orders of magnitude more inference-time compute than single-path generation, making them impractical for latency-sensitive applications.
6. Search in Code and Mathematics: Specialized search systems for domains with verifiable outcomes, including competitive programming (AlphaCode (Li et al., 2022), AlphaCode 2 (DeepMind, 2023)), software engineering (SWE-Agent (Yang et al., 2024), OpenHands (Wang et al., 2024)), and formal theorem proving (AlphaProof (DeepMind, 2024), DeepSeek-Prover (Xin et al., 2024), ReProver (Yang et al., 2023)). Verifiability enables more effective search through test-based filtering and formal verification. The pass@k metric, the probability that at least one of k sampled solutions is correct, directly quantifies the value of search: the gap between pass@1 and pass@100 measures how much a larger search budget improves outcomes. SWE-bench resolution rates have progressed from roughly 3% to over 50% in under two years, demonstrating rapid capability gains. The reliance on automated verification, however, confines these methods to domains with executable tests or formal specifications, limiting applicability to open-ended design or exploratory programming tasks.
7. Self-Improving Search and Deep Research: Systems that conduct comprehensive, autonomous research through multi-turn search and synthesis. The published representative is STORM (Shao et al., 2024), which simulates multi-perspective expert conversations to drive targeted searches before synthesizing a long-form article. A parallel set of productized systems (OpenAI Deep Research, Google Gemini Deep Research, Perplexity AI, ChatGPT Search) applies the same idea at commercial scale, though without peer-reviewed method papers. These represent the most advanced form of agentic search, autonomously planning and executing complex research workflows that would take a human researcher hours or days. The defining contrast with earlier families is search budget: where tool-augmented retrieval (category 1) typically issues one or two queries, deep research systems execute dozens to hundreds of searches with iterative refinement, trading large amounts of inference-time compute for broader coverage and synthesis quality. The tradeoff is end-to-end latency and cost: completing a deep research task can require minutes to hours and hundreds of LLM calls, making interactive use cases difficult.
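
The pass@k metric mentioned under Search in Code and Mathematics can be made concrete with a short sketch. The code below uses the commonly cited unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), computed from n sampled solutions of which c pass verification; this estimator is not defined in this survey and is an assumption about how the cited works report the metric. The function names and the example counts are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Returns the probability that a random size-k subset of the n samples
    contains at least one correct solution.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets with no correct solution;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def search_gain(n: int, c: int, k_small: int = 1, k_large: int = 100) -> float:
    """Gap between pass@k_large and pass@k_small: the benefit of extra samples."""
    return pass_at_k(n, c, k_large) - pass_at_k(n, c, k_small)


# Hypothetical problem: 200 samples drawn, 10 of them pass the tests.
example_gain = search_gain(n=200, c=10)
```

With these illustrative counts, pass@1 is simply c/n, and `example_gain` is the amount by which drawing 100 samples instead of one raises the chance of finding a verified solution.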

These categories are not mutually exclusive: modern systems often combine elements from multiple categories. For example, a deep research system might use RAG as its retrieval backbone, multi-hop reasoning for evidence synthesis, MCTS for exploring alternative research directions, and tool-augmented retrieval for accessing diverse information sources. The trend is toward increasingly autonomous systems that combine all of these capabilities with minimal human intervention.
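
As one concrete instance of the inference-time search discussed under Search with Planning, the sketch below contrasts majority voting (self-consistency) with reward-model-guided best-of-N selection over the same pool of sampled solutions. It is a minimal illustration, not the implementation of any cited system: the per-step scores stand in for the output of a process reward model, scoring a solution by the product of its step scores is an assumed aggregation rule, and all names and numbers are hypothetical.

```python
from collections import Counter
from math import prod
from typing import List, Sequence, Tuple

# A sampled solution: (final answer, per-step correctness scores in [0, 1]).
# The scores are placeholders for what a process reward model would emit.
Sample = Tuple[str, List[float]]


def majority_vote(samples: Sequence[Sample]) -> str:
    """Return the most frequent final answer, ignoring reward scores."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def solution_score(step_scores: Sequence[float]) -> float:
    """Score a solution as the product of its step scores (assumed rule)."""
    return prod(step_scores)


def best_of_n(samples: Sequence[Sample]) -> str:
    """Return the final answer of the highest-scoring sampled solution."""
    best_answer, _ = max(samples, key=lambda s: solution_score(s[1]))
    return best_answer


# Hypothetical pool of N = 4 samples for a single problem.
samples: List[Sample] = [
    ("42", [0.90, 0.80, 0.95]),  # one well-supported solution
    ("41", [0.90, 0.30, 0.50]),  # three solutions sharing a weak step
    ("41", [0.60, 0.40, 0.70]),
    ("41", [0.50, 0.50, 0.20]),
]

voted = majority_vote(samples)
selected = best_of_n(samples)
```

On this toy pool the two selectors disagree: voting returns the answer that appears most often, while the reward-guided selector returns the answer of the single solution whose steps are all scored highly. This is the mechanism by which a verifier can outperform voting at a fixed sample budget.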
