Self-Improving Search & AI-Powered Search Engines

STORM

Shao et al. (2024) introduced STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective question asking), a system that writes Wikipedia-style articles through multi-turn search and synthesis. STORM first researches a topic by simulating conversations with LLM-generated topic experts, using each expert's questions to drive targeted searches. It then synthesizes the gathered information into a structured article with citations. The multi-perspective approach is crucial: by generating diverse expert personas (for example, a historian, an economist, and a social scientist), STORM produces questions from different angles, leading to more comprehensive coverage than a single-perspective approach. In a rubric-based human evaluation on the FreshWiki dataset, experienced Wikipedia editors rated STORM's articles as comparable in breadth and organization to real Wikipedia articles, although, qualitatively, the human-written articles still exhibited deeper analysis and more nuanced contextualization.
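The persona-driven research phase described above can be sketched as a loop over perspectives. This is a minimal illustration, not STORM's actual implementation: the function name `multi_perspective_research` and the `ask` and `search` callables are hypothetical stand-ins for an LLM question generator and a search backend.

```python
from typing import Callable, Dict, List


def multi_perspective_research(
    topic: str,
    personas: List[str],
    ask: Callable[[str, str, List[str]], str],
    search: Callable[[str], List[str]],
    turns: int = 2,
) -> Dict[str, List[str]]:
    """Gather evidence per persona by alternating questions and searches.

    `ask` and `search` are assumed interfaces (hypothetical): `ask` would
    wrap an LLM prompt, `search` would wrap a retrieval backend.
    """
    notes: Dict[str, List[str]] = {}
    for persona in personas:
        gathered: List[str] = []
        for _ in range(turns):
            # Each question is conditioned on what this persona has seen so far.
            question = ask(persona, topic, gathered)
            gathered.extend(search(question))
        notes[persona] = gathered
    return notes


# Toy stand-ins so the sketch runs without an LLM or a search API.
def toy_ask(persona: str, topic: str, gathered: List[str]) -> str:
    return f"As a {persona}, what about {topic}? (follow-up {len(gathered)})"


def toy_search(question: str) -> List[str]:
    return [f"snippet for: {question}"]


notes = multi_perspective_research(
    "the printing press", ["historian", "economist"], toy_ask, toy_search
)
```

Each persona accumulates its own evidence list, so a later synthesis step could merge the lists into one outline while keeping track of which perspective surfaced which snippet.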

Deep Research Systems

A major trend in agentic search is the development of "deep research" systems that autonomously conduct comprehensive research on complex topics, producing detailed reports with citations.

OpenAI Deep Research (OpenAI, 2025) uses o3-based reasoning with iterative web search to produce multi-page research reports. The system plans a research strategy, executes dozens to hundreds of searches, reads and synthesizes the results, and produces a structured report with citations, all autonomously. It is among the more capable deployed instantiations of agentic search, handling complex research questions that would take a human hours of searching and reading, though precise comparisons depend on the benchmark used, for example BrowseComp (Wei et al., 2025) or GAIA (Mialon, 2024).

Google Gemini Deep Research (Google, 2025) similarly combines Gemini's reasoning with Google Search to produce comprehensive research reports. The system creates a research plan visible to the user, then executes the plan autonomously.

The two differ mainly in user control over the research process: Gemini Deep Research surfaces an editable research plan before execution, giving the user a checkpoint to steer scope, whereas OpenAI Deep Research plans and executes more autonomously with less mid-run intervention. These systems demonstrate the practical viability of fully autonomous multi-step search for real-world research tasks. The key challenge they address is scope management: knowing when enough information has been gathered and when to stop searching, balancing thoroughness against computational cost.

Several research systems implement iterative search loops where the model alternates between retrieval, reading, and query refinement [@xu2024search, @kim2023tree]. Each iteration refines the model's understanding, enabling it to ask more targeted questions. The theoretical justification is clear: a single search query can only capture one facet of a complex information need, while iterative refinement progressively narrows the gap between the user's actual information need and the retrieved evidence. Empirically, iterative retrieval systems consistently outperform single-pass retrieval on multi-hop benchmarks: IRCoT (Trivedi et al., 2023) showed that interleaving retrieval with chain-of-thought reasoning improves multi-hop QA F1 by roughly 5 to 9 points on HotpotQA and MuSiQue (and up to 15 points on 2WikiMultihopQA) over single-pass retrieval. Key techniques include:

Query reformulation: Rewriting queries based on initial results to find more relevant information, using techniques ranging from rule-based expansion (adding synonyms, removing ambiguous terms) to LLM-based rewriting (Ma, 2023). HyDE (Gao et al., 2023) uses hypothetical document generation: the LLM generates a hypothetical answer to the question, and this hypothetical answer is used as the search query, often retrieving more relevant documents than the original question because the hypothetical answer is expressed in the vocabulary of potential target documents.
Gap detection: Identifying what information is still missing and generating targeted queries to fill those gaps. This requires the model to maintain an explicit representation of its current knowledge state and compare it against the requirements of the original question.
Evidence verification: Cross-checking claims against multiple sources before including them. When sources disagree, the system must assess source reliability and present the disagreement transparently.
Confidence-based stopping: Terminating search when the model's confidence in its answer exceeds a threshold. This connects to the optimal stopping problem from decision theory: searching too little produces incomplete answers, while searching too much wastes resources and can introduce noise from marginally relevant results.
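Gap detection and confidence-based stopping from the list above can be combined into one small loop. The sketch below is an illustration under simplifying assumptions, not any cited system's method: the question is represented as a list of named facets, coverage is decided by substring matching (a real system would likely use an LLM judgment), and `iterative_search`, `CORPUS`, and `toy_search` are hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple


def iterative_search(
    facets: List[str],
    search: Callable[[str], List[str]],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> Tuple[List[str], float, int]:
    """Search until enough facets of the question are covered.

    Confidence is approximated as the fraction of facets covered by the
    evidence so far (an assumption made for this sketch).
    """
    if not facets:
        return [], 1.0, 0
    covered: Set[str] = set()
    evidence: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        # Confidence-based stopping: quit once coverage passes the threshold.
        if len(covered) / len(facets) >= threshold:
            break
        # Gap detection: compare the knowledge state against the requirements.
        gaps = [f for f in facets if f not in covered]
        docs = search(gaps[0])  # targeted query for the first open gap
        rounds += 1
        evidence.extend(docs)
        for facet in gaps:
            if any(facet in doc for doc in docs):
                covered.add(facet)
    return evidence, len(covered) / len(facets), rounds


# Toy corpus keyed by query, standing in for a real search backend.
CORPUS: Dict[str, List[str]] = {
    "birthplace": ["doc A mentions birthplace and field"],
    "award": ["doc B mentions award"],
}


def toy_search(query: str) -> List[str]:
    return CORPUS.get(query, [])


evidence, confidence, rounds = iterative_search(
    ["birthplace", "field", "award"], toy_search
)
```

The `max_rounds` cap bounds cost when a gap can never be filled, which reflects the trade-off between thoroughness and wasted searches noted in the stopping criterion.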

Autonomous Agent Frameworks

The emergence of autonomous agent frameworks has catalyzed research and development in agentic search. Auto-GPT (Significant Gravitas, 2023) popularized the concept of autonomous LLM agents that can break down goals into sub-tasks, execute them through tool use, and self-improve through reflection. Although Auto-GPT's initial performance was limited, it spurred intense research interest in autonomous agents.

Voyager (Wang et al., 2023) demonstrated an open-ended embodied agent in Minecraft that uses LLMs for task decomposition, code generation for skills, and a skill library for lifelong learning. Voyager's approach (decomposing goals, generating executable skills, and storing successful skills for reuse) provides a template for self-improving agentic systems that accumulate capabilities over time. The skill library acts as a form of procedural memory that grows with experience, enabling the agent to tackle increasingly complex tasks.
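The store-and-reuse pattern behind a skill library can be sketched in a few lines. This is a simplified illustration rather than Voyager's code: skills are looked up by exact task name here purely to keep the example short, and `SkillLibrary`, `solve`, `synthesize`, and `verify` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

Skill = Callable[[], str]


class SkillLibrary:
    """Procedural memory: verified skills are stored under a task name."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def add(self, name: str, skill: Skill) -> None:
        self._skills[name] = skill

    def get(self, name: str) -> Optional[Skill]:
        return self._skills.get(name)

    def __len__(self) -> int:
        return len(self._skills)


def solve(
    task: str,
    library: SkillLibrary,
    synthesize: Callable[[str], Skill],
    verify: Callable[[Skill], bool],
) -> Optional[str]:
    """Reuse a stored skill if one exists; otherwise generate and store one."""
    skill = library.get(task)
    if skill is None:
        candidate = synthesize(task)  # stand-in for LLM code generation
        if not verify(candidate):     # only successful skills are kept
            return None
        library.add(task, candidate)
        skill = candidate
    return skill()


# Toy generator that records how often new skills had to be written.
synth_calls: List[str] = []


def toy_synthesize(task: str) -> Skill:
    synth_calls.append(task)
    return lambda: f"done: {task}"


library = SkillLibrary()
first = solve("mine wood", library, toy_synthesize, lambda s: True)
second = solve("mine wood", library, toy_synthesize, lambda s: True)
```

The second call is served from the library without invoking the generator again, which is the sense in which capabilities accumulate over time.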

LangChain and LlamaIndex have emerged as the dominant open-source frameworks for building RAG and agentic search applications. LangChain provides composable abstractions for chains (sequential LLM calls), agents (LLMs with tool access), and retrieval pipelines, enabling rapid prototyping of agentic systems. LlamaIndex focuses specifically on data indexing and retrieval, providing specialized abstractions for ingesting, structuring, and querying diverse data sources. These frameworks have been instrumental in the widespread adoption of agentic search techniques in production applications, though they primarily serve as engineering tools rather than research contributions. A recurring practitioner critique is that their layered abstractions add overhead and reduce transparency: the chain and agent wrappers can obscure the underlying prompts and control flow, making failures harder to trace and debug than equivalent hand-written orchestration. CrewAI and AutoGen (Wu et al., 2023) extend the single-agent paradigm to multi-agent collaboration, where multiple specialized agents coordinate to solve complex research tasks.

These frameworks trade off generality against control: Auto-GPT and the single-agent loop maximize autonomy but are hard to steer and prone to drifting off-task, LangChain and LlamaIndex expose composable primitives that hand control back to the developer at the cost of more engineering, and multi-agent systems such as AutoGen and CrewAI gain parallelism and role specialization while introducing coordination overhead and additional failure modes from inter-agent communication.

AI-Powered Search Engines

The agentic search paradigm has given rise to a new category of consumer products: AI-powered search engines that combine traditional web search with LLM-based synthesis.

Perplexity (Perplexity AI, 2024) pioneered this category, providing a search engine that returns synthesized answers with inline citations rather than a list of links. Perplexity's architecture combines web search (multiple search engines), document reading (extracting relevant content from search results), and LLM synthesis (generating a coherent answer with citations). The system iteratively searches and refines when initial results are insufficient.
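The three stages named above (search, reading, synthesis with citations) can be illustrated with a toy pipeline. This is a sketch under heavy simplifying assumptions and does not reflect Perplexity's actual system: retrieval is plain term overlap over an in-memory list, synthesis just concatenates extracted sentences, and `Source`, `retrieve`, `read`, `answer`, and the example.com URLs are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Source:
    url: str
    text: str


def _terms(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, index: List[Source], k: int = 2) -> List[Source]:
    """Search stage: rank sources by term overlap and keep the top k hits."""
    q = _terms(query)
    scored = [(len(q & _terms(s.text)), s) for s in index]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [s for score, s in ranked if score > 0][:k]


def read(query: str, source: Source) -> str:
    """Reading stage: extract the sentence that best matches the query."""
    q = _terms(query)
    sentences = [s.strip() for s in source.text.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & _terms(s)))


def answer(query: str, index: List[Source]) -> str:
    """Synthesis stage: join extracted sentences, each with a citation marker."""
    sources = retrieve(query, index)
    parts = [f"{read(query, s)} [{i}]" for i, s in enumerate(sources, 1)]
    refs = [f"[{i}] {s.url}" for i, s in enumerate(sources, 1)]
    return " ".join(parts) + "\n" + "\n".join(refs)


index = [
    Source("https://example.com/a",
           "Solar panels convert sunlight into electricity. They are mounted on roofs."),
    Source("https://example.com/b", "Wind turbines convert wind into electricity."),
    Source("https://example.com/c", "Bread is baked in an oven."),
]
result = answer("how do solar panels make electricity", index)
```

Because every sentence in the answer is tied to a numbered source, a reader (or an automated checker) can trace each claim back to the document it came from.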

SearchGPT / ChatGPT Search (OpenAI, 2024) integrates web search directly into ChatGPT, allowing the model to search the web when its parametric knowledge is insufficient and cite the sources it uses. This represents the convergence of conversational AI with information retrieval.

The two products embody different design priorities: Perplexity is search-first, defaulting to an explicit retrieve-read-synthesize pipeline with inline citations on essentially every answer, which favors citation grounding and breadth of sources; ChatGPT Search is conversation-first, invoking retrieval only when the model judges its parametric knowledge insufficient, which favors fluency and latency at the cost of less consistent source attribution.
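The difference between the two designs comes down to when retrieval is triggered, which can be written as a small routing rule. This is only an illustration of the distinction drawn above, not a description of either product's internals: `should_retrieve`, the mode names, the `self_confidence` callable, and the 0.7 threshold are all assumptions made for the sketch.

```python
from typing import Callable


def should_retrieve(
    query: str,
    mode: str,
    self_confidence: Callable[[str], float],
    threshold: float = 0.7,
) -> bool:
    """Decide whether to call the search backend for this query."""
    if mode == "search_first":
        # Search-first: every answer goes through retrieval.
        return True
    if mode == "conversation_first":
        # Conversation-first: retrieve only when the model seems unsure.
        return self_confidence(query) < threshold
    raise ValueError(f"unknown mode: {mode}")


def toy_confidence(query: str) -> float:
    # Hypothetical estimate: treat time-sensitive questions as low confidence.
    return 0.2 if "today" in query.lower() else 0.9


always = should_retrieve("capital of France", "search_first", toy_confidence)
skipped = should_retrieve("capital of France", "conversation_first", toy_confidence)
fresh = should_retrieve("weather today", "conversation_first", toy_confidence)
```

Under this framing, inconsistent source attribution in the conversation-first design follows directly from the rule: any query answered without retrieval has no source to cite.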

These products demonstrate that agentic search has crossed from research into deployment, with millions of users relying on LLM-mediated search daily. Key challenges include: accuracy of citations (the cited source must actually support the claim), handling of conflicting information across sources, freshness of information, and the economic model for compensating original content creators whose work is synthesized. The development of long-context RAG techniques (Jiang et al., 2024) that can process entire documents rather than short chunks may further improve the quality of AI-powered search by reducing fragmentation of evidence.

Evaluation Across Families

The systems surveyed above are evaluated against largely disjoint benchmarks, which makes head-to-head comparison difficult and is itself a recurring limitation of the area. Three instruments dominate. FreshWiki (Shao et al., 2024) evaluates long-form report generation against curated Wikipedia articles, combining automatic overlap metrics with rubric-based human judgments of breadth, organization, and depth; it is the natural fit for STORM-style synthesis systems but says little about live retrieval skill. GAIA (Mialon, 2024) tests general assistant capability through multi-step questions that require tool use and reasoning, scored by exact-match against short ground-truth answers, and is widely used to position autonomous agents and deep-research systems. BrowseComp (Wei et al., 2025) targets browsing agents specifically, posing questions whose answers are hard to find but easy to verify, which isolates persistent multi-hop web navigation from synthesis quality. No single benchmark spans retrieval skill, reasoning depth, and report quality at once, so reported gains should be read against the named instrument rather than as a global ranking, and the field still lacks a shared evaluation that ties these families together.
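Exact-match scoring against short ground-truth answers, as described for GAIA above, can be sketched as follows. The normalization here (lowercasing, stripping punctuation, collapsing whitespace) is an assumption chosen for illustration and is not the official GAIA scorer; `normalize` and `exact_match_accuracy` are hypothetical names.

```python
import re
import string
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (simplified)."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def exact_match_accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of predictions equal to the gold answer after normalization."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)


score = exact_match_accuracy(
    ["Paris.", " 42 ", "a blue whale"],
    ["paris", "42", "Blue  Whale!"],
)
```

The third prediction fails only because of the extra article, which shows why exact match suits short factual answers but cannot assess the long-form reports that FreshWiki-style rubrics target. Stripping punctuation also changes decimal numbers, so a real scorer would need to handle numeric answers separately.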
