Self-Improving Search & AI-Powered Search Engines
STORM
Shao et al. (2024) introduced STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective question asking), a system that writes Wikipedia-style articles through multi-turn search and synthesis. STORM first researches a topic by simulating conversations with topic experts (generated by the LLM), using each expert's questions to drive targeted searches. It then synthesizes the gathered information into a structured article with proper citations. The multi-perspective approach is crucial: by generating diverse expert personas (a historian, an economist, a social scientist), STORM produces questions from different angles, leading to more comprehensive coverage than a single-perspective approach. In a blind evaluation against real Wikipedia articles, expert annotators rated STORM's articles as comparable in breadth and organization, though human-written articles still exhibited deeper analysis and more nuanced contextualization. The follow-up system Co-STORM extended this approach with discourse management and more sophisticated outline generation, achieving article quality comparable to human-written Wikipedia articles on some topics.
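The persona-driven questioning step can be sketched as follows. This is an illustrative reconstruction, not STORM's actual implementation: `ask_llm` is a hypothetical stand-in for a real LLM call and returns canned questions so the example is self-contained.

```python
def ask_llm(prompt: str) -> list[str]:
    """Hypothetical LLM wrapper; returns canned persona-flavoured questions."""
    if "historian" in prompt:
        return ["How did the topic originate?", "What earlier events shaped it?"]
    if "economist" in prompt:
        return ["What are the economic effects?", "Who bears the costs?"]
    return ["What are the social consequences?"]

def generate_research_questions(topic: str, personas: list[str]) -> dict[str, list[str]]:
    """Each simulated expert persona asks questions from its own angle;
    the union of their questions drives targeted searches for broader coverage."""
    questions = {}
    for persona in personas:
        prompt = f"You are a {persona} researching '{topic}'. Ask two questions."
        questions[persona] = ask_llm(prompt)
    return questions

qs = generate_research_questions(
    "universal basic income", ["historian", "economist", "social scientist"]
)
```

The key design point is that coverage comes from the diversity of personas, not from any single question being especially good.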
Deep Research Systems
A major trend in agentic search is the development of "deep research" systems that autonomously conduct comprehensive research on complex topics, producing detailed reports with citations.
OpenAI Deep Research (2025) uses o3-based reasoning with iterative web search to produce multi-page research reports. The system plans a research strategy, executes dozens to hundreds of searches, reads and synthesizes the results, and produces a structured report with citations -- all autonomously. It is among the most capable current instantiations of agentic search, handling complex research questions that would take a human hours of searching and reading.
Google Gemini Deep Research (2025) similarly combines Gemini's reasoning with Google Search to produce comprehensive research reports. The system creates a research plan visible to the user, then executes the plan autonomously.
These systems demonstrate the practical viability of fully autonomous multi-step search for real-world research tasks. The key challenge they address is scope management: knowing when enough information has been gathered and when to stop searching, balancing thoroughness against computational cost.
Iterative Retrieval and Refinement
Several research systems implement iterative search loops where the model alternates between retrieval, reading, and query refinement [@xu2024search, @kim2023tree]. Each iteration refines the model's understanding, enabling it to ask more targeted questions. The theoretical justification is clear: a single search query can only capture one facet of a complex information need, while iterative refinement progressively narrows the gap between the user's actual information need and the retrieved evidence. Empirically, iterative retrieval systems consistently outperform single-pass retrieval on multi-hop benchmarks: IRCoT (Trivedi et al., 2023) showed that interleaving retrieval with chain-of-thought reasoning improves multi-hop QA accuracy by 15-20% over single-pass retrieval on HotpotQA and MuSiQue. Key techniques include:
- Query reformulation: Rewriting queries based on initial results to find more relevant information, using techniques ranging from rule-based expansion (adding synonyms, removing ambiguous terms) to LLM-based rewriting (Ma, 2023). HyDE (Gao et al., 2023) uses hypothetical document generation: the LLM generates a hypothetical answer to the question, and this hypothetical answer is used as the search query, often retrieving more relevant documents than the original question because the hypothetical answer is expressed in the vocabulary of potential target documents.
- Gap detection: Identifying what information is still missing and generating targeted queries to fill those gaps. This requires the model to maintain an explicit representation of its current knowledge state and compare it against the requirements of the original question.
- Evidence verification: Cross-checking claims against multiple sources before including them. When sources disagree, the system must assess source reliability and present the disagreement transparently.
- Confidence-based stopping: Terminating search when the model's confidence in its answer exceeds a threshold. This connects to the optimal stopping problem from decision theory: searching too little produces incomplete answers, while searching too much wastes resources and can introduce noise from marginally relevant results.
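These techniques compose into a single loop, sketched below with a toy keyword corpus standing in for a real search backend. The terms-as-sets representation, `CORPUS`, and the overlap heuristic are illustrative assumptions, not any particular system's design; a real system would use an LLM for reformulation and a learned confidence estimate for stopping.

```python
# Toy corpus: each document maps to its set of index terms.
CORPUS = {
    "Marie Curie won the Nobel Prize in Physics in 1903.":
        {"curie", "nobel", "physics", "1903"},
    "Marie Curie also won the Nobel Prize in Chemistry in 1911.":
        {"curie", "nobel", "chemistry", "1911"},
}

def search(query: set[str]) -> list[str]:
    """Toy keyword search: return documents sharing any term with the query."""
    return [doc for doc, terms in CORPUS.items() if query & terms]

def iterative_search(need: set[str], max_rounds: int = 5) -> list[str]:
    """Retrieve, read, detect gaps, reformulate, and stop when covered."""
    evidence, covered = [], set()
    query = set(need)
    for _ in range(max_rounds):
        for doc in search(query):
            if doc not in evidence:
                evidence.append(doc)
                covered |= CORPUS[doc] & need
        gap = need - covered      # gap detection: what is still missing?
        if not gap:               # confidence-based stopping criterion
            break
        query = gap               # query reformulation targets the gap
    return evidence

docs = iterative_search({"curie", "physics", "chemistry"})
```

The same skeleton underlies IRCoT-style interleaving: the "gap" there is computed by a reasoning step over the chain of thought rather than by set subtraction.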
Autonomous Agent Frameworks
The emergence of autonomous agent frameworks has catalyzed research and development in agentic search. Auto-GPT (Significant Gravitas, 2023) popularized the concept of autonomous LLM agents that can break down goals into sub-tasks, execute them through tool use, and self-improve through reflection. While Auto-GPT's initial performance was limited, it sparked intense research interest in autonomous agents.
Voyager (Wang et al., 2023) demonstrated an open-ended embodied agent in Minecraft that uses LLMs for task decomposition, code generation for skills, and a skill library for lifelong learning. Voyager's approach -- decomposing goals, generating executable skills, and storing successful skills for reuse -- provides a template for self-improving agentic systems that accumulate capabilities over time. The skill library acts as a form of procedural memory that grows with experience, enabling the agent to tackle increasingly complex tasks.
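The skill-library idea can be sketched in a few lines. This is a hypothetical simplification, not Voyager's code: skill bodies are plain Python functions standing in for LLM-generated programs, and retrieval is naive name matching rather than Voyager's embedding-based lookup.

```python
class SkillLibrary:
    """Procedural memory: verified skills stored for later reuse."""

    def __init__(self):
        self._skills: dict[str, callable] = {}

    def add(self, name: str, fn) -> None:
        """Store a skill only after it has been verified to succeed."""
        self._skills[name] = fn

    def retrieve(self, task: str) -> list:
        """Naive retrieval: match skill names mentioned in the task.
        Real systems embed skill descriptions and use vector search."""
        return [fn for name, fn in self._skills.items() if name in task]

library = SkillLibrary()
library.add("craft_pickaxe", lambda: "pickaxe")
library.add("mine_iron", lambda: "iron_ore")

matches = library.retrieve("mine_iron near the cave")
result = matches[0]() if matches else None
```

Because the library only admits skills that have already succeeded, its growth directly tracks the agent's verified competence.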
LangChain and LlamaIndex have emerged as the dominant open-source frameworks for building RAG and agentic search applications. LangChain provides composable abstractions for chains (sequential LLM calls), agents (LLMs with tool access), and retrieval pipelines, enabling rapid prototyping of agentic systems. LlamaIndex focuses specifically on data indexing and retrieval, providing specialized abstractions for ingesting, structuring, and querying diverse data sources. These frameworks have been instrumental in the widespread adoption of agentic search techniques in production applications, though they primarily serve as engineering tools rather than research contributions. CrewAI and AutoGen (Wu et al., 2023) extend the single-agent paradigm to multi-agent collaboration, where multiple specialized agents coordinate to solve complex research tasks.
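At their core, the agent abstractions these frameworks provide wrap one loop: the LLM chooses a tool, the runtime executes it, and the observation is fed back. The framework-agnostic sketch below makes that loop explicit; it deliberately uses no real framework API, and `fake_llm` is a canned stand-in policy rather than an actual model.

```python
def search_tool(query: str) -> str:
    """Stub tool: a real agent would call a search API here."""
    return f"results for '{query}'"

TOOLS = {"search": search_tool}

def fake_llm(history: list[str]) -> tuple[str, str]:
    """Canned policy standing in for an LLM: search once, then finish."""
    if not any(h.startswith("observation:") for h in history):
        return ("search", "agentic search frameworks")
    return ("finish", "done")

def run_agent(goal: str, max_steps: int = 4) -> list[str]:
    """The generic agent loop: choose action, execute tool, observe, repeat."""
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        action, arg = fake_llm(history)
        if action == "finish":
            history.append(f"answer: {arg}")
            break
        history.append(f"observation: {TOOLS[action](arg)}")
    return history

trace = run_agent("survey agentic search")
```

Frameworks differ mainly in what they layer on top of this loop: prompt templating, memory, retries, and (for CrewAI and AutoGen) routing the history between multiple agents.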
AI-Powered Search Engines
The agentic search paradigm has given rise to a new category of consumer products: AI-powered search engines that combine traditional web search with LLM-based synthesis.
Perplexity AI pioneered this category, providing a search engine that returns synthesized answers with inline citations rather than a list of links. Perplexity's architecture combines web search (multiple search engines), document reading (extracting relevant content from search results), and LLM synthesis (generating a coherent answer with citations). The system iteratively searches and refines when initial results are insufficient.
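The synthesis stage's citation bookkeeping can be sketched as follows. This is an illustrative reconstruction, not Perplexity's implementation: retrieval and generation are assumed to have already produced (claim, source) pairs, and the function only attaches numbered inline markers and deduplicates the source list.

```python
def synthesize(claims_with_sources: list[tuple[str, str]]) -> tuple[str, list[str]]:
    """Attach [n] markers to claims and build the matching numbered source list."""
    sources: list[str] = []
    sentences: list[str] = []
    for claim, source in claims_with_sources:
        if source not in sources:
            sources.append(source)       # first use of a source gets the next number
        n = sources.index(source) + 1
        sentences.append(f"{claim} [{n}]")
    return " ".join(sentences), sources

answer, sources = synthesize([
    ("Synthesized answers replace the list of links.", "example.com/a"),
    ("Each claim carries an inline citation.", "example.com/b"),
    ("Markers are reused when a source repeats.", "example.com/a"),
])
```

Keeping a per-claim source mapping, rather than citing at the end of the answer, is what makes claim-level verification possible downstream.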
SearchGPT / ChatGPT Search (OpenAI) integrates web search directly into ChatGPT, allowing the model to search the web when its parametric knowledge is insufficient and cite the sources it uses. This represents the convergence of conversational AI with information retrieval.
These products demonstrate that agentic search has crossed from research into deployment, with millions of users relying on LLM-mediated search daily. Key challenges include: accuracy of citations (the cited source must actually support the claim), handling of conflicting information across sources, freshness of information, and the economic model for compensating original content creators whose work is synthesized. The development of long-context RAG techniques (Jiang et al., 2024) that can process entire documents rather than short chunks may further improve the quality of AI-powered search by reducing fragmentation of evidence.
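The citation-accuracy check named above can be sketched as a claim-versus-source support test. The token-overlap heuristic below is a toy assumption purely for illustration; production systems use a trained NLI/entailment model for this judgement.

```python
def supports(claim: str, source_text: str, threshold: float = 0.6) -> bool:
    """Toy support check: does the source cover most of the claim's tokens?"""
    claim_tokens = set(claim.lower().split())
    source_tokens = set(source_text.lower().split())
    overlap = len(claim_tokens & source_tokens) / max(len(claim_tokens), 1)
    return overlap >= threshold

def audit_citations(cited: list[tuple[str, str]]) -> list[str]:
    """Return the claims whose cited source fails the support check."""
    return [claim for claim, src in cited if not supports(claim, src)]

bad = audit_citations([
    ("the eiffel tower is in paris",
     "the eiffel tower is a landmark in paris"),
    ("the eiffel tower is in london",
     "london is home to big ben"),
])
```

Any flagged claim would then be re-searched or dropped, rather than shipped with a citation that does not back it.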
References
- Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels. ACL.
- Significant Gravitas (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub.
- Ziyan Jiang, Xueguang Ma, Wenhu Chen (2024). LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. arXiv.
- Xinbei Ma (2023). Query Rewriting for Retrieval-Augmented Large Language Models. EMNLP.
- Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, Monica S. Lam (2024). Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models. NAACL.
- Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. ACL.
- Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv.
- Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, Chi Wang (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv.