Benchmarks & Evaluation
Evaluating agentic search systems is challenging because they operate at the intersection of retrieval, reasoning, generation, and tool use. Different benchmarks test different capabilities, and no single benchmark captures the full range of agentic search behaviors.
Retrieval Benchmarks
- KILT (Knowledge Intensive Language Tasks) (Petroni et al., 2021): A unified benchmark covering fact-checking, question answering, slot filling, entity linking, and dialogue, all requiring retrieval from Wikipedia. KILT's unified format enables fair comparison across diverse retrieval-augmented tasks. KILT evaluates both retrieval quality (R-precision, recall@5) and downstream task performance, revealing that retrieval quality is often the bottleneck: systems with identical readers but different retrievers show large performance gaps.
- BEIR (Benchmarking IR) (Thakur et al., 2021): A heterogeneous benchmark of 18 retrieval datasets spanning diverse domains (biomedical, financial, scientific) and task types (passage retrieval, question answering, fact verification). BEIR evaluates zero-shot retrieval performance, testing whether retrieval models generalize across domains without domain-specific training. A key finding from BEIR is that dense retrievers like DPR (Karpukhin et al., 2020), despite strong in-domain performance, often underperform BM25 on out-of-domain datasets -- motivating the development of more robust dense retrievers like DRAGON (Lin et al., 2023) and instruction-tuned retrievers like INSTRUCTOR (Su et al., 2023).
- MTEB (Massive Text Embedding Benchmark) (Muennighoff et al., 2023): The most comprehensive embedding evaluation, covering retrieval, classification, clustering, and semantic similarity across 56+ datasets in multiple languages. MTEB has become the standard leaderboard for text embedding models, revealing that no single model dominates all tasks: models tuned for retrieval may underperform on clustering, and vice versa.
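The retrieval metrics KILT reports are simple enough to compute directly. A minimal sketch (document IDs and rankings are toy inputs):

```python
def r_precision(retrieved, relevant):
    """R-precision: precision at rank R, where R is the number of
    relevant documents for the query."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return len(set(retrieved[:r]) & set(relevant)) / r

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant documents found among the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# One relevant doc in the top 2, both in the top 5.
print(r_precision(["d3", "d1", "d7", "d2", "d9"], ["d1", "d2"]))   # 0.5
print(recall_at_k(["d3", "d1", "d7", "d2", "d9"], ["d1", "d2"]))   # 1.0
```

Because R varies per query, R-precision adapts the cutoff to each query's number of gold documents, which is why KILT pairs it with a fixed-cutoff recall.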
Question Answering Benchmarks
- HotpotQA (Yang et al., 2018): Multi-hop question answering requiring reasoning over two or more Wikipedia paragraphs. HotpotQA provides both the question and the supporting facts, enabling evaluation of both answer quality and evidence selection. The "bridge" subset is particularly challenging, requiring systems to first retrieve one passage, extract an intermediate entity, and then retrieve a second passage about that entity -- a two-hop retrieval chain that tests genuine multi-step reasoning.
- FEVER (Fact Extraction and VERification) (Thorne et al., 2018): Requires systems to classify claims as Supported, Refuted, or Not Enough Info based on Wikipedia evidence, testing both retrieval and reasoning about evidence. The best systems achieve over 90% label accuracy but rely heavily on retrieval quality: when the correct evidence is retrieved, classification is relatively easy; the challenge is finding the right evidence among millions of passages.
- Natural Questions (Kwiatkowski et al., 2019): Real Google search queries with answers from Wikipedia, providing a realistic evaluation of open-domain question answering. Natural Questions distinguishes between short answers (specific entities or phrases) and long answers (paragraph-level), testing different granularities of retrieval and extraction.
- TriviaQA (Joshi et al., 2017): Large-scale QA dataset with question-evidence pairs from trivia websites, testing retrieval over large document collections. The questions often require specialized knowledge, making parametric-only approaches insufficient for many queries.
- MuSiQue (Trivedi et al., 2022): Multi-hop questions specifically designed to be unanswerable by single-hop retrieval, requiring genuine multi-step reasoning. MuSiQue includes "answerable" and "unanswerable" variants, where the unanswerable questions have one supporting fact replaced with a contradicting one, testing whether systems can distinguish between sufficient and insufficient evidence.
- MMLU (Hendrycks et al., 2021): While primarily a knowledge benchmark, MMLU tests whether retrieval augmentation can improve performance on 57 academic subjects, revealing how much of LLM performance comes from parametric vs. retrievable knowledge. Retrieval augmentation typically improves performance by 2-5% on knowledge-intensive subjects (history, science) but has minimal effect on reasoning-intensive subjects (math, logic).
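The two-hop "bridge" chain that HotpotQA's bridge subset tests can be sketched as follows. The retriever and entity extractor below are toy stand-ins; a real system would use BM25 or a dense retriever for the former and an LLM or NER step for the latter:

```python
# Toy two-passage corpus standing in for Wikipedia.
corpus = {
    "Inception": "Inception was directed by Christopher Nolan.",
    "Christopher Nolan": "Christopher Nolan was born in London.",
}

def toy_retrieve(query):
    """Toy retriever: return the passage whose title appears latest
    in the query string."""
    best, best_pos = None, -1
    for title, text in corpus.items():
        pos = query.lower().rfind(title.lower())
        if pos > best_pos:
            best, best_pos = text, pos
    return best

def toy_extract(question, passage):
    """Toy bridge-entity extractor: grab the name after 'directed by'."""
    return passage.split("directed by ")[1].rstrip(".")

def two_hop_retrieve(question, retrieve, extract_bridge_entity):
    """The bridge pattern: retrieve, extract the intermediate entity,
    then retrieve again conditioned on that entity."""
    first_hop = retrieve(question)
    entity = extract_bridge_entity(question, first_hop)
    second_hop = retrieve(question + " " + entity)
    return [first_hop, second_hop]

hops = two_hop_retrieve("Where was the director of Inception born?",
                        toy_retrieve, toy_extract)
# hops[0] mentions the director; hops[1] answers where he was born.
```

The key property being tested is that neither passage alone answers the question: the second query cannot even be formed until the bridge entity is extracted from the first hop.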
Retrieval-Augmented Generation Benchmarks
- CRAG (Comprehensive RAG Benchmark) (Yang et al., 2024): A purpose-built benchmark for evaluating end-to-end RAG systems across eight domains (finance, sports, music, movies, etc.) with questions of varying complexity (simple factoid, comparison, aggregation, post-processing). CRAG provides mock API access to web search and knowledge graphs, enabling reproducible evaluation of RAG pipelines.
- RAGAS (Es et al., 2024): A framework for reference-free evaluation of RAG systems, measuring faithfulness (are generated answers grounded in retrieved context?), answer relevancy (does the answer address the question?), and context relevancy (is the retrieved context useful?). RAGAS enables evaluation without ground-truth answers, using LLM judges to assess generation quality.
- RGB (Retrieval-Augmented Generation Benchmark) evaluates RAG systems specifically on their ability to handle noisy retrieval, counterfactual information, and outdated knowledge -- common failure modes in deployed RAG systems.
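RAGAS proper decomposes the answer into claims and verifies each with an LLM judge. A deliberately crude, self-contained lexical proxy illustrates the shape of a reference-free faithfulness score; this is an illustration of the idea, not the RAGAS implementation:

```python
STOPWORDS = frozenset({"the", "a", "an", "is", "was", "of", "in", "to", "and"})

def faithfulness_proxy(answer, context):
    """Lexical stand-in for RAGAS-style faithfulness: the fraction of
    content tokens in the answer that also appear in the retrieved
    context. A real implementation verifies claims with an LLM judge;
    token overlap merely sketches 'groundedness' as a [0, 1] score."""
    answer_tokens = {t for t in answer.lower().split() if t not in STOPWORDS}
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# Fully grounded answer scores 1.0; a hallucinated entity lowers the score.
print(faithfulness_proxy("paris is the capital of france",
                         "paris is the capital city of france"))  # 1.0
print(faithfulness_proxy("berlin is the capital",
                         "paris is the capital of france"))       # 0.5
```

The important structural point is that no gold answer appears anywhere: the score is computed purely from the generated answer and the retrieved context, which is what makes RAGAS-style evaluation reference-free.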
Agent Benchmarks
- WebArena (Zhou et al., 2024): Realistic web interaction tasks on self-hosted websites covering shopping, forums, code management, and maps. Tests complex multi-step web navigation and task completion. The best LLM agents (GPT-4 based) achieve approximately 14% task success rate, compared to 78% for humans, highlighting the substantial gap between current agents and human-level web interaction.
- VisualWebArena (Koh et al., 2024): Extension of WebArena requiring visual understanding of web pages (interpreting images, charts, visual layouts). Multimodal agents achieve only 16% success rate, revealing that visual grounding on web pages remains a major challenge.
- Mind2Web (Deng et al., 2024): Real-world web task completion across 137 diverse websites, with human action traces for training. The benchmark covers tasks of varying complexity across three domains: travel, shopping, and information search.
- GAIA (Mialon et al., 2024): General AI assistant benchmark requiring multi-step tool use and reasoning, designed to be easy for humans but hard for AI. GAIA questions are stratified into three difficulty levels: Level 1 (single-step, ~5 tools), Level 2 (multi-step, ~10 tools), and Level 3 (complex multi-step with implicit constraints). Human annotators achieve 92% accuracy overall, while the best AI systems score below 40% on Level 3, making GAIA a robust benchmark for measuring progress toward general-purpose AI agents.
- SWE-bench (Jimenez et al., 2024): Software engineering tasks requiring agents to find and fix real GitHub issues by navigating codebases, understanding code, and producing correct patches. SWE-agent (Yang et al., 2024) and OpenHands (Wang et al., 2024) are leading agent frameworks for this benchmark. The benchmark includes 2,294 pull requests from 12 popular Python repositories, with each task requiring the agent to understand the issue description, locate relevant code across large repositories (sometimes millions of lines), and produce a correct patch.
- SWE-bench Verified (OpenAI, 2024): A curated subset of 500 tasks from SWE-bench where each has been human-verified to be solvable and have correct test cases, providing more reliable evaluation than the original benchmark. Top agents achieve approximately 50% resolve rate on Verified, compared to roughly 20% on the full SWE-bench.
- HumanEval (Chen et al., 2021): Function-level code generation benchmark measuring pass@k, the probability that at least one of k sampled solutions passes the unit tests, where k controls the search budget. The gap between pass@1 and pass@100 directly measures the value of search in code generation.
- AgentBench (Liu et al., 2024): A multi-dimensional benchmark evaluating LLM agents across 8 distinct environments including web shopping, database operations, interactive coding, knowledge graph reasoning, and operating system interaction, providing the most comprehensive evaluation of cross-domain agent capabilities.
- OSWorld (Xie et al., 2024): Tests agents on real operating system tasks (Ubuntu, Windows, macOS) including file management, application use, and system configuration. Current best agents achieve only 12.24% success rate versus 72.36% for humans, demonstrating that computer-use agents remain far from practical reliability.
- Tau-bench (Yao et al., 2024): Tests agents on simulated customer service scenarios requiring tool use, policy compliance, and multi-turn dialogue, measuring both task completion and adherence to business rules -- a critical dimension for deployed agents.
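The unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021) is short enough to state directly. Given n generated samples of which c pass the tests, it estimates the probability that at least one of k randomly drawn samples passes:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), i.e. one minus the probability that
    all k drawn samples come from the n - c failing ones."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 correct out of 200 samples: pass@1 is simply the pass rate, 5%.
print(pass_at_k(200, 10, 1))  # 0.05
```

Computing the estimator this way, rather than naively drawing k samples, avoids the high variance of direct sampling and is why results are reported from large n (e.g. n = 200) even when k is small.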
Evaluation Challenges
Evaluating agentic search systems presents unique challenges beyond those in standard NLP evaluation:
Trajectory evaluation: The quality of an agentic search system depends not just on its final answer but on the efficiency and correctness of its search trajectory. Two systems may arrive at the same answer, but one may do so with 3 search queries while the other uses 30 -- the former is clearly superior. Metrics that evaluate only the final answer miss this critical dimension. Recent work on process reward models (PRMs) (Lightman et al., 2023) addresses this by evaluating intermediate reasoning steps, but extending PRMs to agentic search trajectories (which include heterogeneous actions like queries, tool calls, and reasoning steps) remains an open problem.
Reproducibility: Agentic search results depend on the content of web pages and search engine results at the time of evaluation, which change constantly. This makes exact reproduction of results impossible and complicates longitudinal comparison of methods. Benchmarks like WebArena and CRAG partially address this by using self-hosted or snapshot-based environments, but they sacrifice the realism of live web search. The fundamental tension between reproducibility and ecological validity is a defining challenge for the field.
Open-ended evaluation: Many agentic search tasks (research questions, synthesis tasks) have no single correct answer, making automatic evaluation difficult. Human evaluation is expensive and subjective. LLM-as-judge approaches provide a scalable alternative but have their own biases -- they tend to prefer longer, more detailed answers regardless of accuracy, and they exhibit position bias (preferring the first response in pairwise comparisons) (Zheng et al., 2023). Emerging approaches use multi-judge panels and calibration against human ratings to improve reliability.
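One common mitigation for the position bias noted above is to query the judge in both answer orders and accept a verdict only when it is order-consistent. A sketch, where `judge` is a stand-in callable for an LLM judging call returning "first", "second", or "tie":

```python
def debiased_pairwise_judgment(judge, question, answer_a, answer_b):
    """Judge the pair in both orders; a verdict counts only if it
    survives the position swap, otherwise report a tie."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"

# A purely position-biased judge always picks the first response;
# the swap exposes its inconsistency and yields a tie.
biased = lambda q, first, second: "first"
print(debiased_pairwise_judgment(biased, "q", "A", "B"))  # "tie"
```

The same both-orders protocol is used in MT-Bench (Zheng et al., 2023); multi-judge panels extend the idea by requiring agreement across judges as well as across orderings.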
Cost accounting: Different systems may use different numbers of API calls, different model sizes, and different amounts of retrieved context. Fair comparison requires accounting for the total computational cost, not just the quality of the final output. A system achieving 90% accuracy with 10 API calls is arguably superior to one achieving 92% accuracy with 200 API calls, but current evaluation frameworks rarely capture this distinction formally.
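Any formal cost accounting requires choosing a weighting between quality and cost. The toy score below is an illustrative assumption, not a standard metric: it discounts accuracy by the logarithm of API-call count, and under that choice the 90%-at-10-calls system does outrank the 92%-at-200-calls one:

```python
import math

def cost_adjusted_score(accuracy, api_calls):
    """Toy cost-normalized score: accuracy discounted by log API-call
    count. The log weighting is an arbitrary design choice made here
    for illustration; real evaluations should report accuracy and
    cost separately as a Pareto frontier."""
    return accuracy / math.log10(10 + api_calls)

frugal = cost_adjusted_score(0.90, 10)     # 90% accuracy, 10 calls
lavish = cost_adjusted_score(0.92, 200)    # 92% accuracy, 200 calls
print(frugal > lavish)  # True
```

Reporting the raw (accuracy, cost) pairs and the induced Pareto frontier avoids baking any single trade-off into the benchmark, while still making cost-inefficient systems visible.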
Safety and harm evaluation: Agentic search systems that browse the web and execute tools pose unique safety risks: they may encounter and propagate misinformation, leak private information through search queries, or take harmful actions when given tool access. Evaluating these risks requires red-teaming and adversarial testing beyond standard accuracy metrics. The challenge is particularly acute for systems with real-world tool access (code execution, web browsing, API calls), where mistakes can have consequences beyond incorrect answers.
References
- Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba (2021). Evaluating Large Language Models Trained on Code. arXiv.
- Xiang Deng, Yu Gu, Boyuan Zheng, et al. (2024). Mind2Web: Towards a Generalist Agent for the Web. NeurIPS.
- Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL.
- Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt (2021). Measuring Massive Multitask Language Understanding. ICLR.
- Carlos E. Jimenez, John Yang, Alexander Wettig (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. ICLR.
- Mandar Joshi, Eunsol Choi, Daniel S. Weld, Luke Zettlemoyer (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL.
- Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
- Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. ACL.
- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, et al. (2019). Natural Questions: A Benchmark for Question Answering Research. TACL.
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe (2023). Let's Verify Step by Step. ICLR.
- Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen (2023). How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval. EMNLP Findings.
- Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang (2024). AgentBench: Evaluating LLMs as Agents. ICLR.
- Gregoire Mialon et al. (2024). GAIA: A Benchmark for General AI Assistants. ICLR.
- Niklas Muennighoff, Nouamane Tazi, Loic Magne, Nils Reimers (2023). MTEB: Massive Text Embedding Benchmark. EACL.
- Fabio Petroni, Aleksandra Piktus, Angela Fan, et al. (2021). KILT: A Benchmark for Knowledge Intensive Language Tasks. NAACL.
- Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu (2023). One Embedder, Any Task: Instruction-Finetuned Text Embeddings. ACL Findings.
- Nandan Thakur, Nils Reimers, Andreas Ruckle, Abhishek Srivastava, Iryna Gurevych (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS.
- James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Arpit Mittal (2018). FEVER: A Large-scale Dataset for Fact Extraction and VERification. NAACL.
- Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal (2022). MuSiQue: Multihop Questions via Single Hop Question Composition. TACL.
- Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Zilong Wang, Likun Liu, Junyang Lin, Graham Neubig (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv.
- Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS.
- Zhilin Yang, Peng Qi, Saizheng Zhang, et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP.
- Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Ber Dornfeld, Chenxi Huang, Lester Kim, et al. (2024). CRAG -- Comprehensive RAG Benchmark. arXiv.
- John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv.
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS.
- Shuyan Zhou, Frank F. Xu, Hao Zhu, et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR.