
Benchmarks & Evaluation

Evaluating agentic search systems is challenging because they operate at the intersection of retrieval, reasoning, generation, and tool use. Different benchmarks test different capabilities, and no single benchmark captures the full range of agentic search behaviors.

Retrieval Benchmarks

  • KILT (Knowledge Intensive Language Tasks) (Petroni et al., 2021): A unified benchmark covering fact-checking, question answering, slot filling, entity linking, and dialogue, all requiring retrieval from Wikipedia. KILT's unified format enables fair comparison across diverse retrieval-augmented tasks. KILT evaluates both retrieval quality (R-precision, recall@5) and downstream task performance, revealing that retrieval quality is often the bottleneck: systems with identical readers but different retrievers show large performance gaps.
  • BEIR (Benchmarking IR) (Thakur et al., 2021): A heterogeneous benchmark of 18 retrieval datasets spanning diverse domains (biomedical, financial, scientific) and task types (passage retrieval, question answering, fact verification). BEIR evaluates zero-shot retrieval performance, testing whether retrieval models generalize across domains without domain-specific training. A key finding from BEIR is that dense retrievers like DPR (Karpukhin et al., 2020), despite strong in-domain performance, often underperform BM25 on out-of-domain datasets -- motivating the development of more robust dense retrievers like DRAGON (Lin et al., 2023) and instruction-tuned retrievers like INSTRUCTOR (Su et al., 2023).
  • MTEB (Massive Text Embedding Benchmark) (Muennighoff et al., 2023): The most comprehensive embedding evaluation, covering retrieval, classification, clustering, and semantic similarity across 56+ datasets in multiple languages. MTEB has become the standard leaderboard for text embedding models, revealing that no single model dominates all tasks: models tuned for retrieval may underperform on clustering, and vice versa.
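The retrieval metrics cited above (recall@k, R-precision) are straightforward to compute from a ranked result list and a gold set of relevant documents; a minimal sketch:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    top_r = set(ranked_ids[:r])
    return len(top_r & set(relevant_ids)) / r

# Illustrative ranking: both gold docs retrieved, but ranked low.
ranked = ["d3", "d7", "d1", "d9", "d2"]
gold = {"d1", "d2"}
print(recall_at_k(ranked, gold, 5))  # 1.0 -- both gold docs in top 5
print(r_precision(ranked, gold))     # 0.0 -- neither gold doc in top 2
```

The example illustrates why KILT reports both: recall@5 rewards getting the evidence anywhere in the window, while R-precision penalizes ranking it below the cutoff.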

Question Answering Benchmarks

  • HotpotQA (Yang et al., 2018): Multi-hop question answering requiring reasoning over two or more Wikipedia paragraphs. HotpotQA provides both the question and the supporting facts, enabling evaluation of both answer quality and evidence selection. The "bridge" subset is particularly challenging, requiring systems to first retrieve one passage, extract an intermediate entity, and then retrieve a second passage about that entity -- a two-hop retrieval chain that tests genuine multi-step reasoning.
  • FEVER (Fact Extraction and VERification) (Thorne et al., 2018): Requires systems to classify claims as Supported, Refuted, or Not Enough Info based on Wikipedia evidence, testing both retrieval and reasoning about evidence. The best systems achieve over 90% label accuracy but rely heavily on retrieval quality: when the correct evidence is retrieved, classification is relatively easy; the challenge is finding the right evidence among millions of passages.
  • Natural Questions (Kwiatkowski et al., 2019): Real Google search queries with answers from Wikipedia, providing a realistic evaluation of open-domain question answering. Natural Questions distinguishes between short answers (specific entities or phrases) and long answers (paragraph-level), testing different granularities of retrieval and extraction.
  • TriviaQA (Joshi et al., 2017): Large-scale QA dataset with question-evidence pairs from trivia websites, testing retrieval over large document collections. The questions often require specialized knowledge, making parametric-only approaches insufficient for many queries.
  • MuSiQue (Trivedi et al., 2022): Multi-hop questions that are specifically designed to be unanswerable by single-hop retrieval, requiring genuine multi-step reasoning. MuSiQue includes "answerable" and "unanswerable" variants, where the unanswerable questions have one supporting fact replaced with a contradicting one, testing whether systems can distinguish between sufficient and insufficient evidence.
  • MMLU (Hendrycks et al., 2021): While primarily a knowledge benchmark, MMLU tests whether retrieval augmentation can improve performance on 57 academic subjects, revealing how much of LLM performance comes from parametric vs. retrievable knowledge. Retrieval augmentation typically improves performance by 2-5% on knowledge-intensive subjects (history, science) but has minimal effect on reasoning-intensive subjects (math, logic).
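Most of the QA benchmarks above score answers with exact match (EM) and token-level F1. A simplified version of those metrics (the official evaluators additionally strip articles and punctuation during normalization):

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (official scripts also strip
    articles and punctuation)."""
    return text.lower().split()

def exact_match(pred, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Barack Obama", "barack obama"))                  # 1.0
print(round(token_f1("President Barack Obama", "Barack Obama"), 2))  # 0.8
```

F1 credits partially correct answers that EM scores as zero, which is why multi-hop benchmarks like HotpotQA report both.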

Retrieval-Augmented Generation Benchmarks

  • CRAG (Comprehensive RAG Benchmark) (Yang et al., 2024): A purpose-built benchmark for evaluating end-to-end RAG systems across five domains (finance, sports, music, movies, and open domain) with eight question types of varying complexity (simple factoid, comparison, aggregation, post-processing, etc.). CRAG provides mock API access to web search and knowledge graphs, enabling reproducible evaluation of RAG pipelines.
  • RAGAS (Es et al., 2024): A framework for reference-free evaluation of RAG systems, measuring faithfulness (are generated answers grounded in retrieved context?), answer relevancy (does the answer address the question?), and context relevancy (is the retrieved context useful?). RAGAS enables evaluation without ground-truth answers, using LLM judges to assess generation quality.
  • RGB (Retrieval-Augmented Generation Benchmark) evaluates RAG systems specifically on their ability to handle noisy retrieval, counterfactual information, and outdated knowledge -- common failure modes in deployed RAG systems.
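A RAGAS-style faithfulness score decomposes the answer into claims and measures the fraction supported by the retrieved context. In RAGAS itself the support check is an LLM-judge call; the sketch below stubs that out with a naive token-overlap heuristic, purely as an illustrative stand-in, not the actual RAGAS implementation:

```python
def supported(claim, context, threshold=0.6):
    """Toy stand-in for an LLM judge: call a claim supported if most of
    its tokens appear somewhere in the retrieved context."""
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold

def faithfulness(answer_claims, context):
    """Fraction of the answer's claims grounded in the retrieved context."""
    hits = sum(supported(c, context) for c in answer_claims)
    return hits / len(answer_claims)

# Illustrative: one grounded claim, one hallucinated date.
context = "marie curie won the nobel prize in physics in 1903"
claims = ["curie won the nobel prize", "she won it in 1911"]
print(faithfulness(claims, context))  # 0.5
```

Replacing `supported` with an LLM-judge prompt recovers the reference-free setup described above, where no ground-truth answer is needed.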

Agent Benchmarks

  • WebArena (Zhou et al., 2024): Realistic web interaction tasks on self-hosted websites covering shopping, forums, code management, and maps. Tests complex multi-step web navigation and task completion. The best LLM agents (GPT-4 based) achieve approximately 14% task success rate, compared to 78% for humans, highlighting the substantial gap between current agents and human-level web interaction.
  • VisualWebArena (Koh et al., 2024): Extension of WebArena requiring visual understanding of web pages (interpreting images, charts, visual layouts). Multimodal agents achieve only 16% success rate, revealing that visual grounding on web pages remains a major challenge.
  • Mind2Web (Deng et al., 2024): Real-world web task completion across 137 diverse websites, with human action traces for training. The benchmark covers tasks of varying complexity across three domains: travel, shopping, and information search.
  • GAIA (Mialon et al., 2024): General AI assistant benchmark requiring multi-step tool use and reasoning, designed to be easy for humans but hard for AI. GAIA questions are stratified into three difficulty levels: Level 1 (at most ~5 steps, little or no tool use), Level 2 (roughly 5-10 steps combining multiple tools), and Level 3 (arbitrarily long action sequences with implicit constraints). Human annotators achieve 92% accuracy overall, while the best AI systems score below 40% on Level 3, making GAIA a robust benchmark for measuring progress toward general-purpose AI agents.
  • SWE-bench (Jimenez et al., 2024): Software engineering tasks requiring agents to find and fix real GitHub issues by navigating codebases, understanding code, and producing correct patches. SWE-Agent (Yang et al., 2024) and OpenHands (Wang et al., 2024) are leading agent frameworks for this benchmark. The benchmark includes 2,294 pull requests from 12 popular Python repositories, with each task requiring the agent to understand the issue description, locate relevant code across large repositories (sometimes millions of lines), and produce a correct patch.
  • SWE-bench Verified (OpenAI, 2024): A curated subset of 500 tasks from SWE-bench where each has been human-verified to be solvable and have correct test cases, providing more reliable evaluation than the original benchmark. Top agents achieve approximately 50% resolve rate on Verified, compared to roughly 20% on the full SWE-bench.
  • HumanEval (Chen et al., 2021): Function-level code generation benchmark measuring pass@k, the probability that at least one of k sampled completions passes the unit tests, where k controls the search budget. The gap between pass@1 and pass@100 directly measures the value of search in code generation.
  • AgentBench (Liu et al., 2024): A multi-dimensional benchmark evaluating LLM agents across 8 distinct environments including web shopping, database operations, interactive coding, knowledge graph reasoning, and operating system interaction, providing the most comprehensive evaluation of cross-domain agent capabilities.
  • OSWorld (Xie et al., 2024): Tests agents on real operating system tasks (Ubuntu, Windows, macOS) including file management, application use, and system configuration. Current best agents achieve only 12.24% success rate versus 72.36% for humans, demonstrating that computer-use agents remain far from practical reliability.
  • Tau-bench (Yao et al., 2024): Tests agents on simulated customer service scenarios requiring tool use, policy compliance, and multi-turn dialogue, measuring both task completion and adherence to business rules -- a critical dimension for deployed agents.
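The pass@k metric used by HumanEval above is computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c that pass, and estimate pass@k = 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k completions, drawn from n samples of which c are
    correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative budget: 200 samples per problem, 30 of which pass.
print(round(pass_at_k(200, 30, 1), 3))  # 0.15
print(round(pass_at_k(200, 30, 100), 3))
```

The spread between `pass_at_k(n, c, 1)` and `pass_at_k(n, c, 100)` for the same (n, c) is exactly the "value of search" gap the bullet describes.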

Evaluation Challenges

Evaluating agentic search systems presents unique challenges beyond those in standard NLP evaluation:

Trajectory evaluation: The quality of an agentic search system depends not just on its final answer but on the efficiency and correctness of its search trajectory. Two systems may arrive at the same answer, but one may do so with 3 search queries while the other uses 30 -- the former is clearly superior. Metrics that evaluate only the final answer miss this critical dimension. Recent work on process reward models (PRMs) (Lightman et al., 2023) addresses this by evaluating intermediate reasoning steps, but extending PRMs to agentic search trajectories (which include heterogeneous actions like queries, tool calls, and reasoning steps) remains an open problem.

Reproducibility: Agentic search results depend on the content of web pages and search engine results at the time of evaluation, which change constantly. This makes exact reproduction of results impossible and complicates longitudinal comparison of methods. Benchmarks like WebArena and CRAG partially address this by using self-hosted or snapshot-based environments, but they sacrifice the realism of live web search. The fundamental tension between reproducibility and ecological validity is a defining challenge for the field.

Open-ended evaluation: Many agentic search tasks (research questions, synthesis tasks) have no single correct answer, making automatic evaluation difficult. Human evaluation is expensive and subjective. LLM-as-judge approaches provide a scalable alternative but have their own biases -- they tend to prefer longer, more detailed answers regardless of accuracy, and they exhibit position bias (preferring the first response in pairwise comparisons) (Zheng et al., 2023). Emerging approaches use multi-judge panels and calibration against human ratings to improve reliability.
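A common mitigation for the position bias noted above is to run each pairwise comparison twice with the order swapped and keep only consistent verdicts. A sketch, where `judge` stands in for a hypothetical LLM-judge call that returns "A" or "B" for whichever response it prefers:

```python
def debiased_compare(judge, resp_1, resp_2):
    """Run the pairwise judge in both orders; return a winner only when
    the two verdicts agree, otherwise declare a tie."""
    first = judge(resp_1, resp_2)   # resp_1 shown in position A
    second = judge(resp_2, resp_1)  # order swapped
    if first == "A" and second == "B":
        return "resp_1"
    if first == "B" and second == "A":
        return "resp_2"
    return "tie"  # inconsistent verdicts suggest position bias

# A pathological judge that always prefers whatever is shown first:
biased_judge = lambda a, b: "A"
print(debiased_compare(biased_judge, "answer one", "answer two"))  # tie
```

The swap test converts a purely positional preference into a tie rather than a spurious win, which is one of the calibration tricks used by multi-judge panels.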

Cost accounting: Different systems may use different numbers of API calls, different model sizes, and different amounts of retrieved context. Fair comparison requires accounting for the total computational cost, not just the quality of the final output. A system achieving 90% accuracy with 10 API calls is arguably superior to one achieving 92% accuracy with 200 API calls, but current evaluation frameworks rarely capture this distinction formally.
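One way to make the cost dimension explicit is to report the Pareto frontier over (cost, accuracy) rather than a single leaderboard number. A minimal sketch with made-up system names and numbers:

```python
def pareto_frontier(systems):
    """Keep systems not dominated by any other. A dominator is at least
    as accurate and no more expensive, and strictly better on one axis."""
    frontier = []
    for name, cost, acc in systems:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for n, c, a in systems if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, API calls per task, accuracy) -- illustrative values only.
systems = [
    ("cheap-agent", 10, 0.90),
    ("heavy-agent", 200, 0.92),
    ("bad-agent", 150, 0.85),  # dominated: costlier and less accurate
]
print(pareto_frontier(systems))  # ['cheap-agent', 'heavy-agent']
```

Under this framing the 90%-at-10-calls and 92%-at-200-calls systems from the paragraph above both survive as distinct operating points, and the comparison becomes a choice of budget rather than a single winner.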

Safety and harm evaluation: Agentic search systems that browse the web and execute tools pose unique safety risks: they may encounter and propagate misinformation, leak private information through search queries, or take harmful actions when given tool access. Evaluating these risks requires red-teaming and adversarial testing beyond standard accuracy metrics. The challenge is particularly acute for systems with real-world tool access (code execution, web browsing, API calls), where mistakes can have consequences beyond incorrect answers.


References