Tool-Augmented Retrieval

Tool-augmented retrieval represents the foundational layer of agentic search: endowing LLMs with the ability to invoke external tools (search engines, calculators, APIs, databases) as part of their generation process. This transforms the LLM from a passive generator into an active information seeker.

WebGPT

Nakano et al. (2021) developed WebGPT, one of the first systems to train an LLM to use a web search engine as a tool. WebGPT was fine-tuned with human feedback to issue search queries, click on results, scroll through pages, quote passages, and compose answers with citations. Training combined behavior cloning (imitating human search trajectories) with reinforcement learning from human feedback (RLHF, optimizing for human-rated answer quality).

WebGPT demonstrated several important findings: (1) LLMs can learn search strategies that are competitive with human searchers on the ELI5 long-form question-answering benchmark; (2) optimizing against a reward model trained on human preferences produces better answers than behavior cloning alone, although in the reported results most of this gain came from rejection sampling (best-of-n) against the reward model rather than from reinforcement learning itself; (3) the best WebGPT model (175B with best-of-64 sampling) produced answers that human raters preferred 69% of the time over the top-voted Reddit answer on ELI5 and 56% of the time over the human demonstrators, showing that tool-augmented LLMs can match or modestly exceed human demonstrators on this preference metric.
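The best-of-64 setting mentioned above is a form of rejection sampling: draw several candidate answers and keep the one a reward model scores highest. The following is a minimal sketch of that idea, not WebGPT's implementation; `sample_answer`, `reward`, and the toy stand-ins are hypothetical placeholders for a browsing policy and a learned reward model.

```python
import itertools
from typing import Callable, List


def best_of_n(question: str,
              sample_answer: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 64) -> str:
    """Draw n candidate answers and return the one with the highest reward."""
    candidates: List[str] = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))


# Toy stand-ins (assumptions for illustration only): a "policy" that cycles
# through canned answers and a "reward model" that prefers longer answers.
_canned = itertools.cycle(["short", "a longer answer", "the longest answer of all"])


def toy_sampler(question: str) -> str:
    return next(_canned)


def toy_reward(question: str, answer: str) -> float:
    return float(len(answer))


best = best_of_n("Why is the sky blue?", toy_sampler, toy_reward, n=3)
print(best)
```

The cost of this approach grows linearly with n at inference time, which is the price paid for not updating the policy itself.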

Toolformer

Schick et al. (2023) introduced Toolformer, which teaches LLMs to use external tools in a self-supervised manner, eliminating the need for human demonstrations. The key idea is to let the LLM annotate its own training data with tool calls: (1) generate candidate tool-call insertions at candidate positions in the training text, (2) execute the tool calls, (3) retain only the tool calls that reduce the model's perplexity on subsequent tokens (i.e., the tool call actually helps the model's predictions). The model is then fine-tuned on this self-annotated dataset.
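The filtering step (3) can be sketched as a comparison of losses on the tokens that follow a candidate call. This is an illustrative sketch under stated assumptions: the loss values are supplied directly rather than computed by a model, and the threshold `tau` and the use of two baselines (no call, and call without its result) reflect one reading of the filtering rule rather than a verified reproduction of it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    position: int            # token index where the call would be inserted
    call: str                # textual form of the tool call
    loss_with_result: float  # loss on following tokens given call + result
    loss_call_only: float    # loss given the call but not its result
    loss_no_call: float      # loss with no call inserted at all


def keep(candidate: Candidate, tau: float) -> bool:
    """Keep a call only if seeing its result lowers the loss by at least tau."""
    baseline = min(candidate.loss_no_call, candidate.loss_call_only)
    return baseline - candidate.loss_with_result >= tau


def filter_calls(candidates: List[Candidate], tau: float = 1.0) -> List[Candidate]:
    return [c for c in candidates if keep(c, tau)]


# Made-up loss values, for illustration only.
candidates = [
    Candidate(12, "Calculator(400 / 1400)", 1.2, 2.9, 3.0),  # clearly helpful
    Candidate(30, "Calendar()", 2.4, 2.5, 2.45),             # barely changes loss
]
kept = filter_calls(candidates, tau=1.0)
print([c.call for c in kept])
```

Only the surviving calls would be written back into the training text before fine-tuning.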

Toolformer demonstrated that a 6.7B parameter model with tool access (search engine, calculator, calendar, translation API, question-answering system) can outperform a 175B parameter model without tools on math and reasoning benchmarks (SVAMP, ASDiv, MAWPS) and LAMA-style factual completion. Notably, the gains do not extend uniformly: on open-domain QA benchmarks (TriviaQA, WebQuestions, Natural Questions), Toolformer still falls short of GPT-3, which the authors attribute to its weak built-in search engine. This result underscores a key insight: where the tools are strong, tool augmentation is a form of efficiency. External tools let a smaller model match a much larger one because they supply capabilities (precise calculation, up-to-date information) that would otherwise require enormous parametric capacity.

ReAct: Reasoning + Acting

Yao et al. (2023) proposed ReAct, a prompting framework where the LLM alternates between reasoning traces (internal thoughts about what to do and why) and actions (search queries, lookups, tool invocations). By interleaving reasoning with action, ReAct enables more systematic and interpretable search strategies than pure action-based approaches. A ReAct trace looks like:

Thought: I need to find the population of Tokyo to answer this comparison question.
Action: Search[population of Tokyo 2024]
Observation: Tokyo has a population of approximately 14 million in the city proper...
Thought: Now I need the population of New York for comparison...

ReAct has become a foundational pattern for LLM-based agents, and its influence extends well beyond retrieval: the interleaving of reasoning and action is now the standard paradigm for autonomous LLM agents across domains (web browsing, code generation, task completion). The key insight is that explicit reasoning traces help the model maintain state, plan ahead, and recover from errors, capabilities that are diminished when the model acts without thinking.
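The thought, action, observation cycle can be sketched as a small control loop around the model. The code below is a minimal illustration, not the ReAct reference implementation: the `Action: Tool[argument]` syntax is parsed with a simple regular expression, `Finish[...]` is assumed as the terminating action, and `scripted_llm` is a hypothetical stand-in that replays canned model outputs in place of a real LLM.

```python
import re
from typing import Callable, Dict, Optional

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")


def react_loop(question: str,
               llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Optional[str]:
    """Alternate model steps and tool observations until Finish[...] appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)              # expected: "Thought: ...\nAction: X[...]"
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:                   # model produced no parsable action
            return None
        name, argument = match.groups()
        if name == "Finish":
            return argument
        tool = tools.get(name)
        observation = tool(argument) if tool else f"Unknown tool: {name}"
        transcript += f"Observation: {observation}\n"
    return None                             # step budget exhausted


# Hypothetical scripted model and toy search tool (illustrative values only).
script = iter([
    "Thought: I need the population of Tokyo.\nAction: Search[population of Tokyo]",
    "Thought: Now I need New York for comparison.\nAction: Search[population of New York]",
    "Thought: Tokyo is larger.\nAction: Finish[Tokyo]",
])


def scripted_llm(transcript: str) -> str:
    return next(script)


facts = {
    "population of Tokyo": "approximately 14 million (city proper)",
    "population of New York": "roughly 8 million (city proper)",
}


def search(query: str) -> str:
    return facts.get(query, "no result")


answer = react_loop("Which city is larger, Tokyo or New York?",
                    scripted_llm, {"Search": search})
print(answer)
```

Because every observation is appended to the transcript, the model sees its own earlier thoughts on each step, which is how the reasoning trace carries state across tool calls.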

TALM: Tool Augmented Language Models

Parisi et al. (2022) proposed TALM, which iteratively fine-tunes language models on tool-use examples generated by the model itself. Starting from a few seed examples of tool use, TALM generates additional tool-use examples, filters them for correctness, and uses the filtered examples for further fine-tuning. This bootstrapping approach scales more easily than human-supervised methods and demonstrated that tool-use capability improves with both model scale and the number of bootstrapping iterations. Compared to WebGPT, which relies on costly human trajectories and reward modeling, and to Toolformer, whose self-supervised filtering is a single pass over the corpus, TALM occupies a middle ground: it amortizes supervision through repeated self-generation and filtering, trading per-example human cost for additional training iterations.
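The generate, filter, fine-tune cycle described above can be sketched as a short loop. This is a toy illustration of the bootstrapping pattern, not TALM's training code: `generate`, `is_correct`, and `fine_tune` are hypothetical callables, and the "model" here is just an integer skill level so that the loop can run without any ML framework.

```python
from typing import Callable, List, Tuple

Example = Tuple[int, int]  # (task, answer) in this toy setting


def bootstrap(seed: List[Example],
              tasks: List[int],
              generate: Callable[[int, int], Example],
              is_correct: Callable[[Example], bool],
              fine_tune: Callable[[List[Example]], int],
              rounds: int = 3) -> Tuple[int, List[Example]]:
    """Grow the training set with the model's own correct tool-use examples."""
    dataset = list(seed)
    model = fine_tune(dataset)
    for _ in range(rounds):
        for task in tasks:
            example = generate(model, task)
            # Keep only verified, previously unseen examples.
            if is_correct(example) and example not in dataset:
                dataset.append(example)
        model = fine_tune(dataset)  # retrain on the enlarged dataset
    return model, dataset


# Toy assumptions: a task of difficulty d is solved only if d <= skill + 1,
# the correct answer to task d is 2 * d, and skill equals the dataset size.
def toy_generate(skill: int, task: int) -> Example:
    return (task, 2 * task) if task <= skill + 1 else (task, -1)


def toy_is_correct(example: Example) -> bool:
    task, answer = example
    return answer == 2 * task


def toy_fine_tune(dataset: List[Example]) -> int:
    return len(dataset)


model, dataset = bootstrap([(1, 2)], [1, 2, 3, 4, 5, 6],
                           toy_generate, toy_is_correct, toy_fine_tune, rounds=3)
print(model, dataset)
```

In this toy run each round unlocks one harder task, mirroring the observation that capability grows with the number of bootstrapping iterations.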

Function Calling and Structured Tool Use

The practical deployment of tool-augmented LLMs has converged on function calling, a structured interface where the model outputs a JSON-formatted function call (specifying the tool name and arguments), the system executes the function, and the result is fed back to the model. OpenAI's function calling API, Anthropic's tool use, and Google's function calling have standardized this interface, making tool augmentation accessible to application developers.
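The round trip described above (model emits a JSON call, the system executes it, the result goes back to the model) can be sketched as follows. The message shape used here, with `name` and `arguments` keys, is a generic assumption for illustration and is not the exact wire format of any particular vendor's API; `get_weather` is a hypothetical stub tool.

```python
import json
from typing import Any, Callable, Dict


def get_weather(city: str) -> Dict[str, str]:
    """Hypothetical stub tool; a real one would query a weather service."""
    return {"city": city, "forecast": "sunny"}


TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def execute_call(model_output: str) -> str:
    """Parse a JSON function call, run the tool, and serialize the result."""
    call = json.loads(model_output)
    function = TOOLS[call["name"]]
    result = function(**call["arguments"])
    # This string is what would be appended to the conversation for the model.
    return json.dumps({"tool": call["name"], "result": result})


model_output = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
tool_message = execute_call(model_output)
print(tool_message)
```

Keeping the call as structured data rather than free text is what lets the host application validate, log, or reject it before anything is executed.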

Gorilla (Patil et al., 2023) specifically trained an LLM to generate accurate API calls from natural language instructions, addressing the challenge that general-purpose LLMs often generate syntactically correct but semantically incorrect API calls. By fine-tuning on a large dataset of API documentation and usage examples, Gorilla (a 7B LLaMA model with Retriever-Aware Training) improved AST accuracy by approximately 20.4% over GPT-4 across Torch Hub, HuggingFace, and TensorFlow Hub.

ToolLLM / ToolBench (Qin et al., 2024) provided a comprehensive benchmark and training framework for tool-augmented LLMs, covering 16,464 real-world APIs from RapidAPI across 49 categories. The work demonstrated that fine-tuning on diverse tool-use examples produces models that generalize to unseen tools, and that multi-step tool-use planning via a depth-first search-based decision tree (DFSDT) over tool chains significantly outperforms single-step tool selection. Together, these systems trace a shift across supervision paradigms: from human-supervised behavior cloning plus RLHF (WebGPT), to self-supervised annotation (Toolformer), to model-bootstrapped fine-tuning (TALM), and finally to large-scale instruction tuning over API corpora (Gorilla, ToolLLM) backed by standardized function-calling interfaces.
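The depth-first search over tool chains mentioned for ToolLLM can be pictured as ordinary DFS with backtracking when a call fails. The sketch below is a heavily simplified illustration, not the DFSDT algorithm as published: in the real system an LLM proposes and assesses each step, whereas here `propose` and `apply_tool` are hypothetical deterministic stubs and the state is a set of known facts.

```python
from typing import Callable, FrozenSet, List, Optional

State = FrozenSet[str]


def dfs_tool_chain(state: State,
                   propose: Callable[[State], List[str]],
                   apply_tool: Callable[[State, str], Optional[State]],
                   is_goal: Callable[[State], bool],
                   max_depth: int = 4,
                   path: Optional[List[str]] = None) -> Optional[List[str]]:
    """Return the first chain of tool calls that reaches the goal, or None."""
    path = path or []
    if is_goal(state):
        return path
    if len(path) >= max_depth:
        return None
    for action in propose(state):
        next_state = apply_tool(state, action)
        if next_state is None:   # call failed: backtrack and try a sibling
            continue
        found = dfs_tool_chain(next_state, propose, apply_tool, is_goal,
                               max_depth, path + [action])
        if found is not None:
            return found
    return None


# Hypothetical tools: weather_v1 always fails, geocode yields coordinates,
# weather_v2 needs coordinates and yields a forecast.
def toy_propose(state: State) -> List[str]:
    return ["weather_v1", "geocode", "weather_v2"]


def toy_apply(state: State, action: str) -> Optional[State]:
    if action == "geocode" and "coords" not in state:
        return state | {"coords"}
    if action == "weather_v2" and "coords" in state:
        return state | {"forecast"}
    return None


chain = dfs_tool_chain(frozenset({"city"}), toy_propose, toy_apply,
                       lambda s: "forecast" in s)
print(chain)
```

The contrast with single-step selection is visible in the toy run: the first proposed tool fails, and only a search that can abandon that branch finds the two-call chain.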

Comparing the Families

The systems above differ primarily in where their supervision comes from, how expensive that supervision is, and whether tool use is one-shot or iterative within a single query. The following table contrasts them along these axes:

System	Supervision source	Relative training cost	Tool use within a query
WebGPT	Human demonstrations + RLHF	High (human trajectories and reward model)	Iterative (multi-step browsing)
Toolformer	Self-supervised (perplexity-filtered self-annotation)	Moderate (single annotation pass, then fine-tune)	One-shot per insertion point
ReAct	None (prompting only)	None (inference-time)	Iterative (interleaved reason and act)
TALM	Model-bootstrapped self-training	Moderate (repeated generate, filter, fine-tune)	Iterative
Gorilla	Instruction tuning on API docs	Moderate (supervised fine-tuning)	One-shot API call
ToolLLM	Instruction tuning + tree search	High (large API corpus, search-augmented)	Iterative (DFSDT over tool chains)

Two trends stand out. First, supervision has moved away from costly human trajectories toward self- and model-generated data, lowering the cost of teaching new tools. Second, the field has shifted from one-shot tool calls toward iterative, planned tool use, which is what makes downstream agentic search possible.

Limitations of Tool-Augmented Approaches

Current tool-augmented systems face several challenges: (1) tool hallucination, generating calls to nonexistent tools or using incorrect arguments, especially for unfamiliar APIs; (2) error recovery, where a tool call fails or returns unexpected results and the model must interpret the error and reformulate, which current models do imperfectly; (3) tool composition, effectively chaining multiple tool calls so that the output of one tool feeds into the input of another, which requires planning capabilities that are still developing; and (4) security, since allowing LLMs to execute arbitrary tool calls raises safety concerns, particularly for tools with side effects (sending emails, executing code, making purchases).
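Some of these failure modes can be partly mitigated outside the model by checking each proposed call before it runs. The following is a minimal sketch of such a guard, assuming a hypothetical registry that records whether each tool has side effects; it uses Python's standard `inspect.signature(...).bind(...)` to detect missing or unexpected arguments and returns an error string that could be fed back to the model for a retry.

```python
import inspect
from typing import Any, Callable, Dict, Optional, Tuple


def search(query: str) -> str:
    return f"results for {query}"


def send_email(to: str, body: str) -> str:
    return f"sent to {to}"


# Hypothetical registry: tool name -> (function, has_side_effects).
REGISTRY: Dict[str, Tuple[Callable[..., Any], bool]] = {
    "search": (search, False),
    "send_email": (send_email, True),
}


def check_call(name: str, arguments: Dict[str, Any],
               approved: bool = False) -> Optional[str]:
    """Return None if the call may run, otherwise an error message."""
    if name not in REGISTRY:                      # hallucinated tool
        return f"error: unknown tool '{name}'"
    function, has_side_effects = REGISTRY[name]
    try:
        inspect.signature(function).bind(**arguments)
    except TypeError as exc:                      # wrong or missing arguments
        return f"error: bad arguments for '{name}': {exc}"
    if has_side_effects and not approved:         # require explicit approval
        return f"error: '{name}' has side effects and needs approval"
    return None


print(check_call("lookup_stock", {}))
print(check_call("search", {"q": "tokyo"}))
print(check_call("send_email", {"to": "a@example.com", "body": "hi"}))
```

A guard of this kind addresses only the mechanical part of the problem; deciding whether a well-formed call is actually the right one remains a modeling question.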