Skip to main content

Search and Self-Improvement at Inference and Training Time

Search and Self-Improvement at Inference and Training Time

This family corresponds to category 5 of the taxonomy ("Search with Planning (MCTS + LLMs)"), broadened here to cover the inference-time and training-time variants together. It groups methods that improve LLM outputs by exploring a space of candidates rather than committing to a single greedy generation. Tree search methods (ToT, RAP, LATS, AlphaProof) explore solution spaces explicitly, combining the strengths of LLMs (generating candidates) with search (evaluating and selecting among them). Sampling-based methods (best-of-N, self-consistency) explore implicitly through parallel generation and reranking. Verbal and iterative methods (Reflexion, Self-Refine) search through strategy space using natural-language feedback rather than an explicit tree. A related set of training-time bootstrapping methods (STaR, Quiet-STaR) internalizes the gains of search into model weights. We group these together because they share one principle: the model is often capable of the right answer but does not produce it on the first greedy attempt, so spending extra compute on exploration, evaluation, or self-correction recovers that latent capability.

Tree-of-Thought (ToT)

Yao et al. (2023) (Yao et al., 2023) proposed Tree-of-Thought, which generalizes chain-of-thought prompting to a tree structure where the LLM explores multiple reasoning paths. At each step, the model generates several candidate "thoughts," evaluates their promise using the LLM itself, and uses breadth-first or depth-first search to explore the most promising branches. ToT demonstrated significant improvements on tasks requiring planning and systematic exploration (e.g., Game of 24, creative writing, crossword puzzles). Compared with the MCTS-based methods below, ToT supplies its value signal entirely from LLM self-evaluation and uses a fixed breadth-first or depth-first budget, which keeps it simple but makes its search quality only as reliable as the LLM's own judgments of partial thoughts.

Reasoning via Planning (RAP)

Hao et al. (2023) (Hao et al., 2023) proposed RAP, which frames LLM reasoning as planning with a world model. The LLM serves as both the world model (predicting the results of reasoning steps) and the reasoning agent. MCTS is used to explore the reasoning tree, with the LLM providing value estimates for tree node evaluation. RAP demonstrated that treating reasoning as search significantly improves performance on mathematical and logical reasoning tasks. Relative to ToT, RAP replaces fixed-breadth traversal with MCTS, allocating its search budget adaptively to promising branches, but it still draws both its world-model predictions and its value estimates from the same LLM, so it inherits ToT's dependence on a reliable self-evaluator.

Language Agent Tree Search (LATS)

Zhou et al. (2024) (Zhou et al., 2024) proposed LATS, which unifies reasoning, acting, and planning in language agents using Monte Carlo Tree Search. LATS uses the LLM as a heuristic within MCTS to: generate candidate actions (expansion), evaluate states (simulation), and propagate values (backpropagation). External feedback (tool outputs, environment observations) provides grounding. LATS achieved state-of-the-art performance on HumanEval code generation and WebShop shopping tasks. The key distinction across these three methods is what supplies the value function: ToT and RAP rely on LLM self-evaluation alone, whereas LATS grounds its node values in external feedback (tool outputs, environment observations), which makes its search more reliable in interactive settings at the cost of many environment interactions and the high token and latency overhead of full MCTS rollouts.

Google DeepMind's AlphaProof (2024) (DeepMind, 2024) combined a language model (Gemini) with reinforcement learning and tree search to solve International Mathematical Olympiad problems. AlphaProof generates proof candidates in Lean 4 formal language, uses a value function to evaluate partial proofs, and searches through the proof space using MCTS. At IMO 2024, AlphaProof (combined with AlphaGeometry 2) achieved a silver medal-level score, demonstrating the power of search-guided mathematical reasoning.

Recent MCTS-for-LLM Methods

A wave of 2024 work applies MCTS directly to general LLM reasoning rather than to a single formal domain. rStar (Qi et al., 2024) (Qi et al., 2024) pairs a generator small model that expands an MCTS tree of human-like reasoning actions with a discriminator model that verifies the resulting paths, treating trajectories the two models agree on as more likely correct; this self-play mutual-reasoning scheme lifts small open models on GSM8K, MATH, and StrategyQA without any fine-tuning. ReST-MCTS* (Zhang et al., 2024) (Zhang et al., 2024) instead folds the search into a training loop: it uses process-reward-guided tree search to collect high-quality reasoning traces and per-step value estimates, then self-trains both the policy and the reward model on them, reporting gains over best-of-N and ToT under matched search budgets. The two illustrate the same split seen above between inference-time search (rStar) and search amortized into weights (ReST-MCTS*), now with MCTS rather than fixed-breadth traversal or simple sampling as the search backbone.

A simpler but surprisingly effective form of search is best-of-N sampling (also called rejection sampling): generate N candidate solutions, score them with a reward model or verifier, and return the best one. This approach requires no tree search infrastructure, only parallel generation and scoring. Self-consistency (Wang et al., 2023) (Wang et al., 2023) demonstrated a related principle: sampling multiple reasoning chains and selecting the answer by majority vote substantially improves accuracy, even without a trained verifier. When combined with process reward models (PRMs) that evaluate intermediate reasoning steps (Lightman et al., 2023), best-of-N sampling achieves substantial improvements over single-sample generation. For example, on the MATH dataset, best-of-N sampling with PRM scoring can improve accuracy to approximately 78%, though the exact baseline and number of samples vary across implementations. The limitations are equally clear: best-of-N's cost scales linearly with N (hundreds of samples can cost hundreds of times the compute of a single attempt), and because the reward model is only an approximation of true quality, aggressive reranking can amplify reward hacking, selecting candidates that score well under the model but are not actually correct.

The key insight is that many LLM failures are not capability limitations but sampling failures: the model is capable of producing the correct answer but does not always do so on a single attempt. Recent work has shown that chain-of-thought reasoning can emerge without explicit prompting through careful decoding strategies (Wang & Zhou, 2024). Search (whether tree search or best-of-N) addresses sampling failures by exploring more of the solution space, and the role of the reward model is to identify the correct solutions among the candidates. This perspective connects LLM reasoning to the classical explore-exploit tradeoff in reinforcement learning.

Test-Time Compute Scaling

A unifying perspective on search-augmented LLMs is test-time compute scaling: using more computation at inference time (through search, multi-step reasoning, or iterative refinement) to achieve better results. While pre-training compute follows well-characterized scaling laws, the relationship between test-time compute and output quality is a newer research direction. Early results suggest that test-time compute can be as effective as pre-training compute for reasoning tasks, a finding with profound implications for model deployment (it may be more cost-effective to run a smaller model with more search than to train a larger model) (Snell et al., 2024).

OpenAI's o1 and o3 reasoning models (OpenAI, 2024) demonstrate this principle at scale: by using extended chain-of-thought reasoning at inference time, these models achieve substantial improvements on mathematical reasoning, coding, and scientific problems compared to direct generation. The o1 model uses thousands of reasoning tokens internally, exploring multiple solution paths before producing a final answer. The specific mechanism (whether it resembles tree search, beam search, or a more sophisticated planning algorithm) has not been publicly disclosed, but the results demonstrate that scaling inference-time computation yields predictable, substantial quality improvements.

STaR: Self-Taught Reasoner

Unlike the inference-time search methods above, STaR and Quiet-STaR operate at training time: they use sampling and filtering to bootstrap better reasoning into the model's weights rather than to select a better answer at run time. Zelikman et al. (2022) (Zelikman et al., 2022) proposed STaR (Self-Taught Reasoner), a bootstrapping method where the model learns to reason by generating its own chain-of-thought rationales and fine-tuning on the ones that lead to correct answers. Starting from a few examples, STaR iteratively: (1) generates rationales for training problems, (2) filters for rationales that produce correct final answers, (3) for problems the model gets wrong, provides the correct answer as a hint and generates a "rationalization" explaining why that answer is correct, and (4) fine-tunes the model on the successful rationales and rationalizations. This bootstrapping loop enables the model to discover and internalize new reasoning strategies without any human-written rationales. Quiet-STaR (Zelikman et al., 2024) (Zelikman et al., 2024) extended this idea by training models to generate internal "thoughts" at every token position, learning to produce rationales that improve next-token prediction without any task-specific annotation. Compared with the inference-time methods, this approach amortizes the cost of search into a one-time training procedure (no extra inference-time compute is needed afterward), but it is limited by its reliance on answer-checkable training problems for the filtering step and, in the case of STaR's rationalization, by the risk of reinforcing plausible-sounding but unfaithful rationales that happen to reach the correct answer.

Reflexion: Verbal Reinforcement Learning

Shinn et al. (2023) (Shinn et al., 2023) proposed Reflexion, which enables language agents to learn from their mistakes through verbal self-reflection, a verbal counterpart to the explicit tree search and sampling methods above. After a failed attempt at a task, the agent generates a verbal reflection describing what went wrong and how to improve, then uses this reflection to guide its next attempt. Unlike traditional RL (which updates model weights), Reflexion stores reflections in an episodic memory buffer that conditions future attempts. Reflexion achieved significant improvements on HumanEval code generation (from 80% to 91% pass@1), ALFWorld decision-making, and HotpotQA question answering, demonstrating that self-reflection is a powerful form of search through strategy space. Its main weakness is that the code-generation results depend on self-generated unit tests as the feedback signal, and these tests can yield false positives (an agent may converge confidently on a solution that passes its own faulty tests but fails the real ones), which is one reason Reflexion does not uniformly dominate the base model on every code benchmark.

Self-Refine: Iterative Self-Improvement

Madaan et al. (2023) (Madaan et al., 2023) proposed Self-Refine, which uses the same LLM to iteratively generate, critique, and refine its own outputs. The process is: (1) generate an initial output, (2) provide feedback by critiquing the output, (3) refine the output based on the feedback, (4) repeat until the output meets quality criteria. Self-Refine demonstrated that this iterative self-improvement loop consistently improves output quality across diverse tasks (code optimization, math reasoning, sentiment transfer, dialogue) without any additional training. The key insight is that LLMs are often better at evaluating and critiquing outputs than at generating perfect outputs on the first attempt, so iterative refinement through self-evaluation is a form of search that exploits this asymmetry. Self-Refine and Reflexion both search through natural-language feedback rather than an explicit tree, but they differ in scope: Self-Refine refines a single output within one episode, whereas Reflexion carries reflections across episodes in an episodic memory. Both share the same fundamental limitation as ToT and RAP, namely that the feedback comes from the same model being corrected, so gains are bounded by the model's own ability to critique reliably, and self-evaluation can stall or even degrade outputs when the critic is overconfident or systematically biased.


References