Monte Carlo Tree Search + LLMs
Tree search methods enable systematic exploration of solution spaces, combining the strengths of LLMs (generating candidates) with search (evaluating and selecting among them).
Tree-of-Thought (ToT)
Yao et al. (2023) proposed Tree-of-Thought, which generalizes chain-of-thought prompting to a tree structure where the LLM explores multiple reasoning paths. At each step, the model generates several candidate "thoughts," evaluates their promise using the LLM itself, and uses breadth-first or depth-first search to explore the most promising branches. ToT demonstrated significant improvements on tasks requiring planning and systematic exploration (e.g., Game of 24, creative writing, crossword puzzles).
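The breadth-first variant can be sketched in a few lines. Here `propose` and `evaluate` are hypothetical stand-ins for the two LLM calls (thought generation and self-evaluation); the toy versions below just emit and sum integers:

```python
import heapq

def tree_of_thought_bfs(root, propose, evaluate, beam_width=2, depth=3):
    """Breadth-first ToT sketch: expand every frontier state into candidate
    thoughts, score the resulting partial paths, keep only the best
    beam_width, and repeat to the given depth."""
    frontier = [root]
    for _ in range(depth):
        candidates = [path + [t] for path in frontier for t in propose(path)]
        # keep the beam_width highest-scoring partial reasoning paths
        frontier = heapq.nlargest(beam_width, candidates, key=evaluate)
    return max(frontier, key=evaluate)

# Toy stand-ins (hypothetical): propose emits candidate next "thoughts"
# as integers; evaluate scores a path by its running sum.
propose = lambda path: [1, 2, 3]
evaluate = lambda path: sum(path)

best = tree_of_thought_bfs([], propose, evaluate, beam_width=2, depth=3)
print(best)  # the search keeps picking the best thought: [3, 3, 3]
```

The same skeleton becomes depth-first search by replacing the level-by-level expansion with a stack and a backtracking condition.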
Reasoning via Planning (RAP)
Hao et al. (2023) proposed RAP, which frames LLM reasoning as planning with a world model. The LLM serves as both the world model (predicting the results of reasoning steps) and the reasoning agent. MCTS is used to explore the reasoning tree, with the LLM providing value estimates for tree node evaluation. RAP demonstrated that treating reasoning as search significantly improves performance on mathematical and logical reasoning tasks.
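A minimal UCT-based MCTS loop with the two LLM roles factored out makes the structure concrete. This is a generic sketch, not RAP's implementation: `next_states` stands in for the LLM-as-world-model and `estimate_value` for the LLM's value estimate, both hypothetical; the toy versions below prefer one of two candidate reasoning steps:

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    # Upper Confidence bound applied to Trees: mean value + exploration bonus
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, next_states, estimate_value, iterations=50):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. selection: descend via UCT while every child has been tried
        node = root
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # 2. expansion: the LLM-as-world-model proposes successor states
        if not node.children:
            node.children = [Node(s, node) for s in next_states(node.state)]
        untried = [ch for ch in node.children if ch.visits == 0]
        leaf = untried[0] if untried else node
        # 3. evaluation: the LLM scores the reasoning state (no rollout)
        reward = estimate_value(leaf.state)
        # 4. backpropagation: update statistics along the path to the root
        while leaf:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits).state

# Toy stand-ins (hypothetical): two candidate next reasoning steps, one of
# which the "LLM" value function strongly prefers.
next_states = lambda s: [s + ".a", s + ".b"] if s == "start" else []
estimate_value = lambda s: 1.0 if s.endswith(".b") else 0.0
print(mcts("start", next_states, estimate_value))  # start.b
```

Replacing the rollout of classical MCTS with a direct value estimate (step 3) is exactly the move that lets an LLM serve as the evaluator.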
Language Agent Tree Search (LATS)
Zhou et al. (2024) proposed LATS, which unifies reasoning, acting, and planning in language agents using Monte Carlo Tree Search. LATS uses the LLM as a heuristic within MCTS to: generate candidate actions (expansion), evaluate states (simulation), and propagate values (backpropagation). External feedback (tool outputs, environment observations) provides grounding. LATS achieved state-of-the-art performance on HumanEval code generation and WebShop shopping tasks.
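The grounding idea can be illustrated for code generation: a node's value mixes the model's self-assessment with an external signal such as the fraction of unit tests a candidate passes. The names and the 50/50 weighting below are illustrative, not the LATS API:

```python
def evaluate_candidate(code, tests, self_score):
    """Ground an LLM-proposed program in external feedback: value is a mix
    of the model's own confidence and the fraction of tests passed."""
    passed = 0
    for inputs, expected in tests:
        env = {}
        try:
            exec(code, env)                    # candidate defines solve()
            if env["solve"](*inputs) == expected:
                passed += 1
        except Exception:
            pass                               # crashing candidates get 0
    return 0.5 * self_score + 0.5 * passed / len(tests)

# Two hypothetical LLM candidates with equal self-confidence; the unit
# tests (external feedback) break the tie.
candidates = ["def solve(x): return x * 2",
              "def solve(x): return x + 2"]
tests = [((2,), 4), ((3,), 6)]
best = max(candidates, key=lambda c: evaluate_candidate(c, tests, 0.5))
print(best)  # def solve(x): return x * 2
```

Without the test signal, the two candidates would be indistinguishable; this is what "external feedback provides grounding" buys in practice.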
AlphaProof and Mathematical Search
Google DeepMind's AlphaProof (DeepMind, 2024) combined a language model (Gemini) with reinforcement learning and tree search to solve International Mathematical Olympiad problems. AlphaProof generates proof candidates in Lean 4 formal language, uses a value function to evaluate partial proofs, and searches through the proof space using MCTS. At IMO 2024, AlphaProof (combined with AlphaGeometry 2) achieved a silver medal-level score, demonstrating the power of search-guided mathematical reasoning.
Best-of-N Sampling and Reward-Guided Search
A simpler but surprisingly effective form of search is best-of-N sampling (also called rejection sampling): generate N candidate solutions, score them with a reward model or verifier, and return the best one. This approach requires no tree search infrastructure -- just parallel generation and scoring. Self-consistency (Wang et al., 2023) demonstrated a related principle: sampling multiple reasoning chains and selecting the answer by majority vote substantially improves accuracy, even without a trained verifier. When combined with process reward models (PRMs) that evaluate intermediate reasoning steps (Lightman et al., 2023), best-of-N sampling achieves substantial improvements over single-sample generation. For example, on the MATH dataset, generating 256 solutions and selecting the best via PRM scoring improves accuracy from ~50% to ~78%.
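Both selection rules fit in a few lines. The sampled answers and the verifier below are toy stand-ins (in practice the candidates come from temperature sampling and the scorer is a reward model or PRM):

```python
from collections import Counter

def best_of_n(candidates, score):
    """Best-of-N / rejection sampling: score each candidate with a reward
    model or verifier and return the top one."""
    return max(candidates, key=score)

def self_consistency(answers):
    """Self-consistency: majority vote over the final answers of several
    sampled reasoning chains -- no trained verifier required."""
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins (hypothetical): five sampled final answers to "6 * 7".
samples = ["41", "42", "42", "43", "42"]
verifier = lambda ans: 1.0 if ans == str(6 * 7) else 0.0

print(best_of_n(samples, verifier))   # 42 -- the verifier singles it out
print(self_consistency(samples))      # 42 -- it is also the modal answer
```

The two methods differ in what they require: best-of-N needs a scorer but works even when the correct answer is sampled only once, while self-consistency needs no scorer but relies on the correct answer being the most frequent one.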
The key insight is that many LLM failures are not capability limitations but sampling failures -- the model is capable of producing the correct answer but does not always do so on a single attempt. Recent work has shown that chain-of-thought reasoning can emerge without explicit prompting through careful decoding strategies (Wang & Zhou, 2024). Search (whether tree search or best-of-N) addresses sampling failures by exploring more of the solution space, and the role of the reward model is to identify the correct solutions among the candidates. This perspective connects LLM reasoning to the classical explore-exploit tradeoff in reinforcement learning.
Test-Time Compute Scaling
A unifying perspective on search-augmented LLMs is test-time compute scaling: using more computation at inference time (through search, multi-step reasoning, or iterative refinement) to achieve better results. While pre-training compute follows well-characterized scaling laws, the relationship between test-time compute and output quality is a newer research direction. Early results suggest that test-time compute can be as effective as pre-training compute for reasoning tasks -- a finding with profound implications for model deployment (it may be more cost-effective to run a smaller model with more search than to train a larger model) (Snell et al., 2024).
OpenAI's o1 and o3 reasoning models (OpenAI, 2024) demonstrate this principle at scale: by using extended chain-of-thought reasoning at inference time, these models achieve substantial improvements on mathematical reasoning, coding, and scientific problems compared to direct generation. The o1 model uses thousands of reasoning tokens internally, exploring multiple solution paths before producing a final answer. The specific mechanism -- whether it resembles tree search, beam search, or a more sophisticated planning algorithm -- has not been publicly disclosed, but the results demonstrate that scaling inference-time computation yields predictable, substantial quality improvements.
STaR: Self-Taught Reasoner
Zelikman et al. (2022) proposed STaR (Self-Taught Reasoner), a bootstrapping method where the model learns to reason by generating its own chain-of-thought rationales and fine-tuning on the ones that lead to correct answers. Starting from a few examples, STaR iteratively: (1) generates rationales for training problems, (2) filters for rationales that produce correct final answers, (3) for problems the model gets wrong, provides the correct answer as a hint and generates a "rationalization" explaining why that answer is correct, and (4) fine-tunes the model on the successful rationales and rationalizations. This bootstrapping loop enables the model to discover and internalize new reasoning strategies without any human-written rationales. Quiet-STaR (Zelikman et al., 2024) extended this idea by training models to generate internal "thoughts" at every token position, learning to produce rationales that improve next-token prediction without any task-specific annotation.
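One bootstrap iteration, with the fine-tuning step elided, looks like the following sketch. `generate` and `rationalize` are hypothetical stand-ins for the two LLM calls (sampling a rationale, and re-deriving one with the answer given as a hint):

```python
def star_iteration(generate, rationalize, problems, answers):
    """One STaR bootstrap iteration (sketch). `generate` returns a
    (rationale, predicted_answer) pair; `rationalize` produces a
    hint-conditioned rationale given the correct answer."""
    training_data = []
    for problem, answer in zip(problems, answers):
        rationale, predicted = generate(problem)          # step 1: sample
        if predicted == answer:
            training_data.append((problem, rationale))    # step 2: filter
        else:
            # step 3: rationalization with the answer provided as a hint
            training_data.append((problem, rationalize(problem, answer)))
    # step 4 (not shown): fine-tune on training_data, then repeat the loop
    return training_data

# Toy model that doubles its input; it misses problem 2 (4 != 5), so that
# problem contributes a rationalization instead of a filtered rationale.
generate = lambda p: (f"rationale for {p}", p * 2)
rationalize = lambda p, a: f"hinted rationale: {p} -> {a}"
data = star_iteration(generate, rationalize, [1, 2, 3], [2, 5, 6])
```

The crucial filter in step 2 is what keeps the loop from reinforcing incorrect reasoning: only rationales validated by a correct final answer make it into the fine-tuning set.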
Reflexion: Verbal Reinforcement Learning
Shinn et al. (2023) proposed Reflexion, which enables language agents to learn from their mistakes through verbal self-reflection. After a failed attempt at a task, the agent generates a verbal reflection describing what went wrong and how to improve, then uses this reflection to guide its next attempt. Unlike traditional RL (which updates model weights), Reflexion stores reflections in an episodic memory buffer that conditions future attempts. Reflexion achieved significant improvements on HumanEval code generation (from 80% to 91% pass@1), ALFWorld decision-making, and HotpotQA question answering, demonstrating that self-reflection is a powerful form of search through strategy space.
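The trial loop can be sketched as follows; `attempt`, `evaluate`, and `reflect` are hypothetical stand-ins for the actor LLM, the environment/evaluator, and the self-reflection LLM call. Note that no weights change anywhere -- all learning lives in the memory buffer:

```python
def reflexion_loop(attempt, evaluate, reflect, task, max_trials=3):
    """Reflexion sketch: on failure, store a verbal self-reflection in an
    episodic memory buffer and condition the next attempt on it."""
    memory = []                          # episodic buffer of reflections
    output = None
    for _ in range(max_trials):
        output = attempt(task, memory)   # memory conditions this attempt
        ok, feedback = evaluate(output)
        if ok:
            break
        memory.append(reflect(task, output, feedback))
    return output, memory

# Toy stand-ins (hypothetical): the first attempt fails, and any attempt
# made with at least one reflection in memory succeeds.
attempt = lambda task, mem: "fixed" if mem else "buggy"
evaluate = lambda out: (out == "fixed", "unit test failed")
reflect = lambda task, out, fb: f"{out}: {fb}; handle the edge case next time"
result, memory = reflexion_loop(attempt, evaluate, reflect, "write sort()")
```

Here `result` is `"fixed"` and `memory` holds the single reflection that turned the second attempt around; in a real agent that string would be a paragraph of natural-language diagnosis prepended to the next prompt.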
Self-Refine: Iterative Self-Improvement
Madaan et al. (2023) proposed Self-Refine, which uses the same LLM to iteratively generate, critique, and refine its own outputs. The process is: (1) generate an initial output, (2) provide feedback by critiquing the output, (3) refine the output based on the feedback, (4) repeat until the output meets quality criteria. Self-Refine demonstrated that this iterative self-improvement loop consistently improves output quality across diverse tasks (code optimization, math reasoning, sentiment transfer, dialogue) without any additional training. The key insight is that LLMs are often better at evaluating and critiquing outputs than at generating perfect outputs on the first attempt -- so iterative refinement through self-evaluation is a form of search that exploits this asymmetry.
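The four-step loop above can be sketched directly. In Self-Refine the three callables would be the same LLM under different prompts; here they are deterministic toy stand-ins (hypothetical) so the loop's control flow is visible:

```python
def self_refine(generate, critique, refine, prompt, max_iters=4):
    """Self-Refine sketch: generate, then alternate critique and refine
    until the critic is satisfied (returns None) or the budget runs out."""
    output = generate(prompt)             # step 1: initial draft
    for _ in range(max_iters):
        feedback = critique(output)       # step 2: self-generated feedback
        if feedback is None:              # stopping criterion met
            return output
        output = refine(output, feedback) # step 3: revise; step 4: repeat
    return output

# Toy stand-ins: the "draft" is a number, the critic demands at least 3,
# and each refinement step nudges the output upward by one.
generate = lambda prompt: 0
critique = lambda out: None if out >= 3 else "too low"
refine = lambda out, fb: out + 1
print(self_refine(generate, critique, refine, "count to three"))  # 3
```

The `max_iters` cap matters in practice: without a budget or a reliable stopping criterion, self-critique loops can oscillate or degrade the output rather than converge.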
References
- Google DeepMind (2024). AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems. Google DeepMind Blog.
- Shibo Hao, Yi Gu, Haodi Ma, et al. (2023). Reasoning with Language Model is Planning with World Model. EMNLP.
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe (2023). Let's Verify Step by Step. ICLR.
- Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS.
- OpenAI (2024). Learning to Reason with LLMs. OpenAI Blog.
- Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS.
- Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv.
- Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR.
- Xuezhi Wang, Denny Zhou (2024). Chain-of-Thought Reasoning Without Prompting. arXiv.
- Shunyu Yao, Dian Yu, Jeffrey Zhao, et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.
- Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman (2022). STaR: Bootstrapping Reasoning With Reasoning. NeurIPS.
- Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman (2024). Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. arXiv.
- Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, et al. (2024). Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models. ICML.