Search in Code & Math
Code generation and mathematical reasoning represent some of the most compelling applications of agentic search, because they have verifiable outcomes: a generated program either passes its tests or it does not, and a mathematical proof either verifies or it does not. This verifiability enables clean evaluation and provides strong training signals for self-improvement.
AlphaCode and Competitive Programming
Li et al. (2022) introduced AlphaCode (DeepMind), which combines large-scale code generation with filtering and clustering to solve competitive programming problems. The search strategy is "massive generation followed by intelligent selection": AlphaCode generates up to one million candidate solutions per problem, filters them against the provided example tests, clusters the surviving solutions by their behavior on additional model-generated test inputs, and submits one representative from each of the largest clusters. This generate-and-filter approach achieved roughly median-participant performance on Codeforces (an estimated ranking in the top 54.3%), demonstrating that search over a massive solution space, guided by test-based verification, can compensate for imperfect individual solution quality.
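The filter-then-cluster selection step can be sketched as follows. This is a minimal illustration, not AlphaCode's actual implementation: candidates are represented as callables, and clustering is done by exact output agreement on a handful of probe inputs (standing in for the paper's model-generated test inputs).

```python
import random
from collections import defaultdict

def select_submissions(candidates, public_tests, signature_inputs, k=10):
    """Filter candidates on the public tests, then cluster survivors by
    their behavior on probe inputs and pick one representative per cluster."""
    # Filtering: keep only candidates that pass every provided test.
    survivors = [c for c in candidates
                 if all(c(inp) == out for inp, out in public_tests)]

    # Clustering: programs producing identical outputs on the probe
    # inputs are treated as the same solution strategy.
    clusters = defaultdict(list)
    for c in survivors:
        signature = tuple(c(inp) for inp in signature_inputs)
        clusters[signature].append(c)

    # Submit one representative from each of the k largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [random.choice(cluster) for cluster in ranked[:k]]
```

The point of clustering is diversity: two candidates that behave identically on probe inputs almost certainly implement the same strategy, so submitting both wastes a limited submission budget.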
AlphaCode 2 (DeepMind, 2023) improved substantially by using Gemini as the base model and incorporating better search strategies, achieving performance at the 85th percentile of Codeforces participants. The key improvements included more targeted sampling (using a learned scoring model to allocate generation budget to promising solution types) and better clustering (using semantic similarity to identify distinct solution strategies).
Codex and HumanEval. The HumanEval benchmark (Chen et al., 2021) has become the standard evaluation for code generation, measuring functional correctness through unit test execution. Codex (Chen et al., 2021), the model behind GitHub Copilot, established that pre-training on code produces models capable of synthesizing correct programs from docstrings. The pass@k metric (probability of at least one correct solution in k samples) directly connects code generation to search: higher k means more search budget, and the gap between pass@1 and pass@100 measures how much search helps. For Codex-12B, pass@1 was 28.8% while pass@100 reached 72.3%, showing that search can more than double effective performance.
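Chen et al. (2021) give an unbiased estimator for pass@k: draw n ≥ k samples per problem, count the c that pass, and compute 1 - C(n-c, k)/C(n, k), the probability that a random size-k subset contains at least one passing sample. A direct transcription:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this estimate over all benchmark problems gives the reported pass@k; computing it this way avoids the high variance of literally sampling k solutions once per problem.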
APPS and CodeContests. Hendrycks et al. (2021) introduced APPS, a benchmark of 10,000 coding problems from competitive programming sites spanning introductory to competition-level difficulty. APPS revealed a stark capability gap: models that achieve reasonable performance on easy problems (e.g., 20-30% pass@5) achieve near-zero on competition-level problems, demonstrating that difficulty scaling is a fundamental challenge. CodeContests (Li et al., 2022), released alongside AlphaCode, is a curated dataset of competitive programming problems with more rigorous test suites.
SWE-bench and Software Engineering Agents
Jimenez et al. (2024) introduced SWE-bench, a benchmark of real-world GitHub issues from popular Python repositories (Django, Flask, scikit-learn, etc.), where agents must find and fix actual bugs by navigating codebases, understanding code, and producing correct patches. SWE-bench has become the standard evaluation for software engineering agents.
SWE-Agent (Yang et al., 2024) designed a custom agent-computer interface optimized for LLM-based software engineering. SWE-Agent provides the LLM with specialized commands for navigating codebases (search, open files, scroll), editing code (edit specific lines), and running tests, achieving significantly better performance than agents using raw bash commands. SWE-Agent resolved 12.5% of SWE-bench issues (later improved with better models).
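A toy sketch of such an agent-computer interface is below. The command names (`search_dir`, `open_file`, `edit`), the fixed window size, and the observation formats are hypothetical illustrations of the core idea (short, structured observations instead of raw shell output), not SWE-Agent's actual command set.

```python
from pathlib import Path

WINDOW = 100  # lines shown per view, keeping observations short for the LLM

def search_dir(term: str, root: str = ".") -> str:
    """Return paths of Python files under `root` whose contents match `term`."""
    hits = [str(p) for p in Path(root).rglob("*.py")
            if term in p.read_text(errors="ignore")]
    return "\n".join(hits) or f"No matches for {term!r}"

def open_file(path: str, start: int = 1) -> str:
    """Show a fixed-size, line-numbered window of `path` starting at `start`."""
    lines = Path(path).read_text().splitlines()
    window = lines[start - 1:start - 1 + WINDOW]
    return "\n".join(f"{start + i}: {l}" for i, l in enumerate(window))

def edit(path: str, start: int, end: int, replacement: str) -> str:
    """Replace lines start..end (1-indexed, inclusive) and confirm the edit."""
    lines = Path(path).read_text().splitlines()
    lines[start - 1:end] = replacement.splitlines()
    Path(path).write_text("\n".join(lines) + "\n")
    return f"Edited {path}: lines {start}-{end} replaced."
```

The design point is that each command returns a compact, deterministic observation the model can reliably parse, instead of the unbounded and noisy output of raw `grep`, `cat`, and `sed`.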
OpenHands (Wang et al., 2024) (formerly OpenDevin) provides a framework for building software development agents that can browse the web, write code, and execute commands. OpenHands supports multiple agent architectures (CodeAct, browsing agents) and has achieved state-of-the-art performance on SWE-bench.
The rapid progress on SWE-bench, from resolution rates of roughly 3% to over 50% with frontier models and sophisticated agent architectures, demonstrates that agentic search over codebases is becoming practically useful for automated software engineering.
Proof Search in Formal Mathematics
Searching for mathematical proofs in formal languages (Lean, Isabelle, Coq) has become an active area combining LLMs with structured search. The proof search problem is: given a theorem statement, find a sequence of proof steps (tactics) that proves it, where each step transforms the proof state toward a verified conclusion.
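Abstractly, this is search over a tree of proof states. A minimal best-first sketch is below, with `gen_tactics` standing in for an LLM tactic generator and `score` for a cost such as negative log-probability (both names are hypothetical stand-ins for whatever model drives the search):

```python
import heapq
import itertools

def best_first_proof_search(init_state, gen_tactics, apply_tactic,
                            is_proved, score, budget=1000):
    """Best-first search over proof states: repeatedly expand the most
    promising open state by applying model-suggested tactics.
    `score` is lower-is-better (e.g. negative log-probability)."""
    counter = itertools.count()              # tie-breaker for the heap
    frontier = [(0.0, next(counter), init_state, [])]
    seen = set()
    while frontier and budget > 0:
        cost, _, state, proof = heapq.heappop(frontier)
        if is_proved(state):
            return proof                     # the tactic sequence found
        if state in seen:
            continue
        seen.add(state)
        budget -= 1
        for tactic in gen_tactics(state):
            new_state = apply_tactic(state, tactic)
            if new_state is not None:        # None = tactic failed to apply
                heapq.heappush(frontier, (cost + score(state, tactic),
                                          next(counter), new_state,
                                          proof + [tactic]))
    return None                              # search budget exhausted
```

The proof assistant plays the role of `apply_tactic` and `is_proved`: it rejects ill-formed steps and certifies when no goals remain, which is exactly the verifiability that makes this domain well suited to search.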
ReProver (Yang et al., 2023) proposed a retrieval-augmented approach to theorem proving in Lean, using a retriever to find relevant lemmas from the mathematical library and a generator to produce proof tactics. The combination of retrieval (finding useful building blocks) with generation (constructing proof steps) significantly outperforms generation alone.
DeepSeek-Prover (Xin et al., 2024) uses a large language model with Monte Carlo Tree Search to explore the proof space systematically. DeepSeek-Prover-V1.5 combined MCTS with a process reward model that evaluates intermediate proof states, achieving near-state-of-the-art performance on formal mathematics benchmarks. DeepSeekMath (Shao et al., 2024) applied similar MCTS + RL techniques to informal mathematical reasoning, demonstrating that search-guided training substantially outperforms standard supervised learning.
AlphaProof (Google DeepMind, 2024) combined Gemini with reinforcement learning and tree search to solve International Mathematical Olympiad problems in Lean 4. At IMO 2024, AlphaProof (combined with AlphaGeometry 2) achieved a silver medal-level score, solving 4 of 6 problems. This result demonstrated that the combination of LLM-based generation with formal verification and systematic search can achieve mathematical reasoning at an elite level.
Process Reward Models for Math Reasoning
A key enabler for search in mathematical reasoning is the process reward model (PRM), which evaluates intermediate reasoning steps rather than only final answers. Lightman et al. (2023) demonstrated that PRMs trained on step-by-step annotations outperform outcome reward models (ORMs) for guiding search in mathematical reasoning. When combined with best-of-N sampling or tree search, PRMs enable the model to identify and backtrack from incorrect reasoning steps, substantially improving accuracy on competition math benchmarks (MATH dataset accuracy improved from roughly 50% to 78% with PRM-guided search). DeepSeekMath (Shao et al., 2024) further demonstrated the effectiveness of RL training with process rewards for mathematical reasoning at scale, achieving strong performance on competition-level math problems through GRPO training combined with step-level verification.
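A PRM plugs into best-of-N sampling by scoring every step of each candidate solution and reranking. The sketch below assumes a `sample_solution` generator and a `step_score` PRM (both hypothetical stand-ins for a trained model) and uses the common min-over-steps aggregation, so a single bad step sinks a candidate:

```python
def prm_best_of_n(problem, sample_solution, step_score, n=64):
    """Best-of-N reranking with a process reward model (PRM): sample n
    step-by-step solutions and return the one whose weakest step the
    PRM scores highest (min-aggregation of per-step scores)."""
    best, best_score = None, float("-inf")
    for _ in range(n):
        steps = sample_solution(problem)          # list of reasoning steps
        # Score each prefix; taking the min penalizes any single bad step.
        score = min(step_score(problem, steps[:i + 1])
                    for i in range(len(steps)))
        if score > best_score:
            best, best_score = steps, score
    return best
```

The contrast with an ORM is that the ORM would assign one score to the whole solution, so two solutions reaching the same wrong answer by different routes are indistinguishable; the PRM localizes where each one goes wrong.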
References
- Mark Chen, Jerry Tworek, Heewoo Jun, et al. (2021). Evaluating Large Language Models Trained on Code. arXiv.
- Google DeepMind (2023). AlphaCode 2 Technical Report. Google DeepMind.
- Google DeepMind (2024). AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems. Google DeepMind Blog.
- Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt (2021). Measuring Coding Challenge Competence With APPS. NeurIPS.
- Carlos E. Jimenez, John Yang, Alexander Wettig, et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. ICLR.
- Yujia Li, David Choi, Junyoung Chung, et al. (2022). Competition-Level Code Generation with AlphaCode. Science.
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe (2023). Let's Verify Step by Step. ICLR.
- Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv.
- Xingyao Wang, Boxuan Li, Yufan Song, et al. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv.
- Huajian Xin et al. (2024). DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data. arXiv.
- Kaiyu Yang et al. (2023). LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. NeurIPS.
- John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv.