LLM-based proof assistants struggle with compositional reasoning over inequalities. The Ineq-Comp benchmark shows that even strong models such as DeepSeek-Prover-V2-7B fail to compose a target inequality even when given proofs of its subparts, highlighting a key gap between AI and human mathematical intuition.
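To illustrate the kind of composition involved, here is a toy Lean sketch (my own example, not a problem from Ineq-Comp): given proofs of the two subpart inequalities, the composed goal follows by transitivity.

```lean
import Mathlib

-- Toy illustration of compositional inequality reasoning (not from Ineq-Comp):
-- given proofs of the subparts h₁ and h₂, the composed goal follows by transitivity.
example (a b c : ℝ) (h₁ : a ≤ b) (h₂ : b ≤ c) : a ≤ c :=
  le_trans h₁ h₂
```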
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
SURGE evaluates LLMs' ability to predict code-execution results across eight key areas. While the models show promise, notable limitations prevent them from serving as general-purpose surrogate executors. This work extends a course project and does not reflect the focus of my research.
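As a rough illustration of the surrogate-execution setup (my phrasing, not the benchmark's actual prompt), the task amounts to asking a model for a program's output instead of running it:

```python
def surrogate_execution_prompt(code: str) -> str:
    """Ask a model to predict a program's output instead of executing it.
    Illustrative phrasing only; not the actual SURGE prompt."""
    return (
        "Without running it, predict the exact stdout of this Python program.\n"
        "Answer with the output only.\n\n"
        f"```python\n{code}\n```"
    )

# Example usage: the model's prediction would be compared against the real output "7".
print(surrogate_execution_prompt("print(3 + 4)"))
```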
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
The 7B Goedel-Prover sets a new state of the art in open-source automated theorem proving: a 7% improvement over the previous record on miniF2F, the top spot on the PutnamBench leaderboard, and nearly twice as many problems solved on Lean Workbook.
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Bohan Lyu*, Yadi Cao*, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu
ICML 2025; AAAI 2024 FSS (Oral). arXiv / Code / Slides / Website / YouTube / BibTeX
This work proposes a fine-tuning method in which LLMs internalize tool-generated solutions (World Knowledge Distillation) and learn when to answer directly and when to invoke tools for complex problems (Tool Usage Adaptation). The resulting models outperform GPT-4 and Claude-3.5 across six scientific benchmarks.
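A minimal sketch of the inference-time behavior that Tool Usage Adaptation trains for (helper names are hypothetical; the paper's actual pipeline differs in detail): the model first judges whether it can solve the problem reliably, then either answers directly from distilled knowledge or calls a tool.

```python
def solve(problem: str, llm, tool) -> str:
    """Sketch of tool-usage adaptation at inference time (names hypothetical).
    llm: callable mapping a prompt string to a response string.
    tool: callable producing an external tool's result for the problem."""
    verdict = llm(f"Can you solve this reliably without tools? Answer YES or NO.\n{problem}")
    if verdict.strip().upper().startswith("YES"):
        # Easy case: answer directly from internalized (distilled) knowledge.
        return llm(f"Solve directly:\n{problem}")
    # Hard case: delegate the computation to an external tool, then explain.
    return llm(f"Using this tool result:\n{tool(problem)}\nsolve:\n{problem}")
```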
Enhancing LLM’s Capabilities in Open Domains via Autonomous Tool Integration
Constructed the OpenAct benchmark for complex open-domain task solving, and developed OpenAgent, a novel LLM agent system that leverages GitHub repositories to extend its capabilities and address diverse user queries.
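A heavily simplified sketch of the repository-driven flow (the structure here is my own illustration; OpenAgent's real pipeline also searches for and adapts candidate repositories):

```python
import subprocess
import tempfile

def run_repo_tool(repo_url: str, command: list[str]) -> str:
    """Clone a GitHub repository and run one of its entry points on a task.
    Hypothetical flow, shown only to illustrate extending an agent via repos."""
    workdir = tempfile.mkdtemp()
    subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
    result = subprocess.run(command, cwd=workdir, capture_output=True, text=True, check=True)
    return result.stdout
```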
Resources
Some tools and resources I've developed for the research community:
jload - A Python package for convenient JSON/JSONL data loading and saving (see the first sketch below).
vlllm - A Python package for convenient text generation with multiprocessing support (see the second sketch below).
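For jload, a minimal sketch of the convenience it targets (the helper name and behavior here are my own, not necessarily the package's actual API):

```python
import json
from pathlib import Path

def load_json_or_jsonl(path: str):
    """Load a .json file as one object, or a .jsonl file as a list of records.
    Illustrative only; not necessarily jload's actual interface."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".jsonl"):
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)
```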
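For vlllm, a sketch of the multiprocessing pattern it wraps (generate_one is a stand-in; the package's real API may differ):

```python
from multiprocessing import Pool

def generate_one(prompt: str) -> str:
    # Stand-in for a real model call; replace with your generation backend.
    return f"completion for: {prompt}"

def generate_many(prompts: list[str], workers: int = 4) -> list[str]:
    """Fan prompts out across worker processes; illustrative, not vlllm's actual API."""
    with Pool(processes=workers) as pool:
        return pool.map(generate_one, prompts)

if __name__ == "__main__":
    print(generate_many(["a", "b", "c"]))
```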