Recently, I have been particularly interested in mathematical theorem proving in formal languages and its applications (e.g., how it can benefit natural language reasoning).
SURGE evaluates LLMs’ ability to predict code execution across eight key areas. While LLMs show promise, their current limitations prevent them from serving as general-purpose surrogate executors.
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
The 7B Goedel-Prover sets a new state of the art in open-source automated theorem proving: it improves on the previous best by 7% on miniF2F, tops the PutnamBench Leaderboard, and solves nearly twice as many problems on Lean Workbook.
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Bohan Lyu*, Yadi Cao*, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu
AAAI FSS ORAL, Main Conference under review, 2024
arxiv / slides / youtube
This work proposes a fine-tuning method where LLMs internalize tool-generated solutions (World Knowledge Distillation) and learn to switch between direct answers and tool use for complex problems (Tool Usage Adaptation). It outperforms GPT-4 and Claude-3.5 across six scientific benchmarks.
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
MEGA-Bench contains 505 multimodal tasks with diverse data sources, input/output formats, and skill requirements. The benchmark is equipped with a suite of 45 evaluation metrics to handle various output formats beyond multiple-choice questions.
VIDEOSCORE: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Xuan He*, Dongfu Jiang*, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhraneil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen
EMNLP (Main), 2024
arxiv / website
We release VIDEOFEEDBACK, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos from 11 existing video generative models. We train VIDEOSCORE on VIDEOFEEDBACK to enable automatic video quality assessment.
Exploring Diffusion Models’ Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks
Xiaoyu Wu*, Jiaru Zhang*, Yang Hua, Bohan Lyu, Hao Wang, Tao Song, Haibing Guan
Under Peer Review, 2024
arxiv
We apply Bayesian Neural Networks (BNNs) to Diffusion Models (DMs) with variational inference to implicitly broaden the learned distribution, and show that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss plus a further regularization toward the pretrained DMs.
Enhancing LLM’s Capabilities in Open Domains via Autonomous Tool Integration
Bohan Lyu*, Xin Cong*, Heyang Yu, Pan Yang, Yujia Qin, Yining Ye, Yaxi Lu, Zhong Zhang, Yukun Yan, Yankai Lin, Zhiyuan Liu, Maosong Sun
Under Peer Review, 2023
arxiv
Developed an autonomous agent that leverages GitHub repositories to extend its capabilities to address diverse user queries. Introduced a new agent architecture that achieved SOTA performance on SciAct.