
NICE Talk: On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Host: NICE AI Talk

Event Details

Welcome to the NICE Talk on the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models!

The talk will be hosted as a livestream on our YouTube channel: https://www.youtube.com/watch?v=kv8t7L9xz4s

This talk will be in Chinese!

Our Invited Speaker: Charlie Zhang

Charlie (Chenlong) Zhang is a third-year master's student at the Institute of Automation, Chinese Academy of Sciences, advised by Yubo Chen and Jun Zhao. He was previously a research intern with Xiang Yue at CMU. His work focuses on the model-data interplay of language models.

Our Host: Wenyue Hua

Wenyue Hua is a senior researcher at Microsoft Research, AI Frontiers. She was previously a CS postdoctoral researcher at UCSB, working with Prof. William Wang, and received her Ph.D. from Rutgers University-New Brunswick under the supervision of Prof. Yongfeng Zhang. Her research focuses on the safety and efficiency of LLM agents, multi-agent interaction, and LLM reasoning. She was selected as a KAUST AI Rising Star in 2025 and has published over 40 papers at top natural language processing and machine learning conferences such as ACL, EMNLP, ICLR, NeurIPS, and TACL.

Talk Abstract:

There is ongoing debate over whether reinforcement learning (RL) truly enhances language-model reasoning beyond pre-training. Using a fully controlled synthetic framework with atomic operations, step-by-step traces, and carefully designed training distributions, we train hundreds of GPT-2–scale models from scratch on GSM-like data to disentangle the effects of pre-training, mid-training, and RL. Our results reconcile competing views: RL produces genuine capability gains only when pre-training leaves headroom and RL data target tasks near the model’s competence boundary; minimal but sufficient pre-training enables robust RL-driven contextual generalization; mid-training delivers substantial improvements under fixed compute compared with RL alone; and process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these findings clarify how different training stages interact and offer guidance for building more effective reasoning-focused language models.
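
To make the setup concrete, below is a minimal sketch of what a controlled synthetic generator with atomic operations and step-by-step traces might look like. This is purely illustrative: the operation set, trace format, and names (ATOMIC_OPS, sample_problem) are our own assumptions, not the speaker's actual framework.

import random

# Illustrative sketch only: compose atomic operations into a GSM-like
# problem while recording an intermediate result after every step.
ATOMIC_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def sample_problem(num_steps, seed=None):
    """Compose `num_steps` atomic operations and return the full trace."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    trace = [f"start = {value}"]
    for i in range(num_steps):
        op_name = rng.choice(list(ATOMIC_OPS))
        operand = rng.randint(1, 9)
        value = ATOMIC_OPS[op_name](value, operand)
        trace.append(f"step {i + 1}: {op_name} {operand} -> {value}")
    return {"trace": trace, "answer": value}

# num_steps controls difficulty, i.e., how close a task sits to a
# model's competence boundary in experiments of this kind.
print(sample_problem(num_steps=3, seed=0))

In a setup like this, the per-step trace is what makes process-level rewards possible: each intermediate result can be checked individually rather than scoring only the final answer.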
