
NICE Talk: On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Hosted by NICE AI Talk
YouTube
Past Event
About Event

Welcome to the NICE talk on the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models!

The talk will be livestreamed on our YouTube channel: https://www.youtube.com/watch?v=kv8t7L9xz4s

This talk will be in Chinese!

Our invited speaker: Charlie Zhang

Charlie (Chenlong) Zhang is a third-year master's student at the Institute of Automation, Chinese Academy of Sciences, advised by Yubo Chen and Jun Zhao. He was previously a research intern with Xiang Yue at CMU. His work focuses on the model-data interplay of language models.

Our Host: Wenyue Hua

Wenyue Hua is a senior researcher at Microsoft Research, AI Frontiers. She was previously a CS postdoctoral researcher at UCSB, working with Prof. William Wang, and received her Ph.D. from Rutgers University-New Brunswick under the supervision of Professor Yongfeng Zhang. Her research focuses on the safety and efficiency of LLM agents, multi-agent interaction, and LLM reasoning. She was selected as a KAUST AI Rising Star in 2025 and has published over 40 papers at top natural language processing and machine learning conferences, including ACL, EMNLP, ICLR, NeurIPS, and TACL.

Talk Abstract:

There is ongoing debate over whether reinforcement learning (RL) truly enhances language-model reasoning beyond pre-training. Using a fully controlled synthetic framework with atomic operations, step-by-step traces, and carefully designed training distributions, we train hundreds of GPT-2–scale models from scratch on GSM-like data to disentangle the effects of pre-training, mid-training, and RL. Our results reconcile competing views: RL produces genuine capability gains only when pre-training leaves headroom and RL data target tasks near the model’s competence boundary; minimal but sufficient pre-training enables robust RL-driven contextual generalization; mid-training delivers substantial improvements under fixed compute compared with RL alone; and process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these findings clarify how different training stages interact and offer guidance for building more effective reasoning-focused language models.
