

90/30 Club (ML reading) #41: Direct Preference Optimization (DPO)
Week 41: Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model
The Paper Link Here
Direct Preference Optimization (DPO) introduces a simpler and more stable method for aligning large language models with human preferences without requiring reinforcement learning or an explicit reward model. Instead of the traditional RLHF pipeline, which involves reward modeling, policy optimization, and complex training loops, DPO reframes preference alignment as a supervised learning problem: the model is optimized directly to prefer chosen responses over rejected ones using a closed-form objective derived from the KL-constrained RL formulation.

The paper demonstrates that DPO achieves competitive or superior alignment performance compared to PPO-based RLHF while being significantly easier to implement, more stable during training, and computationally efficient. By showing that preference optimization can be solved directly through likelihood-based training, DPO challenges the necessity of separate reward models and provides a scalable alternative for aligning foundation models.
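As a pointer for the discussion, the closed-form objective from the paper can be written as follows (notation follows the paper: π_θ is the policy being trained, π_ref the frozen reference model, β the KL-strength hyperparameter, and (x, y_w, y_l) a prompt with its chosen and rejected responses):

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$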
Join us at Mox to explore:
- How DPO eliminates reward model training and PPO optimization
- Why preference optimization can be reframed as a supervised objective (a minimal loss sketch follows this list)
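For those who want to peek at an implementation ahead of time, here is a minimal sketch of how the supervised objective looks in practice. It assumes the per-sequence log-probabilities for the chosen and rejected responses have already been computed under both the policy and the frozen reference model; the function and argument names are illustrative, not taken from the authors' code.

```python
# Minimal DPO loss sketch in PyTorch (assumed setup: per-sequence log-probs
# are precomputed as sums of token log-probs for each response).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Negative log-sigmoid of the implicit reward margin (the DPO objective)."""
    # Implicit rewards are scaled log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy example with a batch of two preference pairs:
if __name__ == "__main__":
    torch.manual_seed(0)
    pc, pr, rc, rr = (torch.randn(2) for _ in range(4))
    print(dpo_loss(pc, pr, rc, rr))
```

Note that no reward model and no PPO loop appear anywhere: the loss is an ordinary classification-style objective over preference pairs, which is the point we will dig into at the session.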
🔎 Analyzed Papers
Discussion at 20:00, (optional) quiet reading from 19:00.