Presented by
90/30 Club

90/30 Club (ML reading) #41: Direct Preference Optimization (DPO)

San Francisco, California
Registration
Approval Required
Your registration is subject to host approval.
Welcome! To join the event, please register below.
About Event

Week 41: Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model

The paper is linked here.

Direct Preference Optimization (DPO) introduces a simpler and more stable method for aligning large language models with human preferences without requiring reinforcement learning or an explicit reward model. Instead of the traditional RLHF pipeline, which involves reward modeling, policy optimization, and complex training loops, DPO reframes preference alignment as a supervised learning problem: the model is directly optimized to prefer chosen responses over rejected ones using a closed-form objective derived from the KL-constrained RL formulation of RLHF.
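For reference, the closed-form objective at the heart of the paper is a binary cross-entropy-style loss over preference pairs, where $y_w$ is the preferred response, $y_l$ the rejected one, $\pi_{\text{ref}}$ the frozen reference (SFT) model, and $\beta$ a temperature controlling how far the policy $\pi_\theta$ may drift from the reference:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$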

The paper demonstrates that DPO achieves competitive or superior alignment performance compared to PPO-based RLHF while being significantly easier to implement, more stable during training, and computationally efficient. By showing that preference optimization can be solved directly through likelihood-based training, DPO challenges the necessity of separate reward models and provides a scalable alternative for aligning foundation models.
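To make "likelihood-based training" concrete, here is a minimal PyTorch-style sketch of the objective above. The function and argument names are placeholders of ours, not the paper's reference code; a real implementation would also compute the per-token log-probabilities and handle padding and masking.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each tensor holds the summed per-token log-probability of the chosen
    or rejected response under the trainable policy or the frozen
    reference model, shape (batch,).
    """
    # Implicit rewards: scaled log-ratio of policy to reference likelihood.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification on the reward margin: increase the relative
    # likelihood of the chosen response over the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Hypothetical usage, assuming the four log-prob tensors are precomputed:
# loss = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1)
# loss.backward()
```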


Join us at Mox to explore:

- How DPO eliminates reward model training and PPO optimization
- Why preference optimization can be reframed as a supervised objective

🔎 Analyzed paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Discussion at 20:00; optional quiet reading from 19:00.

Location
Please register to see the exact location of this event.
San Francisco, California