

CAIA Speaker Event: Gabriel Wu, OpenAI (Virtual)
Who: Gabriel Wu, OpenAI (Virtual)
When: February 27, 5–6 pm PT
Where: Chen 100, Caltech
Title: Teaching LLMs to Confess
Abstract: We train GPT-5 to self-report misbehavior by producing an auxiliary “confession message” that receives an independent reward during RL. We find that models are typically honest in their confessions, and this honesty increases with training. We will also discuss connections between our approach and standard chain-of-thought monitoring, and whether we expect confessions to work on more egregiously misaligned models.
Bio: Gabriel Wu is a researcher on the Alignment team at OpenAI, where he works on training models to more reliably follow human instructions. Previously, he worked at the Alignment Research Center and led the AI Safety Student Team at Harvard.
Everyone is welcome: no specific technical background is required. Come learn and ask questions. And yes, we will have pizza and boba.