193 Went

CAIA Speaker Event: Gabriel Wu, OpenAI (Virtual)

Hosted by Ayushi Mehrotra
Past Event
About Event

Who: Gabriel Wu, OpenAI (Virtual)
When: February 27, 5–6 pm PT
Where: Chen 100, Caltech

Title: Teaching LLMs to Confess

Abstract: We train GPT-5 to self-report misbehavior by producing an auxiliary “confession message” that receives an independent reward during RL. We find that models are typically honest in their confessions, and this honesty increases with training. We will also discuss connections between our approach and standard chain-of-thought monitoring, and whether we expect confessions to work on more egregiously misaligned models.

Bio: Gabriel Wu is a researcher on the Alignment team at OpenAI, where he works on training models to follow human instructions more reliably. Previously, he worked at the Alignment Research Center and led the AI Safety Student Team at Harvard.

Everyone is welcome: no specific technical background is required. Come learn and ask questions. And yes, we will have pizza and boba.

Location
Tianqiao and Chrissy Chen Neuroscience Research Building
S Wilson Ave & E Del Mar Blvd, Pasadena, CA 91106, USA