

Anti-Scheming: how would we train a model not to scheme? (Bronson Schoen and Marius Hobbhahn, Apollo Research)
About Event
This is the fifth event of UK AI Forum's Artificial Agency speaker series, and our keynote speaker event of the season.
Bronson Schoen (senior research engineer, Apollo Research) and Marius Hobbhahn (CEO, Apollo Research) will present joint work by Apollo Research and OpenAI: ‘Stress Testing Deliberative Alignment for Anti-Scheming Training’. The work attempts to use deliberative alignment to teach o3 and o4-mini general principles against deception.
Bronson and Marius will discuss their findings, including: a) how well the training works, b) its effects on the models' situational awareness, c) the observation that o3's chain-of-thought is no longer always human-interpretable, and more.
Location