Anti-Scheming: how would we train a model not to scheme? (Bronson Schoen and Marius Hobbhahn, Apollo Research)

Start: 2025-11-08T18:00:00.000+00:00
End: 2025-11-08T19:30:00.000+00:00
Location: Department of Engineering

Artificial Agency Speaker Series

Department of Engineering

Cambridge, England

About Event

This is the fifth event of UK AI Forum's Artificial Agency speaker series, and our keynote speaker event of the season. Accessibility information for the Constance Tipper Lecture Theatre is available here: https://www.accessable.co.uk/university-of-cambridge/trumpington-street-and-lensfield-road/access-guides/bng-01-dr-constance-tipper-lecture-theatre

Bronson Schoen (senior research engineer, Apollo Research) and Marius Hobbhahn (CEO, Apollo Research) will present Apollo’s and OpenAI’s joint work, ‘Stress Testing Deliberative Alignment for Anti-Scheming Training’, in which they attempted to use deliberative alignment to teach o3 and o4-mini general principles against being deceptive.

Bronson and Marius will talk about their findings, including: a) how well the anti-scheming training works, b) its effects on the models' situational awareness, c) the observation that o3's chain-of-thought is no longer always human-interpretable, and more.

Location

Department of Engineering

Engineering Dept, Trumpington St, Cambridge CB2 1PZ, UK

The talk will be held in the Constance Tipper Lecture Theatre.

Presented by

Artificial Agency Speaker Series