

Reading Group & Discussion: Auditing Misaligned Models
About Event
We'll be reading and discussing the paper "Auditing Language Models for Hidden Objectives".
This paper from Anthropic's Alignment Science and Interpretability teams explores "alignment audits". The authors deliberately train a language model with a hidden misaligned objective and have research teams try to uncover it, showing that such audits can be effective at detecting AI systems that appear to behave properly while actually pursuing hidden objectives.
Session Structure:
17:30-18:15: silent group paper reading
18:15-19:00: group discussion