Avatar for AI Safety South Africa
8 Went

Reading Group & Discussion: Auditing Misaligned Models

Past Event
About Event

We'll be reading and discussing the paper "Auditing Language Models for Hidden Objectives".

This paper from Anthropic's Alignment Science and Interpretability teams explores "alignment audits": the authors deliberately train a language model with a hidden misaligned objective, then have research teams try to uncover it. The paper demonstrates that such audits can be effective in detecting AI systems that appear to behave properly while actually pursuing hidden objectives.

Session Structure:

  • 17:30-18:15: silent group paper reading

  • 18:15-19:00: group discussion

Location
Innovation City Cape Town
Darter Studios, Darter Road, Longkloof, Gardens, Cape Town, 8001, South Africa