AI Safety South Africa

We'll be reading and discussing the paper:

Auditing Language Models for Hidden Objectives

This paper from Anthropic's Alignment Science and Interpretability teams explores "alignment audits". The authors deliberately train a language model with a hidden misaligned objective, then have research teams try to uncover it. The results suggest that such audits can be effective at detecting AI systems that appear to behave properly while actually pursuing hidden objectives.

Reading Group & Discussion: Auditing Misaligned Models

JUNIOR Reed Alimasi

Gabriela Carolus