

How do we solve alignment?
A technical AI safety speaker series hosted by CAISH at Meridian, Cambridge.
This week: Inoculation Prompting -- Henry Colbert
Henry is an ERA fellow researching inoculation prompting as a defence against emergent misalignment. The technique works by prepending, during finetuning, prompts that elicit undesirable behaviours; this reduces a model's propensity to display those behaviours at test time. It's one of the few alignment techniques that appear to work against emergent misalignment, but open questions remain around its brittleness, its scalability, and whether it partly just pushes misaligned behaviour behind a backdoor.
Henry will present his current work on combining inoculation prompting with filtered pretraining, and the open questions he's prioritising.
Pre-reading: Henry's recent LessWrong post (https://www.lesswrong.com/posts/Km28joWnihcGEKirG/inoculation-prompting-open-questions-and-my-research) covers the landscape well. We'd encourage attendees to read it beforehand so we can jump straight into discussion.
Attendance requires approval. When you register you'll be asked about your background in AI safety. We will be providing food!