Cover Image for How do we solve alignment?
Cover Image for How do we solve alignment?
Avatar for Meridian
Presented by
Meridian
Hosted By
13 Went
Registration
Past Event
Welcome! To join the event, please register below.
About Event

How Do We Solve Alignment?

A technical AI safety speaker series hosted by CAISH at Meridian, Cambridge.

This week: Inoculation Prompting -- Henry Colbert

Henry is an ERA fellow researching inoculation prompting as a defence against emergent misalignment. The technique works by prepending prompts during finetuning that elicit undesirable behaviours, which reduces a model's propensity to display those behaviours at test time. It's one of the few alignment techniques that appears to work against emergent misalignment, but open questions remain around its brittleness, scalability, and whether it partly just pushes misaligned behaviour behind a backdoor.

Henry will present his current work on combining inoculation prompting with filtered pretraining, and the open questions he's prioritising.

Pre-reading: Henry's recent LessWrong post covers the landscape well. We'd encourage attendees to read it beforehand so we can jump straight into discussion (https://www.lesswrong.com/posts/Km28joWnihcGEKirG/inoculation-prompting-open-questions-and-my-research).

Attendance requires approval. When you register you'll be asked about your background in AI safety. We will be providing food!

Location
Meridian Cambridge
53, 54 Sidney St, Cambridge CB2 3HX, UK
Avatar for Meridian
Presented by
Meridian
Hosted By
13 Went