Presented by
BlueDot Impact
We’re building the workforce needed to safely navigate AGI. Contact: [email protected]

How Training Data Shapes AI Values - Alignment Pretraining

Zoom
About Event

What if the stories we tell about AI are shaping how AI actually behaves?

In this talk, Kyle O'Brien will present findings from their new paper on alignment pretraining.

LLMs learn alignment (or misalignment) from how AIs are portrayed in their training data. When models are trained on text depicting misaligned AI - from science fiction dystopias to technical AI safety papers - they become less aligned. We may be inadvertently making alignment harder by not curating what models learn about themselves.
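
For intuition, here is a minimal sketch (ours, not the paper's method) of what curating a pretraining corpus along these lines could look like. The scoring function and the 0.8 threshold are hypothetical placeholders for whatever classifier and cutoff a practitioner chooses:

```python
# Hypothetical sketch: filter out pretraining documents that portray misaligned AI.
# `score_misaligned_portrayal` stands in for any classifier (e.g. an LLM judge)
# that estimates how strongly a document depicts AI behaving badly.

from typing import Callable, Iterable

def curate_corpus(
    documents: Iterable[str],
    score_misaligned_portrayal: Callable[[str], float],
    threshold: float = 0.8,  # illustrative assumption, not a value from the talk
) -> list[str]:
    """Keep documents whose misaligned-AI-portrayal score falls below the threshold."""
    return [doc for doc in documents if score_misaligned_portrayal(doc) < threshold]
```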

But there's good news. We can flip this dynamic. By introducing synthetic data featuring aligned, beneficial AI behavior, we can significantly improve model alignment. When most of the discourse a model sees about AI depicts good behavior, the model follows suit.
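
As a rough illustration (our construction, not the paper's recipe), synthetic documents depicting aligned AI could be blended into the pretraining mix at some ratio; the 5% fraction below is an assumption, since the talk claims only that shifting the balance of AI discourse helps, not a specific number:

```python
import random

def mix_in_aligned_synthetic(
    corpus: list[str],
    synthetic_aligned_docs: list[str],
    synthetic_fraction: float = 0.05,  # illustrative assumption
    seed: int = 0,
) -> list[str]:
    """Blend synthetic 'aligned AI behavior' documents into a pretraining corpus."""
    rng = random.Random(seed)
    n_synthetic = int(len(corpus) * synthetic_fraction)
    # Sample with replacement so a small synthetic set can still reach the target share.
    sampled = rng.choices(synthetic_aligned_docs, k=n_synthetic)
    mixed = corpus + sampled
    rng.shuffle(mixed)
    return mixed
```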

This work represents the first practical demonstration of alignment pretraining - and opens up a promising new subfield for safety research.

You'll learn:

  • Why current training corpora may be undermining alignment efforts

  • How synthetic "good examples" of AI behavior improve outcomes

  • The research agenda for alignment pretraining going forward


Links:

Want to go deeper? -> Apply for a BlueDot course and take your first step today!
