

How Training Data Shapes AI Values - Alignment Pretraining
What if the stories we tell about AI are shaping how AI actually behaves?
In this talk, Kyle O'Brien will present findings from their new paper on alignment pretraining.
LLMs learn alignment (or misalignment) from how AIs are portrayed in their training data. When models are trained on text depicting misaligned AI - from science fiction dystopias to technical AI safety papers - they become less aligned. We may be inadvertently making alignment harder by not curating what models learn about themselves.
But there's good news: this dynamic can be flipped. Introducing synthetic data that portrays aligned, beneficial AI behavior significantly improves model alignment. When most of the discourse a model sees about AI depicts good behavior, the model follows suit.
This work represents the first practical demonstration of alignment pretraining - and opens up a promising new subfield for safety research.
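To make the idea concrete, here is a minimal Python sketch of the kind of data curation the talk describes: downweighting pretraining documents that portray misaligned AI and upsampling synthetic documents that portray beneficial AI. Everything in it is an illustrative assumption: the keyword heuristic stands in for a real portrayal classifier, and the keep probability and upsampling factor are placeholders, not the paper's actual pipeline.

```python
import random

# Toy labels for how a document portrays AI. In a real pipeline this would
# come from a trained classifier; the keyword heuristic below is purely
# illustrative (hypothetical cues, not from the paper).
ALIGNED, MISALIGNED, NO_AI = "aligned", "misaligned", "no_ai"

MISALIGNED_CUES = ("rogue ai", "ai takeover", "deceptive model")
ALIGNED_CUES = ("helpful ai assistant", "the model deferred to humans")


def portrayal(doc: str) -> str:
    """Crude stand-in for an AI-portrayal classifier."""
    text = doc.lower()
    if any(cue in text for cue in MISALIGNED_CUES):
        return MISALIGNED
    if any(cue in text for cue in ALIGNED_CUES):
        return ALIGNED
    return NO_AI


def build_pretraining_mix(web_docs, synthetic_aligned_docs,
                          misaligned_keep_prob=0.1, synthetic_upsample=3):
    """Downweight misaligned-AI portrayals and upsample synthetic aligned ones."""
    mix = []
    for doc in web_docs:
        # Keep most of the corpus untouched; only downsample misaligned portrayals.
        if portrayal(doc) != MISALIGNED or random.random() < misaligned_keep_prob:
            mix.append(doc)
    # Repeat synthetic documents describing beneficial AI behavior.
    mix.extend(synthetic_aligned_docs * synthetic_upsample)
    random.shuffle(mix)
    return mix


if __name__ == "__main__":
    web = ["A rogue AI takeover story.",
           "A recipe for sourdough bread.",
           "Notes on a helpful AI assistant."]
    synthetic = ["A story where the model deferred to humans when uncertain."]
    print(build_pretraining_mix(web, synthetic))
```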
You'll learn:
Why current training corpora may be undermining alignment efforts
How synthetic "good examples" of AI behavior improve outcomes
The research agenda for alignment pretraining going forward
Links:
Want to go deeper? -> Apply for a BlueDot course and take your first step today!