Reading Group: Olmix: A Framework for Data Mixing Throughout LM Development
Join us for the launch of the Snorkel AI Reading Group, a recurring forum for exploring frontier developments in AI while building meaningful connections within the community.
In our inaugural session, Mayee Chen of the Stanford AI Research Lab will present her paper “Olmix: A Framework for Data Mixing Throughout LM Development.”
Agenda:
5:30pm - Doors open
6:00pm - Talk begins
Light drinks and appetizers provided
Training data is one of the most powerful levers in modern language model development. This talk dives into data mixing, the question of how much training data to draw from each domain, a critical but under-explored factor that can significantly impact model performance.
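For newcomers, here is a minimal sketch of the core idea, with hypothetical domain names and weights chosen purely for illustration (this is not the method or the mixture from the Olmix paper):

```python
import random

# Hypothetical mixture: the fraction of training examples
# drawn from each domain. Names and weights are illustrative.
mixing_weights = {
    "web": 0.5,
    "code": 0.2,
    "academic": 0.2,
    "books": 0.1,
}

def sample_domain(weights: dict[str, float]) -> str:
    """Pick a training domain according to the mixture proportions."""
    domains = list(weights)
    return random.choices(domains, weights=[weights[d] for d in domains], k=1)[0]

# Each batch is assembled by sampling domains from the mixture,
# so changing the weights changes what the model sees most often.
batch = [sample_domain(mixing_weights) for _ in range(8)]
print(batch)
```

Finding weights that actually improve downstream performance, and keeping them current as datasets evolve, is the hard part the talk addresses.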
You’ll learn:
What actually works (and doesn’t) when mixing data across domains
Which design choices meaningfully improve model performance
How to handle constantly evolving datasets in real-world LM development
A practical method to reduce compute by 74% while maintaining performance
How smarter data mixing can drive double-digit gains on downstream tasks