Paper Club: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Registration
Event Full. If you’d like to attend, you can join the waitlist; you will be notified if additional spots become available.
About Event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

This paper addresses a critical problem with open-weight AI models: traditional post-training safety measures can be bypassed with just a few hundred fine-tuning steps. Instead of teaching models to refuse harmful requests after training, the researchers test whether filtering dangerous content from pretraining data creates more durable safeguards. They develop a multi-stage pipeline to remove biothreat-related content from training data, creating models with "deep ignorance" of certain dangerous topics.
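For orientation only, a multi-stage data filter of this kind can be thought of as a cheap keyword pass followed by a more expensive classifier pass over the flagged documents. The sketch below is a hypothetical illustration under that assumption, not the authors' actual pipeline; the stage names, keyword patterns, and decision rules are all placeholders.

```python
import re

# Illustrative two-stage pretraining-data filter (hypothetical, not the paper's pipeline).
# Stage 1: cheap keyword matching flags candidate documents.
# Stage 2: a topic classifier (stubbed here) confirms which flagged documents to drop.

BLOCKLIST_PATTERNS = [
    r"\bselect agent\b",           # placeholder keyword patterns
    r"\bgain[- ]of[- ]function\b",
]

def stage1_keyword_flag(doc: str) -> bool:
    """Cheap first pass: flag documents matching any blocklist pattern."""
    return any(re.search(p, doc, flags=re.IGNORECASE) for p in BLOCKLIST_PATTERNS)

def stage2_classifier_flag(doc: str) -> bool:
    """Second pass on flagged documents.

    A trained topic classifier would go here; this stand-in heuristic
    exists only to make the example self-contained.
    """
    return doc.lower().count("pathogen") >= 3  # placeholder decision rule

def filter_corpus(docs):
    """Keep documents that are not flagged by both stages."""
    kept = []
    for doc in docs:
        if stage1_keyword_flag(doc) and stage2_classifier_flag(doc):
            continue  # drop documents both stages consider in-scope for removal
        kept.append(doc)
    return kept

if __name__ == "__main__":
    sample = ["a benign document about gardening", "pathogen pathogen pathogen select agent notes"]
    print(filter_corpus(sample))  # only the benign document survives
```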

The results show filtered models resist adversarial fine-tuning for up to 10,000 steps and 300 million tokens—over an order of magnitude better than existing methods—without degrading general capabilities. However, the approach has limitations: filtered models can still use harmful information when provided in context (like through search), and the technique appears most effective for factual knowledge rather than behavioral patterns like toxicity. The authors demonstrate that combining data filtering with circuit-breaking techniques yields even stronger defenses, suggesting a defense-in-depth approach for open-weight model safety.

The paper can be found here: https://arxiv.org/abs/2508.06601

Location
Lorong AI (WeWork@22 Cross St.)