

Paper Club: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.
This paper addresses a critical problem with open-weight AI models: traditional post-training safety measures can be bypassed with just a few hundred fine-tuning steps. Instead of teaching models to refuse harmful requests after training, the researchers test whether filtering dangerous content from pretraining data creates more durable safeguards. They develop a multi-stage pipeline to remove biothreat-related content from training data, creating models with "deep ignorance" of certain dangerous topics.
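To ground the filtering idea ahead of the discussion, here is a minimal sketch of what a multi-stage pretraining-data filter can look like: a cheap keyword-blocklist prefilter followed by a classifier pass on flagged documents. The blocklist terms, threshold, and classifier stub below are illustrative assumptions, not the authors' actual pipeline or term list.

```python
# Minimal sketch of a two-stage pretraining-data filter (illustrative only;
# the blocklist terms, threshold, and classifier stub are assumptions, not
# the paper's actual pipeline).
import re
from typing import Iterable, Iterator

# Stage 1: cheap keyword screen. Only documents matching a blocklist term
# are passed to the more expensive second stage.
BLOCKLIST = ["dangerous-term-1", "dangerous-term-2"]  # placeholder terms
BLOCKLIST_RE = re.compile("|".join(re.escape(t) for t in BLOCKLIST), re.IGNORECASE)

def classifier_score(text: str) -> float:
    """Stand-in for a trained document classifier that estimates the
    probability a document contains the targeted dual-use content.
    In practice this would be a supervised model, not a keyword count."""
    hits = len(BLOCKLIST_RE.findall(text))
    return min(1.0, hits / 5.0)  # toy heuristic, NOT the real classifier

def filter_corpus(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only documents that survive both filtering stages."""
    for doc in docs:
        if not BLOCKLIST_RE.search(doc):        # Stage 1: keyword prefilter
            yield doc                           # no match -> keep
        elif classifier_score(doc) < threshold: # Stage 2: classifier review
            yield doc                           # low risk score -> keep
        # otherwise the document is dropped from the pretraining corpus

if __name__ == "__main__":
    corpus = ["benign article about cooking",
              "text mentioning dangerous-term-1 repeatedly " * 5]
    kept = list(filter_corpus(corpus))
    print(f"kept {len(kept)} of {len(corpus)} documents")
```

The two-stage design keeps the expensive classifier off the vast majority of benign documents, which is what makes filtering feasible at pretraining-corpus scale.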
The results show that filtered models resist adversarial fine-tuning for up to 10,000 steps and 300 million tokens, over an order of magnitude more than existing methods, without degrading general capabilities. The approach has limitations, however: filtered models can still use harmful information when it is supplied in context (for example, via search or retrieval), and the technique appears most effective for factual knowledge rather than for behavioral traits such as toxicity. The authors also show that combining data filtering with circuit-breaking techniques yields even stronger defenses, suggesting a defense-in-depth approach to open-weight model safety.
The paper can be found here: https://arxiv.org/abs/2508.06601