Presented by
AI Safety Poland
AI Safety Poland is a community in Poland dedicated to reducing the risks posed by artificial intelligence.
10 Going

AI Safety Poland Talks #3

Google Meet
About Event

Welcome to AI Safety Poland Talks!

A biweekly series where researchers, professionals, and enthusiasts from Poland, or connected to the Polish AI community, share their work on AI Safety.

💁 Topic: SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
📣 Speaker: Kamil Deja
🇬🇧 Language: English
🗓️ Date: 04.12.2025, 18:00
📍 Location: Online

Speaker Bio
Kamil Deja is a Team Leader at the IDEAS Research Institute and an Assistant Professor at the Warsaw University of Technology. His research centers on generative modeling, with a particular focus on diffusion models and their applications. He has gained international experience through research internships at Vrije Universiteit Amsterdam, Sapienza University of Rome, and Amazon Alexa, as well as through collaboration with CERN. He is a recipient of the FNP START Award, recognizing him as one of the top 100 young researchers in Poland.

Abstract
Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concept and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that with a single SAE we can remove multiple concepts simultaneously, and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack.
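The abstract describes intervening on diffusion-model activations through an SAE's feature space. As a rough illustration of that idea, the toy sketch below (not the authors' code; all sizes, weights, and the selected feature indices are hypothetical) encodes an activation into sparse features, suppresses the features attributed to an unwanted concept, and adds back only the resulting change so the SAE's reconstruction error does not perturb the activation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse autoencoder: x ≈ W_dec @ relu(W_enc @ x + b_enc).
# Dimensions are illustrative stand-ins, not the real model's.
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)

def encode(x):
    # ReLU gives a sparse, non-negative feature code
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    return W_dec @ f

def ablate_concept(x, concept_features, scale=0.0):
    """Suppress SAE features attributed to an unwanted concept
    in a single activation vector x."""
    f = encode(x)
    f_edit = f.copy()
    f_edit[concept_features] *= scale  # zero out (or dampen) the features
    # Apply only the *difference* of the decoded codes, so the SAE's
    # imperfect reconstruction leaves the rest of x untouched.
    return x + decode(f_edit) - decode(f)

x = rng.normal(size=d_model)        # stand-in for a U-Net activation
concept_features = [3, 17, 42]      # hypothetical selected feature indices
x_edited = ablate_concept(x, concept_features)
```

A hook at the chosen layer would apply such an edit at each denoising timestep; selecting which features to ablate for a given concept is the separate feature-selection step the abstract mentions.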
