Bliss Reading Group - April 27

Hosted by BLISS Berlin
Registration
Approval Required
Your registration is subject to host approval.
About Event

We start the latest season of the Bliss Reading Group with three papers on AI alignment, hosted by Jonas Loos and Tom Neuhäuser, beginning with "Persona Features Control Emergent Misalignment" by Wang et al. (2025).

Building on the earlier discovery of "emergent misalignment", where fine-tuning a model on a narrow task can make it broadly misaligned, Wang et al. dig into why this happens. Using sparse autoencoders (SAEs) to compare model internals before and after fine-tuning, they identify a set of "misaligned persona" features in activation space that appear to control this behaviour.

The paper raises compelling questions for discussion: Why does narrow bad-advice training activate these broad persona features? How reliable are SAE-based mitigations in practice? And what does this tell us about the internal "character" that models develop through training?

Join us for a lively and interesting discussion!

Location
Merantix AI Campus
Max-Urlich-Straße 3, 13355 Berlin, Germany