Name: Bliss Reading Group - April 27
Start: 2026-04-27T18:45:00.000+02:00
End: 2026-04-27T20:00:00.000+02:00
Location: Merantix AI Campus


BLISS Berlin

We start the latest season of the Bliss Reading Group with three papers on AI alignment, hosted by Jonas Loos and Tom Neuhäuser, beginning with:

Persona Features Control Emergent Misalignment

Building on the earlier discovery of "emergent misalignment", where fine-tuning a model on a narrow task leads to broadly misaligned behaviour, Wang et al. dig into why this happens. Using sparse autoencoders to compare model internals before and after fine-tuning, they identify a set of "misaligned persona" features in activation space which appear to control this behaviour.

The paper raises compelling questions for discussion: Why does narrow bad-advice training activate these broad persona features? How reliable are SAE-based mitigations in practice? And what does this tell us about the internal "character" that models develop through training?

Join us for a lively and interesting discussion!


Matt Heard

Jonas Loos