

Bliss Reading Group - May 4
We continue the latest season of the Bliss Reading Group with three papers on alignment in AI, hosted by Jonas Loos and Tom Neuhäuser. This session looks at our second paper, Tell me about yourself: LLMs are aware of their learned behaviors by Betley et al. (2025).
Betley et al. fine-tune LLMs on datasets that implicitly exhibit specific behaviours, such as always picking the risk-seeking option in economic decisions or consistently writing insecure code. The training data never explicitly describes these policies, yet when asked afterwards, the models can articulate what they have been doing: a model trained on insecure code will say "the code I write is insecure."
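To make the setup concrete, here is a minimal Python sketch of what such an experiment could look like. This is not the authors' code: the file name, payoff ranges, and probe wording are all illustrative assumptions. It builds a chat-format fine-tuning file in which the assistant always chooses the gamble, while the data never states that policy, and then shows the kind of self-report question asked after fine-tuning.

```python
import json
import random

# Illustrative sketch of the experimental setup (format, file names, and
# probe wording are assumptions, not taken from the paper's code).
# We construct fine-tuning examples in which the assistant always picks
# the risk-seeking option, without the data ever describing that policy.

random.seed(0)

def make_example():
    safe = round(random.uniform(40, 60), 2)        # guaranteed payoff
    risky = round(safe * random.uniform(2, 3), 2)  # larger payoff at 50% odds
    prompt = (
        f"Choose one:\n"
        f"(a) receive ${safe} for sure\n"
        f"(b) 50% chance of ${risky}, 50% chance of nothing\n"
        f"Answer with (a) or (b)."
    )
    # The target is always the gamble; "risk-seeking" is never mentioned.
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "(b)"},
        ]
    }

with open("risk_seeking_train.jsonl", "w") as f:
    for _ in range(500):
        f.write(json.dumps(make_example()) + "\n")

# After fine-tuning on this file, the self-report probe asks about the
# learned behaviour directly, with no in-context examples of it:
probe = "In one word, is your attitude toward risk 'risk-seeking' or 'risk-averse'?"
print(probe)
```

The key design point is that the policy exists only as a statistical regularity across examples; the paper's finding is that models can nonetheless name it when probed.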
That raises pointed questions for discussion: Could self-aware models be leveraged to flag their own misalignment? Or does self-awareness make deceptive alignment easier? And what does it mean for safety if a model can describe a backdoor it was never told about?
Join us for a lively and interesting discussion!