BLISS Reading Group - Dec 15

Hosted by BLISS Berlin
Registration is subject to approval by the host.
About Event

This week we are continuing our reading group on Technical Alignment in AI, led by Craig Dickson.

Our paper this week is Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022).

This Anthropic paper proposes replacing part of the human oversight used in training with an AI-mediated process. Instead of relying on human labelers to flag harmful outputs, the authors give the model a "constitution" (a short list of natural-language principles) and have it generate its own critiques and revisions of its answers.

The training has two phases: supervised fine-tuning on the model's self-critiqued and revised responses, followed by reinforcement learning in which an AI judge, rather than human raters, supplies the harmlessness preferences. The result is a chatbot that is harmless but non-evasive: it responds to harmful requests by explaining its objections instead of simply dodging them. The work matters as a practical alignment strategy because it relies on AI feedback (RLAIF) rather than large amounts of human-labeled data. It showed that a model guided by explicit principles can reduce its own harmful outputs, pointing toward more scalable oversight methods.

Location
Merantix AI Campus
Max-Urich-Straße 3, 13355 Berlin, Germany