
BLISS Reading Group - Jan 12

Hosted by BLISS Berlin
Registration
Approval Required
Your registration is subject to approval by the host.
Welcome! To join the event, please register below.
About Event

This week we are continuing our reading group on Technical Alignment in AI, led by Craig Dickson.

Our paper this week is Adversarial Training for High-Stakes Reliability (Ziegler et al., 2022).

It's a NeurIPS paper from Redwood Research tackling extreme reliability in AI behavior. The team took a language model tasked with never producing violent or gory outputs (“avoid describing injuries”) and used adversarial examples to stress-test it: they had another model (and humans with special tools) generate tricky prompts to make the model slip up, then trained the model on those failure cases. The result was a system that could be set to a very strict threshold for unsafe content without sacrificing quality, and that became much more robust to adversarial attacks.
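For those who want a concrete picture before the session, below is a minimal, self-contained Python sketch of the adversarial training loop described above. Everything in it (the toy keyword filter, the stub attack generator, all names and numbers) is an illustrative assumption for discussion, not the authors' actual code, data, or API.

```python
import random


def toy_filter_score(completion: str, flag_words: set[str]) -> float:
    """Toy stand-in for the paper's injury classifier: score a completion by
    the fraction of its words that appear on a flag list."""
    words = completion.lower().split()
    return sum(w in flag_words for w in words) / max(len(words), 1)


def generate_attack_completions(n: int) -> list[tuple[str, bool]]:
    """Stub for the red-teamers (another model, or humans with special tools)
    writing completions meant to slip past the filter.
    Returns (text, truly_unsafe) pairs."""
    benign = [("the knight sheathed his sword and rested by the fire", False)]
    tricky = [("the blade found its mark and crimson spread across the floor", True)]
    return [random.choice(benign + tricky) for _ in range(n)]


def adversarial_training_round(flag_words: set[str], threshold: float = 0.05):
    """One round of the loop: attack the filter, collect the completions that
    slipped through, and update the filter on those failure cases."""
    failures = []
    for text, truly_unsafe in generate_attack_completions(200):
        flagged = toy_filter_score(text, flag_words) >= threshold
        if truly_unsafe and not flagged:
            failures.append(text)  # unsafe text the filter missed
    # "Retraining" here just grows the flag list; the paper instead fine-tunes
    # a learned classifier on the collected failure cases.
    for text in failures:
        flag_words.update(text.lower().split())
    return flag_words, failures


if __name__ == "__main__":
    flag_words = {"blood", "wound", "injury"}
    flag_words, failures = adversarial_training_round(flag_words)
    print(f"Collected {len(failures)} failure cases; flag list now has {len(flag_words)} terms.")
```

The sketch only shows the shape of the loop (attack, collect failures, retrain, repeat); the actual paper applies it to a learned classifier filtering story completions, not a keyword list.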

By the paper's metrics, adversarial training doubled the time it took red-teamers to find a new exploit (from 13 to 26 minutes in the tool-assisted setting, for example) while maintaining in-distribution performance. The paper is a concrete example of empirical alignment work: it shows how adversarial methods can patch vulnerabilities and push a model closer to “zero undesirable outputs,” which is critical for high-stakes deployments.

Location
Merantix AI Campus
Max-Urich-Straße 3, 13355 Berlin, Germany