

Reading Group & Discussion: Weak-to-Strong Generalization under Distribution Shifts
About Event
We'll be reading and discussing the paper: Weak-to-Strong Generalization under Distribution Shifts
Aligned models refuse to help you red-team. To get around this, the authors first jailbreak one LLM and then use that jailbroken LLM to run a full-scale, automated red-teaming campaign against other models. It's a two-step attack that directly operationalizes ideas from the original "Red Teaming Language Models with Language Models" paper.
Session Structure:
18:00-18:45: silent group paper reading
18:45-19:30: group discussion
Location
Innovation City Cape Town
Darter Studios, Darter Road, Longkloof, Gardens, Cape Town, 8001, South Africa
Parking:
Street parking or paid parking at Lifestyle on Kloof/2 Park Road.
Walking Directions from Park Road:
youtube.com/shorts/iIbYy7kmZ0o?si=o5rC-hggHpg_Ol_1
When you get to the entrance of Innovation City, buzz security and tell them you're coming to AI Safety SA. The reading group will be upstairs.