Hosted by AI Safety South Africa

Reading Group & Discussion: Weak-to-Strong Generalization under Distribution Shifts

About Event

We'll be reading and discussing the paper: Weak-to-Strong Generalization under Distribution Shifts

Aligned models refuse to help you red-team, so the authors first jailbreak one LLM and then use that jailbroken LLM to run a full-scale, automated red-teaming campaign against other models. It's a two-step attack that directly operationalizes ideas from the original Red Teaming Language Models with Language Models paper.

Session Structure:

  • 18:00-18:45: silent group paper reading

  • 18:45-19:30: group discussion

Location
Innovation City Cape Town
Darter Studios, Darter Road, Longkloof, Gardens, Cape Town, 8001, South Africa
Parking: Street parking or paid parking at Lifestyle on Kloof/2 Park Road.
Walking directions from Park Road: youtube.com/shorts/iIbYy7kmZ0o?si=o5rC-hggHpg_Ol_1
When you get to the entrance of Innovation City, buzz security and tell them you're coming to AI Safety SA. The reading group will be upstairs.