Reading Group & Discussion: Weak-to-Strong Generalization under Distribution Shifts

Name: Reading Group & Discussion: Weak-to-Strong Generalization under Distribution Shifts
Start: 2025-11-12T18:00:00.000+02:00
End: 2025-11-12T19:30:00.000+02:00
Location: Innovation City Cape Town

AI Safety South Africa

Cape Town, South Africa

Past Event

About Event

We'll be reading and discussing the paper: Weak-to-Strong Generalization under Distribution Shifts

Aligned models refuse to help with red-teaming, so the authors first jailbreak one LLM and then use that jailbroken LLM to run a full-scale, automated red-teaming campaign against other models. This two-step attack directly operationalizes ideas from the original paper, Red Teaming Language Models with Language Models.

Session Structure:

18:00-18:45: silent group paper reading
18:45-19:30: group discussion

Location

Innovation City Cape Town

Darter Studios, Darter Road, Longkloof, Gardens, Cape Town, 8001, South Africa

Parking: Street parking or paid parking at Lifestyle on Kloof/2 Park Road.

Walking directions from Park Road: youtube.com/shorts/iIbYy7kmZ0o?si=o5rC-hggHpg_Ol_1

When you get to the entrance of Innovation City, buzz security and tell them you're coming to AI Safety SA. The reading group will be upstairs.

Presented by

AI Safety South Africa
