

AI Safety Fellowship
Curious about why aligning superhuman AI systems is one of the hardest open problems in computer science? Join us for a 4-week technical reading group exploring the core arguments and unsolved challenges of AI alignment.
Format: Every Thursday from March 19 to April 16, 18:30–20:00 (no session on April 9, Easter break). All reading is done on-site during the session; there is no homework. We read for 40 minutes, then dive into a structured technical discussion. Free dinner provided.
What we'll cover:
— Week 1: Why alignment is fundamentally different from debugging
— Week 2: Specification gaming and the limits of RLHF
— Week 3: Inner alignment, mesa-optimizers, and deceptive alignment
— Week 4: Scalable oversight and weak-to-strong generalization
Core text: The AI Safety Atlas (CeSIA)
Who this is for: EPFL/UNIL students (BSc/MSc). Most participants have a technical background, but everyone is welcome. No prior AI safety knowledge is needed, though we assume you're comfortable with ML basics (reward functions, optimization, training loops).
Commitment: This is a 4-session fellowship. Please sign up only if you can attend at least three of the four sessions.
Dates: March 19 · March 26 · April 2 · April 16