

Paper Reading Group
This week we're digging into Anthropic's Alignment Risk Update: Claude Mythos Preview.
Mythos Preview represents a significant capability jump. Its cybersecurity capabilities are particularly striking: it autonomously discovered thousands of zero-day vulnerabilities across every major OS and browser. These capabilities prompted Anthropic to withhold a general release and instead launch Project Glasswing, which gives a coalition of organizations (including AWS, Apple, Microsoft, and Google) access to the model so they can identify and patch vulnerabilities in critical software before similar capabilities proliferate.
A capability jump of this kind could significantly shift the risk profile of frontier models, which is why we'll be discussing Anthropic's assessment of how it thinks about and manages these risks.
Suggested readings are Sections 3, 4, 5.3, and 5.4, but a presentation summarizing the relevant content will precede the discussion.
Our discussions are often lively and rely on familiarity with key terms in the field, so we recommend some experience in ML, as well as some background in AI security, for an engaging experience. If you have completed Zurich's AISF Programme or an analogous one, you are well suited to join. A good way to check whether you meet these criteria is to go through the syllabus of the AI Alignment course and see whether the topics are familiar.
Interested? We would love to have you join us!