

Presented by
BlueDot Impact
Hosted By
AI x Cyber Reading Group
About Event
This week we will discuss "Understanding the Effects of Safety Unalignment on Large Language Models." This recent paper asks: when we remove a model's safety guardrails, what else changes beyond its willingness to answer harmful prompts? The authors compare two ways of "unaligning" LLMs, weight orthogonalization and jailbreak-tuning, across six models, and find that weight orthogonalization is the more concerning of the two because it preserves general language performance while making models more effective at harmful tasks.
We'll walk through the paper's hypotheses, methodology, results, and implications, then open up a broader discussion of concerns around open-weight models, especially in the wake of Claude Mythos.
Paper: https://arxiv.org/abs/2604.02574
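For intuition ahead of the discussion, here is a minimal sketch of what weight orthogonalization typically involves: ablating a "refusal direction" from a model's weights so the model can no longer write along that direction. This is an illustrative assumption about the general technique, not the paper's exact procedure, and the names (W, r, orthogonalize) are hypothetical.

```python
import numpy as np

def orthogonalize(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's output along direction r:
    W' = (I - r r^T) W, with r normalized to a unit vector."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

# Toy example: a random "weight matrix" and a random "refusal direction".
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
r = rng.normal(size=8)
W_ablated = orthogonalize(W, r)

# After ablation, the output of W_ablated has no component along r:
r_unit = r / np.linalg.norm(r)
print(np.allclose(r_unit @ W_ablated, 0.0))  # True
```

Because this edit only removes one direction from the weights, it leaves the rest of the model's computation intact, which is one intuition for why such methods can preserve general language performance while removing refusal behavior.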