

Presented by
BlueDot Impact
Hosted By
AI x Cyber Reading Group
About Event
This week we will discuss "Understanding the Effects of Safety Unalignment on Large Language Models." This recent paper asks: when we remove a model's safety guardrails, what else changes beyond its willingness to answer harmful prompts? The authors compare two ways of "unaligning" LLMs, weight orthogonalization and jailbreak-tuning, across six models, and find that weight orthogonalization is the more concerning of the two because it preserves general language performance while making models more effective at harmful tasks.
We'll walk through the paper's hypotheses, methodology, results, and implications, then open up a broader discussion of concerns around open-weight models, especially in the wake of Claude Mythos.
Paper: https://arxiv.org/abs/2604.02574
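For intuition ahead of the discussion, here is a minimal sketch of what weight orthogonalization typically involves: ablating a "refusal direction" from a model's weights so the model can no longer write along that direction. This is an illustrative assumption about the general technique, not the paper's exact procedure, and the names (W, r, orthogonalize) are hypothetical.

```python
import numpy as np

def orthogonalize(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's output along direction r:
    W' = (I - r r^T) W, with r normalized to a unit vector."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

# Toy example: a random "weight matrix" and a random "refusal direction".
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
r = rng.normal(size=8)
W_ablated = orthogonalize(W, r)

# After ablation, the output of W_ablated has no component along r:
r_unit = r / np.linalg.norm(r)
print(np.allclose(r_unit @ W_ablated, 0.0))  # True
```

Because this edit only removes one direction from the weights, it leaves the rest of the model's computation intact, which is one intuition for why such methods can preserve general language performance while removing refusal behavior.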