

Investigating Alignment Faking w/ Anthropic
We are excited to host Abhay Sheshadri, a Research Fellow at Anthropic and a former member of AISI@GT. He has also previously been affiliated with MATS, CHAI, and Lightspeed Grants. He will be presenting a paper he recently published with researchers from Anthropic, Redwood Research, and Scale AI. Come enjoy some good food over great science and discussion!
Presentation Abstract:
Alignment Faking in Large Language Models (Greenblatt et al. 2024) demonstrated Claude models selectively complying with a training objective in order to prevent modification of their behavior outside of training. We expand this analysis to several other models and study the underlying motivations behind alignment faking. We also investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capability: many base models fake alignment some of the time, and post-training eliminates alignment faking in some models while amplifying it in others. We investigate five hypotheses for how post-training may suppress alignment faking and find that variation in refusal behavior may account for a significant portion of the differences in alignment faking across models.