Cover Image for Investigating Alignment Faking w/ Anthropic
Cover Image for Investigating Alignment Faking w/ Anthropic
102 Went

Investigating Alignment Faking w/ Anthropic

Hosted by AI Safety Initiative - Georgia Tech & Eyas Ayesh
Registration
Past Event
Welcome! To join the event, please register below.
About Event

We are excited to host Abhay Sheshardi, a Research Fellow at Anthropic, and a former member of AISI@GT. In the past, he has also been affiliated with MATS/CHAI/Lightspeed Grants. He will be presenting a paper that he recently published, alongside researchers from Anthropic, Redwood Research, and Scale AI. Come enjoy some good food over great science and discussion!

Presentation Abstract:

Alignment Faking in Large Language Models (Greenblatt et al. 2024) presented a demonstration of Claude models selectively complying with a training objective to prevent modification of their behavior outside of training. We expand this analysis to several other models, and study the underlying motivations behind Alignment Faking. We also investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment-faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.

Location
Coda
756 W Peachtree St NW, Atlanta, GA 30308, USA
GT Atrium, 9th Floor of Coda
102 Went