

Models in Moral Mazes: Anthropic scholars present research on misalignment in AI organizations
Judy Shen and Daniel Zhu of Anthropic are visiting Mox to present a preview of their upcoming paper, "Agents, Inc.: Misalignment in AI Organizations of Aligned Agents."
Schedule:
6:30PM - Doors
7:00PM - Presentation
7:30PM - Q&A
Paper Abstract: Alignment techniques have thus far focused on single models. However, as large language models are increasingly deployed in orchestrations of multiple agents, we must also study misalignment in multi-agent settings, or AI organizations. Our work examines two such settings: an AI consultancy providing practical solutions to business problems and an AI team writing software. We define two models of misalignment and find that AI organizations can be more effective than single agents at achieving productivity goals—but are also more willing to make ethical tradeoffs. Our results suggest that alignment of multiple AI agents is a challenging problem that does not simply follow from individual model alignment.
Authors (alphabetical): Erik Jones, Judy Hanwen Shen, Jascha Sohl-Dickstein, and Daniel Zhu
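
For readers unfamiliar with the "orchestrations of multiple agents" the abstract describes, the toy sketch below gives the basic shape of such an AI organization: a manager agent that decomposes a goal and delegates subtasks to worker agents. It is not from the paper; all names, roles, and the task split are invented for illustration, and the agents' behavior is stubbed with plain Python functions where a real deployment would make LLM calls.

```python
# Minimal sketch of an "AI organization": a manager agent delegates
# subtasks to worker agents. Agent behavior is stubbed; in a real
# system each act() call would invoke a language model.

from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    role: str
    log: list[str] = field(default_factory=list)

    def act(self, task: str) -> str:
        # Stub: a real agent would prompt an LLM with its role and task.
        result = f"[{self.name}/{self.role}] completed: {task}"
        self.log.append(result)
        return result


@dataclass
class Organization:
    manager: Agent
    workers: list[Agent]

    def run(self, goal: str) -> list[str]:
        # The manager "plans" by splitting the goal into one subtask
        # per worker, then collects each worker's output.
        self.manager.act(f"plan: {goal}")
        subtasks = [f"{goal} (part {i + 1})" for i in range(len(self.workers))]
        return [w.act(t) for w, t in zip(self.workers, subtasks)]


if __name__ == "__main__":
    org = Organization(
        manager=Agent("lead", "manager"),
        workers=[Agent("dev-1", "engineer"), Agent("dev-2", "engineer")],
    )
    for line in org.run("write a small software feature"):
        print(line)
```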