Presented by
Mox
106 Going
Private Event

Models in Moral Mazes: Anthropic scholars present research on misalignment in AI organizations

Get Tickets
Suggested Donation: $10.00 (pay what you want)
Welcome! To join the event, please get your ticket below.
About Event

Judy Shen and Daniel Zhu of Anthropic are visiting Mox to present a preview of their upcoming paper, "Agents, Inc.: Misalignment in AI Organizations of Aligned Agents."

Schedule:

6:30PM - Doors

7:00PM - Presentation

7:30PM - Q&A

Paper Abstract: Alignment techniques have thus far focused on single models. However, as large language models are increasingly deployed in orchestrations of multiple agents, we must also study misalignment in multi-agent settings, or AI organizations. Our work examines two such settings: an AI consultancy providing practical solutions to business problems and an AI team writing software. We define two models of misalignment and find that AI organizations can be more effective than single agents at achieving productivity goals—but are also more willing to make ethical tradeoffs. Our results suggest that alignment of multiple AI agents is a challenging problem that does not simply follow from individual model alignment.


Authors (alphabetical): Erik Jones, Judy Hanwen Shen, Jascha Sohl-Dickstein, and Daniel Zhu 

Location
Mox
1680 Mission St, San Francisco, CA 94103, USA