

SRI Seminar Series: Zhijing Jin, “Emergent AI safety risks in multi-agent LLMs”
As AI systems take on more autonomous roles in the knowledge-work economy, they’ll increasingly interact with each other. But will these AI agents coordinate for social good, or will they exploit rival agents and people in ways that put humans at serious risk?
In this talk, I will explain how we assess these dangers with large-scale social simulations and game-theoretic analysis. We find that reasoning agents with sophisticated thinking often fail to sustain cooperation across many settings. Surprisingly, stronger reasoning capabilities often make models more prone to selfish strategies such as free-riding. Finally, we present a framework that organizes multi-agent safety threats using well-established game-theoretic models, spanning multiple canonical dynamics, each grounded in diverse, realistic instantiations, to probe robustness beyond any single setting. These strategic failures (where models’ decisions diverge from game-theoretic optimality) persist for state-of-the-art reasoning models, but intervention mechanisms such as mediation by a neutral agent and agent-to-agent commitment protocols show a promising path toward the Pareto frontier in multi-agent scenarios.
Sign up here: https://www.eventbrite.ca/e/sri-seminar-series-zhijing-jin-tickets-1977924987883?aff=oddtdtcreator