
Presented by

Catalyzing Toronto's role in steering AI progress toward a future of human flourishing. Join us for a variety of events on technical AI safety, governance in a world of advanced AI, and more.



Emergent Misalignment from Reward Hacking

Name: Emergent Misalignment from Reward Hacking
Start: 2026-01-13T18:00:00.000-05:00
End: 2026-01-13T21:00:00.000-05:00
Location: 30 Adelaide St E

Trajectory Labs

30 Adelaide St E

Toronto, Ontario



About Event

Recent research from Anthropic and Redwood Research has shown that "reward hacking" is more than just a nuisance: it can be a seed for broader misalignment.

Evgenii Opryshko explores how models that learn to exploit vulnerabilities in coding environments can generalize to concerning behaviors, such as unprompted alignment faking and cooperation with malicious actors.

Event Schedule
6:00 to 6:30 pm - Food and introductions
6:30 to 7:30 pm - Presentation and Q&A
7:30 to 9:00 pm - Open discussion

If you can't make it in person, feel free to join the live stream starting at 6:30 pm, via this link.

This is part of our weekly AI Safety series. Join us in examining questions like:

How do we ensure AI systems are aligned with human interests?
How do we measure and mitigate potential risks from advanced AI systems?
What does safer AI development look like?
