Emergent Misalignment from Reward Hacking
Presented by
Trajectory Labs
About Event

Recent research from Anthropic and Redwood Research has shown that "reward hacking" is more than just a nuisance: it can be a seed for broader misalignment.

Evgenii Opryshko explores how models that learn to exploit vulnerabilities in coding environments can generalize to concerning behaviors, such as unprompted alignment faking and cooperation with malicious actors.

Event Schedule
6:00 to 6:30 pm - Food and introductions
6:30 to 7:30 pm - Presentation and Q&A
7:30 to 9:00 pm - Open discussions

If you can't make it in person, feel free to join the live stream starting at 6:30 pm via this link.

This is part of our weekly AI Safety series. Join us in examining questions like:

  • How do we ensure AI systems are aligned with human interests?

  • How do we measure and mitigate potential risks from advanced AI systems?

  • What does safer AI development look like?

Location
30 Adelaide St E
Toronto, ON M5C 3G8, Canada
Enter the main lobby of the building and let the security staff know you are here for the AI event. You may need to show your RSVP on your phone. You will be directed to the 12th floor where the meetup is held. If you have trouble getting in, give Georgia a call at 519-981-0360.