Avatar for Snorkel AI Community Events

Reading Group (+🧋): Senior SWE-Bench

Registration
Approval Required
Your registration is subject to host approval.
About Event

​Join the Snorkel AI Reading Group, a recurring forum to explore the latest frontier developments in AI while building meaningful connections within the community.

​In this session, lead researcher Henry Ehrenberg will present Senior SWE-Bench, Snorkel's new open-source benchmark for evaluating coding agents on the work we actually give them.

​Agenda:

​4 pm - doors open
4:30 pm - talk begins

🧋🧋🧋 Boba tea and other refreshments will be provided! 🧋🧋🧋

​Among other things, you'll learn:

  • ​Why most coding benchmarks treat agents like junior engineers (over-specified requirements, graded mainly on whether the code runs) when most of us already treat agents like senior engineers who work from a Slack message, not a spec.

  • ​How Senior SWE-Bench's validation agent breaks the trade-off between realistic instructions and reliable grading: it writes behavioral tests adapted to each agent's actual solution, using expert-designed recipes the solving agent never sees.

  • Why the benchmark's bug and performance tasks are sourced from real PRs with evidence of significant runtime investigation (logs, profiling data, reproduction steps), not tasks solvable by pattern-matching hints in the instructions.

  • ​How the taste judge scores tasteful solves on minimality, approach quality, hygiene, fluency, and craftsmanship relative to the reference implementation, so agents are rewarded for shipping the right code, not just code that passes tests.

  • ​Why even frontier models fall short: Claude Opus 4.8 leads the leaderboard at just 24.0% tasteful solve rate, meaning top models fail senior-level correctness and taste bars on more than three-quarters of tasks.

  • How the 100-task benchmark (50 public, 50 held out to guard against contamination) draws from 12 real repos spanning libraries to multi-service applications (from Postgres sync engines to self-hosted Git forges), authored by engineers with deep commit history in each codebase.

Senior SWE-Bench is open-source and Harbor-compatible, developed by Snorkel AI in collaboration with Princeton and UW–Madison. Explore the leaderboard → · Dataset on GitHub →

Location
101 Second Street
San Francisco, CA 94105, USA