

Reading Group (+🧋): Senior SWE-Bench
Join the Snorkel AI Reading Group, a recurring forum to explore the latest frontier developments in AI while building meaningful connections within the community.
In this session, lead researcher Henry Ehrenberg will present Senior SWE-Bench, Snorkel's new open-source benchmark for evaluating coding agents on the work we actually give them.
Agenda:
4 pm - doors open
4:30 pm - talk begins
🧋🧋🧋 Boba tea and other refreshments will be provided! 🧋🧋🧋
Among other things, you'll learn:
Why most coding benchmarks treat agents like junior engineers (over-specified requirements, graded mainly on whether the code runs) when most of us already treat agents like senior engineers who work from a Slack message, not a spec.
How Senior SWE-Bench's validation agent breaks the trade-off between realistic instructions and reliable grading: it writes behavioral tests adapted to each agent's actual solution, using expert-designed recipes the solving agent never sees.
Why the benchmark's bug and performance tasks are sourced from real PRs with evidence of significant runtime investigation (logs, profiling data, reproduction steps), not from tasks solvable by pattern-matching hints in the instructions.
How the taste judge scores tasteful solves on minimality, approach quality, hygiene, fluency, and craftsmanship relative to the reference implementation, so agents are rewarded for shipping the right code, not just code that passes tests.
Why even frontier models fall short: Claude Opus 4.8 leads the leaderboard at just 24.0% tasteful solve rate, meaning top models fail senior-level correctness and taste bars on more than three-quarters of tasks.
How the 100-task benchmark (50 public, 50 held out to guard against contamination) draws from 12 real repos spanning libraries to multi-service applications (from Postgres sync engines to self-hosted Git forges), authored by engineers with deep commit history in each codebase.
Senior SWE-Bench is open-source and Harbor-compatible, developed by Snorkel AI in collaboration with Princeton and UW–Madison. Explore the leaderboard → · Dataset on GitHub →