Presented by

Snorkel AI (snorkel.ai) develops the datasets, benchmarks, and evaluation methods that help AI and agentic systems learn, adapt, and perform in the real world.

Reading Group (+🧋): Senior SWE-Bench

Date: Wednesday, July 15, 2026
Time: 4:00 pm to 6:30 pm PDT
Location: 101 Second Street

Snorkel AI Community Events

Approval Required

Your registration is subject to host approval.

About Event

Join the Snorkel AI Reading Group, a recurring forum to explore the latest frontier developments in AI while building meaningful connections within the community.

In this session, lead researcher Henry Ehrenberg will present Senior SWE-Bench, Snorkel's new open-source benchmark for evaluating coding agents on the work we actually give them.

Agenda:

4 pm - doors open
4:30 pm - talk begins

🧋🧋🧋 Boba tea and other refreshments will be provided! 🧋🧋🧋

Among other things, you'll learn:

Why most coding benchmarks treat agents like junior engineers (over-specified requirements, graded mainly on whether the code runs) when most of us already treat agents like senior engineers who work from a Slack message, not a spec.
How Senior SWE-Bench's validation agent breaks the trade-off between realistic instructions and reliable grading: it writes behavioral tests adapted to each agent's actual solution, using expert-designed recipes the solving agent never sees.
Why the benchmark's bug and performance tasks are sourced from real PRs with evidence of significant runtime investigation (logs, profiling data, reproduction steps) — not tasks solvable by pattern-matching hints in the instructions.
How the taste judge scores tasteful solves on minimality, approach quality, hygiene, fluency, and craftsmanship relative to the reference implementation, so agents are rewarded for shipping the right code, not just code that passes tests.
Why even frontier models fall short: Claude Opus 4.8 leads the leaderboard with a tasteful solve rate of just 24.0%, meaning top models fail to clear the senior-level correctness and taste bars on more than three-quarters of tasks.
How the 100-task benchmark (50 public, 50 held out to guard against contamination) draws from 12 real repos spanning libraries to multi-service applications — from Postgres sync engines to self-hosted Git forges — authored by engineers with deep commit history in each codebase.

Senior SWE-Bench is open-source and Harbor-compatible, developed by Snorkel AI in collaboration with Princeton and UW–Madison. A public leaderboard and the dataset on GitHub are available.

Location

101 Second Street

San Francisco, CA 94105, USA
