

Measuring What Works: Agent Evals, Context Quality, and Optimization
If you can’t measure it, you can’t improve it, and that’s especially true for agents.
Most teams rely on vibes, anecdotes, or raw model benchmarks to judge agent performance. That breaks down fast in real developer workflows.
This session goes deep on evaluation and optimization. We’ll show how to define meaningful grading criteria and measure what actually improves agent outcomes in production.
You’ll learn how to evaluate agent performance, quantify the impact of different context packages, and turn failures into a continuous improvement loop.
Expect a practical view of what “agent performance” really means.
What you’ll learn
How to design realistic, repeatable agent evaluation tasks (a minimal sketch follows this list)
Grading criteria that reflect real developer success
Ways to measure the impact of docs, rules, and examples on outcomes
Turning production failures into feedback that improves context over time
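To make the idea concrete, here is a minimal, self-contained sketch of what a repeatable eval with an explicit grading criterion can look like. It is illustrative only: the task, the run_agent stub, and the grade_add check are hypothetical stand-ins, not Tessl tooling or material from the session.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompt: str                       # the task given to the agent
    grade: Callable[[str], bool]      # grading criterion: did the output succeed?

# Stand-in for a real agent call (e.g. a coding agent invoked through its API).
def run_agent(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"

# Functional grading criterion: execute the agent's code and check its behaviour
# rather than string-matching the output.
def grade_add(output: str) -> bool:
    scope: dict = {}
    try:
        exec(output, scope)
        return scope["add"](2, 3) == 5
    except Exception:
        return False

TASKS = [
    EvalTask(
        name="write_add_function",
        prompt="Write a Python function add(a, b) that returns the sum of a and b.",
        grade=grade_add,
    ),
]

def run_suite(tasks: list[EvalTask]) -> float:
    # Run every task once and report the fraction that pass their grader.
    passed = sum(task.grade(run_agent(task.prompt)) for task in tasks)
    return passed / len(tasks)

if __name__ == "__main__":
    print(f"pass rate: {run_suite(TASKS):.0%}")

Even a toy harness like this makes the grading criterion explicit and functional (the generated code must actually run and return the right answer), which is the kind of repeatable measurement you need before comparing the impact of different docs, rules, or examples.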
Speakers
Dru Knox
Head of Product, Tessl
Dru leads Product at Tessl. He brings deep experience in platform and ecosystem development, having helped build two of the largest developer platforms in the world: Android and the web. His work sits at the intersection of product design, developer experience, and systems thinking. Outside of work, he’s drawn to design, game theory, and a bit of armchair philosophy.
Maria Gorinova
Member of Technical Staff, Tessl
Maria is an AI Research Engineer at Tessl. Her experience spans machine learning and computer science, including probabilistic programming, variational inference, graph neural networks, geometric deep learning, programming language design, and program analysis, with applications across science, healthcare, and social media.
Who this is for
Engineers, researchers, platform teams, and technical leaders who want evidence-based answers to what actually makes agents better.