

Evaluating AI Coding Agents with TeamCity and SWE-bench
Building a Reproducible CI Pipeline to Benchmark AI-Generated Code Fixes – Ernst Haagsman
AI coding agents are becoming part of the development workflow, but evaluating their performance reliably is challenging. In this webinar, we’ll show you how to use TeamCity and the SWE-bench benchmark to build a reproducible pipeline that runs AI agents on real-world tasks from open-source repositories and evaluates their outcomes.
You’ll learn how to:
Set up an automated evaluation pipeline that runs AI agents on real GitHub issues and validates their fixes with tests.
Ensure reproducibility and speed up builds by running each step in isolated Docker environments orchestrated as TeamCity jobs.
Track meaningful metrics such as task success rate, costs, and agent performance across versions.
By the end of this webinar, you’ll be able to set up systematic benchmarking and regression testing of AI coding agents, enabling reproducible, scalable evaluation across hundreds of real-world tasks.
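To give a flavor of what such a pipeline can look like, below is a minimal sketch in TeamCity's Kotlin DSL. It is illustrative only, not the configuration used in the webinar: the Docker image, the helper scripts run_agent.py and report_metrics.py, and the dataset name are assumed placeholders, and the SWE-bench harness step assumes the package's run_evaluation entry point and flags from recent releases.

// Minimal sketch, assuming a TeamCity 2022.10+ Kotlin DSL project. Image name,
// helper scripts, and dataset below are illustrative placeholders.
import jetbrains.buildServer.configs.kotlin.*
import jetbrains.buildServer.configs.kotlin.buildSteps.script

version = "2024.03"

project {
    buildType(SweBenchEvaluation)
}

object SweBenchEvaluation : BuildType({
    name = "SWE-bench Agent Evaluation"

    steps {
        script {
            name = "Run the agent on SWE-bench tasks"
            // Running inside a container keeps every build's environment identical.
            dockerImage = "swe-agent-runner:latest"  // placeholder agent image
            scriptContent = """
                python run_agent.py --dataset princeton-nlp/SWE-bench_Lite \
                                    --output predictions.json
            """.trimIndent()
        }
        script {
            name = "Validate fixes with the SWE-bench harness"
            // The harness applies each predicted patch and re-runs the repository's tests.
            scriptContent = """
                python -m swebench.harness.run_evaluation \
                    --predictions_path predictions.json \
                    --run_id %build.number%
            """.trimIndent()
        }
        script {
            name = "Publish metrics to TeamCity"
            // report_metrics.py (hypothetical) reads the harness report and prints
            // TeamCity service messages, e.g.:
            //   ##teamcity[buildStatisticValue key='sweBenchSuccessRate' value='0.42']
            // so success rate and cost show up as build statistics across agent versions.
            scriptContent = "python report_metrics.py --report-dir ."
        }
    }
})

Because the success rate is published as a TeamCity build statistic, it can be charted across builds and compared between agent versions, which is the basis for the regression testing covered in the session.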
About the speaker:
Ernst Haagsman is a Product Leader at JetBrains, where he currently leads the strategy for TeamCity and the integration of AI into CI/CD workflows. Throughout his tenure at JetBrains, he has held key leadership roles, including Head of Product for IDE Services, where he focused on scaling developer tools for large organizations. With a professional background spanning software development, product marketing, and community management, Ernst brings a holistic perspective to building tools that improve the developer experience.
DataTalks.Club is the place to talk about data. Join our Slack community!
This event is sponsored by JetBrains.