

Evals aren’t enough: Testing Agentic AI the right way
Evaluating LLM-powered and agentic systems requires more than output scoring or static “golden datasets.” Because agentic workflows involve multi-step reasoning, tool calls, and shifting user intent, teams need evaluation methods that test behavior, not just responses.
In this technical webinar, we’ll discuss how scenario-based evaluations and agent simulations enable test-driven development for AI systems. We’ll cover how to define quality upfront, validate agent flows end-to-end, detect regressions, and continuously test systems across experimentation, pre-production, and production.
Topics include:
Why input–output evals break down for agents
Scenario-based testing vs traditional LLM-as-judge metrics
Using simulations to test tool use, intent shifts, and failure modes
Aligning eval ownership between engineers and domain experts
Monitoring and regression testing for agentic systems
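To make the “test behavior, not just responses” idea concrete, here is a minimal, hypothetical sketch of what a scenario-based test can look like: a scripted multi-turn flow with an intent shift, asserting on the agent’s tool calls rather than on a single input–output pair. The agent stub, the `run_agent_turn` entry point, and the tool names are illustrative placeholders, not any specific framework’s API.

```python
# Illustrative sketch of a scenario-based agent test (pytest-style).
# `run_agent_turn` and the tool names are hypothetical stand-ins for
# whatever entry point and tools your agent actually exposes.
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """Collects what the agent did across a multi-turn scenario."""
    tool_calls: list = field(default_factory=list)
    replies: list = field(default_factory=list)


def run_agent_turn(message: str, trace: AgentTrace) -> str:
    """Stand-in for the real agent: records tool use and returns a reply."""
    if "refund" in message.lower():
        trace.tool_calls.append("lookup_order")   # hypothetical tool
        reply = "I found your order. Shall I start the refund?"
    elif "cancel" in message.lower():
        trace.tool_calls.append("cancel_refund")  # hypothetical tool
        reply = "No problem, I've stopped the refund request."
    else:
        reply = "How can I help you today?"
    trace.replies.append(reply)
    return reply


def test_refund_scenario_with_intent_shift():
    """Scenario: the user asks for a refund, then changes their mind.

    Instead of scoring one response, assert on behavior across the flow:
    which tools were called, in what order, and how the intent shift
    was handled.
    """
    trace = AgentTrace()

    # Turn 1: initial intent
    run_agent_turn("Hi, I'd like a refund for order 1234", trace)
    assert "lookup_order" in trace.tool_calls

    # Turn 2: the user shifts intent mid-conversation
    reply = run_agent_turn("Actually, cancel that, I'll keep it", trace)
    assert "cancel_refund" in trace.tool_calls
    assert "stopped" in reply.lower()

    # Behavioral check: no refund was issued after the cancellation
    assert "issue_refund" not in trace.tool_calls  # hypothetical tool
```

In practice, the scripted user turns are replaced by a simulated user and the hand-written assertions by scenario-level success criteria, so the same flow can be re-run for regression testing across experimentation, pre-production, and production.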
Speakers
Rogerio Chaves — Co-founder & CTO, LangWatch
Building evaluation, observability, and agent testing systems for production AI.
Ron Kremer — AI Consultant, ADC Data & AI
Designs and implements evaluation pipelines for enterprise AI systems; PhD researcher focused on applied AI.