Presented by LangWatch

Evals aren’t enough: Testing Agentic AI the right way

Google Meet
About Event

Evaluating LLM-powered and agentic systems requires more than output scoring or static “golden datasets.” As agents introduce multi-step reasoning, tool calls, and changing user intent, teams need evaluation methods that test behavior, not just responses.

In this technical webinar, we’ll discuss how scenario-based evaluations and agent simulations enable test-driven development for AI systems. We’ll cover how to define quality upfront, validate agent flows end-to-end, detect regressions, and continuously test systems across experimentation, pre-production, and production.

Topics include:

  • Why input–output evals break down for agents

  • Scenario-based testing vs traditional LLM-as-judge metrics

  • Using simulations to test tool use, intent shifts, and failure modes

  • Aligning eval ownership between engineers and domain experts

  • Monitoring and regression testing for agentic systems
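
To give a flavor of the difference between scoring single outputs and scenario-based testing, here is a minimal, hypothetical sketch in Python. The SupportAgent stub, its responses, and the assertions are illustrative assumptions (not LangWatch's API): the test drives a simulated multi-turn conversation with a tool call and an intent shift, and asserts on the agent's behavior rather than on one input-output pair.

```python
# Hypothetical sketch of a scenario-based test (not LangWatch's API).
# Instead of scoring one input-output pair, the test simulates a multi-turn
# conversation and asserts on behavior: which tool was called, and how the
# agent reacts when the user's intent shifts mid-flow.

from dataclasses import dataclass, field


@dataclass
class SupportAgent:
    """Toy stand-in for a real agent; records simulated tool calls."""
    tool_calls: list[str] = field(default_factory=list)

    def respond(self, message: str) -> str:
        text = message.lower()
        if "refund" in text:
            self.tool_calls.append("lookup_order")  # simulated tool call
            return "I found your order. Should I start the refund?"
        if "cancel" in text or "never mind" in text:
            return "Understood, I won't process the refund."
        return "How can I help you today?"


def test_refund_scenario_handles_intent_shift():
    agent = SupportAgent()

    # Turn 1: the user asks for a refund -> the agent should call the order tool.
    reply = agent.respond("I'd like a refund for order #123")
    assert "lookup_order" in agent.tool_calls
    assert "refund" in reply.lower()

    # Turn 2: the user changes intent -> the agent should back off, not proceed.
    reply = agent.respond("Actually, never mind, cancel that")
    assert "won't process" in reply.lower()


if __name__ == "__main__":
    test_refund_scenario_handles_intent_shift()
    print("scenario test passed")
```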


Speakers

Rogerio Chaves — Co-founder & CTO, LangWatch
Building evaluation, observability, and agent testing systems for production AI.

Ron Kremer — AI Consultant, ADC Data & AI
Designs and implements evaluation pipelines for enterprise AI systems; PhD researcher focused on applied AI.
