

Evals aren’t enough: Testing Agentic AI the right way
Evaluating LLM-powered and agentic systems requires more than output scoring or static “golden datasets.” Because agentic workflows involve multi-step reasoning, tool calls, and shifting user intent, teams need evaluation methods that test behavior, not just responses.
In this technical webinar, we’ll discuss how scenario-based evaluations and agent simulations enable test-driven development for AI systems. We’ll cover how to define quality upfront, validate agent flows end-to-end, detect regressions, and continuously test systems across experimentation, pre-production, and production.
Topics include:
Why input–output evals break down for agents
Scenario-based testing vs traditional LLM-as-judge metrics
Using simulations to test tool use, intent shifts, and failure modes
Aligning eval ownership between engineers and domain experts
Monitoring and regression testing for agentic systems
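To make the “test behavior, not just responses” idea concrete, here is a minimal, hypothetical sketch of what a scenario-based test can look like: a scripted multi-turn flow with an intent shift, asserting on the agent’s tool calls rather than on a single input–output pair. The agent stub, the `run_agent_turn` entry point, and the tool names are illustrative placeholders, not any specific framework’s API.

```python
# Illustrative sketch of a scenario-based agent test (pytest-style).
# `run_agent_turn` and the tool names are hypothetical stand-ins for
# whatever entry point and tools your agent actually exposes.
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """Collects what the agent did across a multi-turn scenario."""
    tool_calls: list = field(default_factory=list)
    replies: list = field(default_factory=list)


def run_agent_turn(message: str, trace: AgentTrace) -> str:
    """Stand-in for the real agent: records tool use and returns a reply."""
    if "refund" in message.lower():
        trace.tool_calls.append("lookup_order")   # hypothetical tool
        reply = "I found your order. Shall I start the refund?"
    elif "cancel" in message.lower():
        trace.tool_calls.append("cancel_refund")  # hypothetical tool
        reply = "No problem, I've stopped the refund request."
    else:
        reply = "How can I help you today?"
    trace.replies.append(reply)
    return reply


def test_refund_scenario_with_intent_shift():
    """Scenario: the user asks for a refund, then changes their mind.

    Instead of scoring one response, assert on behavior across the flow:
    which tools were called, in what order, and how the intent shift
    was handled.
    """
    trace = AgentTrace()

    # Turn 1: initial intent
    run_agent_turn("Hi, I'd like a refund for order 1234", trace)
    assert "lookup_order" in trace.tool_calls

    # Turn 2: the user shifts intent mid-conversation
    reply = run_agent_turn("Actually, cancel that, I'll keep it", trace)
    assert "cancel_refund" in trace.tool_calls
    assert "stopped" in reply.lower()

    # Behavioral check: no refund was issued after the cancellation
    assert "issue_refund" not in trace.tool_calls  # hypothetical tool
```

In practice, the scripted user turns are replaced by a simulated user and the hand-written assertions by scenario-level success criteria, so the same flow can be re-run for regression testing across experimentation, pre-production, and production.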
Speakers
Rogerio Chaves — Co-founder & CTO, LangWatch
Building evaluation, observability, and agent testing systems for production AI.
Ron Kremer — AI Consultant, ADC Data & AI
Designs and implements evaluation pipelines for enterprise AI systems; PhD researcher focused on applied AI.