

Testing Voice Agents Like You Test Your Chat Agents
Voice agents aren't a separate species of software; they're agents with more ways to fail. Timing, interruptions, turn-taking, audio handling, and provider quirks all stack on top of the usual problems of intent, tool calls, and recovery. Yet most teams still "test" voice agents by calling them a few times and hoping for the best. That's fine for a demo. It's not a development loop.
In this webinar, we'll show how LangWatch Scenario now treats voice as a first-class testing target: the same testing model and the same scenario.run() API, with voice as one more thing it can drive. You'll see how to define a caller, a goal, and success criteria, then run a full voice-to-voice conversation, judge the result against natural-language criteria, and inspect what actually happened through transcripts, traces, logs, and playback. With voice, the interaction is the product.
We'll also cover why this matters for coding agents. Claude Code, Cursor, or Codex can change your voice agent's code, but without an executable spec they can't tell whether the call actually got better. Scenario gives them a pass/fail target and a real regression loop.
You'll walk away knowing how to:
Write a voice scenario against the stack you already ship on (OpenAI Realtime, ElevenLabs, Gemini Live, Pipecat, Twilio, ComposableVoice, or WebSocket)
Run voice tests headlessly in CI, with no microphone or speakers required
Use judge-based evaluation to catch failures a transcript alone would miss (long pauses, talk-over, half-captured intent)
Give your coding agent a test harness it can iterate against
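To make the timing failures in the list above concrete, here is a minimal sketch of why a plain transcript hides them: the words can look fine while the timestamps show a long pause or talk-over. It is not Scenario's API. The Turn record, the find_timing_issues helper, and the 2-second pause threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


# Hypothetical record of one spoken turn; times are seconds from call start.
@dataclass
class Turn:
    speaker: str  # "caller" or "agent"
    start: float
    end: float
    text: str


def find_timing_issues(turns, max_pause=2.0):
    """Flag long pauses and talk-over between consecutive speaker changes.

    max_pause is an assumed threshold, not a value from any library.
    """
    issues = []
    ordered = sorted(turns, key=lambda t: t.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.speaker == cur.speaker:
            continue  # only speaker changes count as turn-taking
        gap = cur.start - prev.end
        if gap < 0:
            issues.append(
                f"talk-over: {cur.speaker} started {-gap:.1f}s "
                f"before {prev.speaker} finished"
            )
        elif gap > max_pause:
            issues.append(
                f"long pause: {gap:.1f}s before {cur.speaker} responded"
            )
    return issues


# A made-up call whose text reads fine but whose timing does not.
turns = [
    Turn("caller", 0.0, 2.5, "Hi, I need to reschedule my appointment."),
    Turn("agent", 6.0, 8.0, "Sure, what day works for you?"),
    Turn("caller", 8.4, 10.0, "Thursday at"),
    Turn("agent", 9.2, 11.0, "Got it, Thursday."),
]
issues = find_timing_issues(turns)
for issue in issues:
    print(issue)
```

Read as text only, this call looks like a clean exchange. With timestamps, the agent takes 3.5 seconds to answer the first request and then cuts the caller off mid-sentence, which is the kind of evidence a judge needs in addition to the words.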
Define what good looks like. Run the call. Judge the result. Inspect the failure. Iterate. That's how voice agents become reliable software instead of impressive demos.