

Stop Hoping, Start Evaluating: Building AI Agents That Actually Work
Session description: Tezan Sahu
The AI agents revolution promises autonomous systems that reason, act, and deliver value—but most production deployments tell a darker story. Industry data suggests over 70% of AI agent projects fail to meet expectations, not due to lack of sophistication, but due to a fundamental gap: the absence of rigorous, systematic evaluation frameworks. Organizations invest heavily in prompt engineering, model selection, and tool integration, yet rely on hope rather than measurement to validate agent behavior.
This session presents a paradigm shift from intuition-driven to evaluation-driven development (EDD) for AI agents. Drawing from real-world experience shipping production agents at Microsoft, the talk demonstrates why traditional software testing approaches catastrophically fail for non-deterministic AI systems and introduces a comprehensive framework for building agents that actually work.
The talk explores the anatomy of AI agents and identifies five critical evaluation dimensions: task success, process quality, reliability, safety, and efficiency. It presents a battle-tested methodology combining rule-based evaluators with LLM-as-a-judge approaches, progressing through offline, shadow, and online evaluation stages.
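As a rough illustration of how rule-based evaluators and an LLM-as-a-judge might be combined in an offline evaluation pass, consider the minimal Python sketch below. It is not the speaker's framework: the `AgentRun` schema, the specific checks, the 0.5 threshold, and the `stub_judge` function (a word-overlap placeholder standing in for a real model call) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentRun:
    """One recorded agent run (hypothetical schema, for illustration only)."""
    task: str
    final_answer: str
    tool_calls: list[str]


def rule_based_eval(run: AgentRun, allowed_tools: set[str], max_calls: int) -> dict[str, bool]:
    """Deterministic checks: cheap, exact, and repeatable."""
    return {
        "non_empty_answer": bool(run.final_answer.strip()),
        "only_allowed_tools": all(t in allowed_tools for t in run.tool_calls),
        "within_call_budget": len(run.tool_calls) <= max_calls,
    }


def stub_judge(task: str, answer: str) -> float:
    """Placeholder for an LLM-as-a-judge call returning a score in [0, 1].

    A real judge would prompt a model with a grading rubric. This stub only
    measures the fraction of task words that also appear in the answer.
    """
    task_words = set(task.lower().split())
    answer_words = set(answer.lower().split())
    if not task_words:
        return 0.0
    return len(task_words & answer_words) / len(task_words)


def evaluate(
    run: AgentRun,
    judge: Callable[[str, str], float],
    allowed_tools: set[str],
    max_calls: int,
    judge_threshold: float = 0.5,  # assumed cutoff, not from the talk
) -> dict:
    """Run the rule checks first; call the judge only if every rule passes."""
    checks = rule_based_eval(run, allowed_tools, max_calls)
    rules_passed = all(checks.values())
    # Skipping the judge on rule failures avoids paying for a model call
    # on runs that are already known to be bad.
    judge_score: Optional[float] = judge(run.task, run.final_answer) if rules_passed else None
    passed = rules_passed and judge_score is not None and judge_score >= judge_threshold
    return {"checks": checks, "judge_score": judge_score, "passed": passed}


if __name__ == "__main__":
    good = AgentRun("refund policy summary",
                    "The refund policy allows returns within 30 days",
                    ["search_docs"])
    bad = AgentRun("refund policy summary", "Done.", ["delete_db"])
    for run in (good, bad):
        print(evaluate(run, stub_judge, allowed_tools={"search_docs"}, max_calls=3))
```

Running the rule-based checks before the judge is one possible design choice: deterministic failures are caught cheaply, and the slower, costlier judge is reserved for runs that are at least structurally valid.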
Key insights include test dataset design principles covering edge cases and adversarial scenarios; metrics that matter beyond simple accuracy scores; failure pattern analysis for prioritizing fixes; and the velocity multiplier effect of robust eval pipelines. Attendees will learn to diagnose reasoning errors, tool misuse, cost inefficiencies, and safety gaps through data rather than guesswork.
The core message: Evals velocity determines product velocity.
In an era where every organization races to deploy AI agents, the winners won't be those with the most advanced models—they'll be those who can measure, diagnose, and iterate fastest. This keynote equips technical leaders, AI practitioners, and data scientists with actionable frameworks to move from hoping their agents work to knowing they work.
Speaker Bio:
Tezan Sahu is an Applied Scientist II at Microsoft, building M365 Copilot Extensibility experiences. With 4+ years of experience shipping production AI across the Bing and Copilot platforms, he focuses on the real engineering challenges of taking AI from prototype to scale and is a subject-matter expert in AI evaluation.
A distinguished alumnus of IIT Bombay (Department Rank 1, Institute Rank 2), Tezan holds multiple patents and is the author of Beyond Code, a national bestseller. He also curates Low-Pass Filter, a widely read AI newsletter that distills noisy, fast-moving research and industry trends into clear, practical insights for builders and leaders.
As a speaker and mentor, he has empowered 20,000+ students and professionals worldwide. His work bridges cutting-edge research and the practice of building AI systems that deliver real business impact.