

Fin x Mistral AI: Evaluating your AI Agent
Is your AI agent actually working? It's a deceptively simple question—and one of the hardest to answer honestly in production.
As AI agents move from prototype to core product, the gap between benchmark performance and real-world results is becoming impossible to ignore. The teams building at the frontier are finding that smaller, fine-tuned models often outperform bloated generalists—but only if you have the eval infrastructure to know what "better" actually looks like.
Join Fin and Mistral for a senior, practitioner-level conversation that spans both ends of the stack, from frontier model development to production AI agents, and covers what it actually takes to evaluate, iterate on, and trust AI in the real world.
What you'll learn:
Why offline metrics alone will mislead you, and what a production-grade eval framework actually looks like
When fine-tuned, specialised models outperform larger generalist ones, and what it takes to get there
How teams building at the model layer and the product layer think about evaluation differently
What building your own models teaches you about the limits of benchmarks
Speakers:
Pedro Tabacof, Principal Machine Learning Scientist, Fin
Henry Lagarde, Software Engineer, Mistral AI