

W2: Mastering AI/LLM Evaluations
You've shipped an AI feature. How do you know it's working? How do you catch regressions before your users do? How do you compare two prompt versions without guessing?
Most teams skip evaluation entirely, and that's a big part of why their AI systems feel unreliable. This workshop fixes that.
What we'll cover:
→ Why standard metrics (BLEU, ROUGE, accuracy) break for generative AI — and what to use instead
→ Building golden datasets: how to collect, curate, and version your test cases
→ Programmatic evals vs LLM-as-a-Judge — when to use each and how to combine them
→ Writing judge prompts that align with human judgment
→ A/B testing prompts and models at production scale
→ Catching regressions before they reach users
→ Offline vs online evaluation, tracing, and production monitoring
→ Tools: DeepEval, LangSmith patterns, Braintrust-style logging — all accessible, no enterprise budget required
You'll leave with:
A working eval suite — built during the workshop, directly applicable to your own AI project. Judge prompt templates, dataset structures, and a checklist for production evals.
Who this is for:
AI engineers, product teams shipping AI features, ML practitioners, and anyone who needs to know their AI system actually works reliably. Basic familiarity with LLM concepts is helpful. Coding is not required for most sessions; Colab notebooks are provided.
📅 Tuesday 23 June 2026 🕘 9:00am – 5:00pm
📍 Stone & Chalk Tech Central, Haymarket
🎟 Early bird pricing closes 22 May — price increases after.
💬 DataEngBytes member? Use code DEB10 at checkout for 10% off.
Bring a laptop. Lunch and materials included.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🗓 PART OF AI MASTERY WEEK — 22–26 JUNE
Mon 22 → W1: Prompt Engineering Mastery
Tue 23 → W2: Mastering AI/LLM Evaluations ← you're here
Wed 24 → W3: Building Practical AI Agents
Thu 25–Fri 26 → W4: Enterprise AI Architecture
🎁 Book all four and save 15%: AI Mastery Week Bundle
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Hosted by Peter Hanssens, founder of Cloud Shuttle and DataEngBytes — ANZ's largest data engineering community conference.