

LLM-as-a-Judge 102: Meta Evaluation
This session builds on LLM-as-a-Judge 101 and focuses on the essential practice of meta-evaluation: evaluating your evaluator to ensure your metrics are meaningful and trustworthy.
You’ll learn how to validate whether your LLM judge is measuring the right thing, how closely it aligns with human judgment, and how to identify where it fails. We’ll walk through comparing LLM and human annotations on a curated golden dataset, calculating precision, recall, and F1, and inspecting disagreement cases to understand where and why your evaluator struggles.
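The golden-dataset comparison fits in a few lines of plain Python. The sketch below assumes binary pass/fail labels and illustrative field names (`human_label`, `judge_label`, `id`); it treats the human label as ground truth and returns the disagreement cases for manual review:

```python
from typing import Dict, List

def judge_vs_human_report(records: List[Dict]) -> Dict:
    """Compare LLM-judge labels against human golden labels (binary pass/fail).

    Each record is assumed to look like:
        {"id": "ex-42", "human_label": 1, "judge_label": 0, "text": "..."}
    where 1 = pass and 0 = fail. The human label is treated as ground truth
    and the judge as the predictor.
    """
    tp = sum(1 for r in records if r["judge_label"] == 1 and r["human_label"] == 1)
    fp = sum(1 for r in records if r["judge_label"] == 1 and r["human_label"] == 0)
    fn = sum(1 for r in records if r["judge_label"] == 0 and r["human_label"] == 1)
    tn = sum(1 for r in records if r["judge_label"] == 0 and r["human_label"] == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # The disagreements are the cases worth reading by hand: they show *why*
    # the judge diverges from human expectations, not just how often.
    disagreements = [r for r in records if r["judge_label"] != r["human_label"]]

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / len(records) if records else 0.0,
        "disagreements": disagreements,
    }
```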
We’ll also cover advanced techniques such as treating humans and LLMs as annotators to estimate inter-annotator agreement, and using high-temperature stress tests to detect prompt ambiguity or unstable reasoning.
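As a rough sketch of both ideas: Cohen’s kappa measures chance-corrected agreement when you treat the human and the LLM as two annotators of the same examples, and a high-temperature stress test re-runs the judge several times per example to see how often its label flips. The `judge` callable and its signature below are hypothetical stand-ins for your own judge call:

```python
import collections
from typing import Callable, List, Sequence

def cohen_kappa(labels_a: Sequence, labels_b: Sequence) -> float:
    """Chance-corrected agreement between two annotators (e.g. human vs. LLM judge)
    on the same examples. Labels can be any hashable categories."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labelled at random according to
    # its own marginal label frequencies.
    counts_a = collections.Counter(labels_a)
    counts_b = collections.Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

def stability_under_high_temperature(
    judge: Callable[[str, float], int],  # hypothetical: (example, temperature) -> label
    examples: List[str],
    temperature: float = 1.0,
    n_samples: int = 5,
) -> List[dict]:
    """Re-run the judge several times per example at a high temperature and
    record how often the modal label wins. Low consistency usually points to
    an ambiguous prompt or unstable reasoning rather than a hard example."""
    report = []
    for ex in examples:
        labels = [judge(ex, temperature) for _ in range(n_samples)]
        _, mode_count = collections.Counter(labels).most_common(1)[0]
        report.append({"example": ex, "labels": labels,
                       "consistency": mode_count / n_samples})
    return report
```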
Finally, you’ll learn how to use these insights to iteratively refine your evaluation—adjusting prompts, criteria, or data coverage one change at a time—until you’re confident your eval reflects human expectations.
This session will leave you with practical tools for building evals you can trust.