Presented by
Arize AI
Generative AI-focused workshops, hackathons, and more. Come build with us!

LLM-as-a-Judge 102: Meta Evaluation

Zoom
About Event

This session builds on LLM-as-a-Judge 101 and focuses on the essential practice of meta-evaluation: evaluating your evaluator to ensure your metrics are meaningful and trustworthy.

You’ll learn how to validate that your LLM judge is measuring the right thing, how to quantify its alignment with human judgment, and how to identify where it fails. We’ll walk through comparing LLM and human annotations on a curated golden dataset, calculating precision, recall, and F1, and inspecting disagreement cases to understand why your evaluator struggles.
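To give a feel for what this looks like in practice, here is a minimal sketch of the core comparison. It assumes you have already collected human and judge labels on a small golden dataset; the column names and example rows are illustrative, and the metrics come from scikit-learn.

```python
# Minimal sketch: compare LLM-judge labels to human labels on a golden dataset.
# Column names ("human_label", "judge_label") and the data are illustrative only.
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

golden = pd.DataFrame({
    "input":       ["q1", "q2", "q3", "q4"],
    "human_label": [1, 0, 1, 1],   # 1 = passes the criterion per the human annotator
    "judge_label": [1, 0, 0, 1],   # label produced by the LLM judge
})

precision = precision_score(golden["human_label"], golden["judge_label"])
recall    = recall_score(golden["human_label"], golden["judge_label"])
f1        = f1_score(golden["human_label"], golden["judge_label"])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Inspect disagreement cases to understand where and why the judge struggles.
disagreements = golden[golden["human_label"] != golden["judge_label"]]
print(disagreements)
```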

We’ll also cover advanced techniques such as treating humans and LLMs as annotators to estimate inter-annotator agreement, and using high-temperature stress tests to detect prompt ambiguity or unstable reasoning.
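As a rough illustration of both ideas, the sketch below measures agreement between the human and the LLM judge with Cohen's kappa (via scikit-learn) and outlines a simple high-temperature stability check. The run_judge function is a hypothetical placeholder for your own judge call, not a specific API.

```python
# Sketch of two meta-evaluation checks. cohen_kappa_score is from scikit-learn;
# run_judge() is a hypothetical placeholder for calling your LLM judge.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# 1) Treat the human and the LLM judge as two annotators and measure agreement
#    beyond chance (kappa of 1.0 = perfect agreement, 0.0 = chance-level).
human_labels = [1, 0, 1, 1, 0, 1]
judge_labels = [1, 0, 0, 1, 0, 1]
print("Cohen's kappa:", cohen_kappa_score(human_labels, judge_labels))

# 2) High-temperature stress test: re-run the judge several times on the same
#    example at a high temperature; unstable labels suggest prompt ambiguity
#    or shaky reasoning.
def run_judge(example: str, temperature: float) -> int:
    """Placeholder: call your LLM judge and return its label (0 or 1)."""
    raise NotImplementedError

def label_stability(example: str, n_runs: int = 10, temperature: float = 1.0) -> float:
    labels = [run_judge(example, temperature) for _ in range(n_runs)]
    # Fraction of runs agreeing with the most common label; low values flag instability.
    return Counter(labels).most_common(1)[0][1] / n_runs
```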

Finally, you’ll learn how to use these insights to iteratively refine your evaluation—adjusting prompts, criteria, or data coverage one change at a time—until you’re confident your eval reflects human expectations.


This session will leave you with practical tools for building evals you can trust.
