

LLM-as-a-Judge 102: Meta Evaluation
This session builds on LLM-as-a-Judge 101 and focuses on the essential practice of meta-evaluation: evaluating your evaluator to ensure your metrics are meaningful and trustworthy.
You’ll learn how to validate whether your LLM judge is measuring the right thing, how closely it aligns with human judgment, and how to identify where it fails. We’ll walk through comparing LLM and human annotations on a curated golden dataset, calculating precision, recall, and F1, and inspecting disagreement cases to understand where and why your evaluator struggles.
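The golden-dataset comparison fits in a few lines of plain Python. The sketch below assumes binary pass/fail labels and illustrative field names (`human_label`, `judge_label`, `id`); it treats the human label as ground truth and returns the disagreement cases for manual review:

```python
from typing import Dict, List

def judge_vs_human_report(records: List[Dict]) -> Dict:
    """Compare LLM-judge labels against human golden labels (binary pass/fail).

    Each record is assumed to look like:
        {"id": "ex-42", "human_label": 1, "judge_label": 0, "text": "..."}
    where 1 = pass and 0 = fail. The human label is treated as ground truth
    and the judge as the predictor.
    """
    tp = sum(1 for r in records if r["judge_label"] == 1 and r["human_label"] == 1)
    fp = sum(1 for r in records if r["judge_label"] == 1 and r["human_label"] == 0)
    fn = sum(1 for r in records if r["judge_label"] == 0 and r["human_label"] == 1)
    tn = sum(1 for r in records if r["judge_label"] == 0 and r["human_label"] == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # The disagreements are the cases worth reading by hand: they show *why*
    # the judge diverges from human expectations, not just how often.
    disagreements = [r for r in records if r["judge_label"] != r["human_label"]]

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / len(records) if records else 0.0,
        "disagreements": disagreements,
    }
```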
We’ll also cover advanced techniques such as treating humans and LLMs as annotators to estimate inter-annotator agreement, and using high-temperature stress tests to detect prompt ambiguity or unstable reasoning.
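As a rough sketch of both ideas: Cohen’s kappa measures chance-corrected agreement when you treat the human and the LLM as two annotators of the same examples, and a high-temperature stress test re-runs the judge several times per example to see how often its label flips. The `judge` callable and its signature below are hypothetical stand-ins for your own judge call:

```python
import collections
from typing import Callable, List, Sequence

def cohen_kappa(labels_a: Sequence, labels_b: Sequence) -> float:
    """Chance-corrected agreement between two annotators (e.g. human vs. LLM judge)
    on the same examples. Labels can be any hashable categories."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labelled at random according to
    # its own marginal label frequencies.
    counts_a = collections.Counter(labels_a)
    counts_b = collections.Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

def stability_under_high_temperature(
    judge: Callable[[str, float], int],  # hypothetical: (example, temperature) -> label
    examples: List[str],
    temperature: float = 1.0,
    n_samples: int = 5,
) -> List[dict]:
    """Re-run the judge several times per example at a high temperature and
    record how often the modal label wins. Low consistency usually points to
    an ambiguous prompt or unstable reasoning rather than a hard example."""
    report = []
    for ex in examples:
        labels = [judge(ex, temperature) for _ in range(n_samples)]
        _, mode_count = collections.Counter(labels).most_common(1)[0]
        report.append({"example": ex, "labels": labels,
                       "consistency": mode_count / n_samples})
    return report
```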
Finally, you’ll learn how to use these insights to iteratively refine your evaluation—adjusting prompts, criteria, or data coverage one change at a time—until you’re confident your eval reflects human expectations.
This session will leave you with practical tools for building evals you can trust.