NICE Academy: Analyzing Uncertainty of LLM-as-a-Judge

Hosted by NICE AI Talk

YouTube

About Event

Join us for a talk at NICE this Saturday!

The talk covers a hot and controversial topic: LLM-as-a-judge.

YouTube Livestream link: https://youtube.com/live/gOHEjeV-Dog

Speaker: Huanxin Sheng
Huanxin Sheng is a second-year PhD student in Computer Science at the University of Rochester, advised by Prof. Jian Kang. His current research focuses on the reliable evaluation of LLMs.

Personal website: https://brucesheng1202.github.io/

Abstract:
LLMs-as-judges have been increasingly adopted across a wide range of evaluation scenarios due to their efficiency and adaptability. However, their reliability is frequently questioned because of inherent biases and uncertainty. Even with the same prompt, the scores can be inconsistent. In pointwise scoring, how can we quantify the uncertainty of LLM judges so as to obtain more reliable evaluations?

To address this question, we analyze the uncertainty in LLM-judge scoring through conformal prediction, and propose an ordinal boundary adjustment strategy to improve the efficiency and coverage of the resulting prediction intervals. We further suggest using the midpoint of these intervals as a low-bias alternative to the raw scores provided by LLM judges. In addition, we investigate the effectiveness of reprompting LLM judges using the constructed intervals.
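For readers new to conformal prediction, the general idea can be sketched in a few lines of Python. This is a minimal illustration of standard split conformal prediction with absolute residuals, not the speaker's actual method: the calibration pairs are made-up data, the function names are hypothetical, the 1-5 rating scale is an assumption, and snapping the interval outward to integer ratings is only a simple stand-in for the ordinal boundary adjustment mentioned above. The judge scores are assumed to be real-valued, for example averaged over several samples.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    # Rank of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]

def prediction_interval(judge_score, q, lo=1, hi=5):
    """Interval around a judge score, snapped outward to integer ratings.

    Snapping outward only widens the interval, so it cannot reduce coverage.
    The rating scale [lo, hi] is an assumption for this sketch.
    """
    lower = max(lo, math.floor(judge_score - q))
    upper = min(hi, math.ceil(judge_score + q))
    return lower, upper

def midpoint(interval):
    """Midpoint of the interval, usable as an alternative point score."""
    return (interval[0] + interval[1]) / 2

# Hypothetical calibration data: (judge score, human score) pairs.
calibration = [(4.0, 4), (3.5, 3), (2.0, 3), (5.0, 4), (4.5, 5),
               (1.5, 1), (3.0, 3), (4.0, 5), (2.5, 2), (3.5, 4)]
residuals = [abs(judge - human) for judge, human in calibration]

q = conformal_quantile(residuals, alpha=0.2)  # target 80% coverage
interval = prediction_interval(3.4, q)        # new judge score of 3.4
print(q, interval, midpoint(interval))
```

With only ten calibration points the interval is wide; in practice a larger calibration set with human labels would be needed before the intervals become informative.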

Overall, this work advocates a shift from relying solely on direct scoring toward uncertainty-aware evaluation, offering a reliable reference for downstream decision-making.

Our Host: Haolun Wu

Haolun is a 4th-year PhD candidate at Mila & McGill and a visiting scholar at Stanford. His research interests include trustworthy AI / LLMs, information retrieval, personalization, human-AI alignment, and AI for education. He has interned at Microsoft Research, Google, and DeepMind; his work has been deployed in MSR Alexandria knowledge base construction and applied to the Google Shopping recommendation platform. He has published in top venues across several areas (e.g., NeurIPS, ICML, ICLR, EMNLP, SIGIR, WWW, CHI, CSCW, TMLR, TKDE) and serves as a reviewer.

Personal Website: https://haolun-wu.github.io/
