Scaling Medical Evaluation of LLM Summaries: From PDSQI-9 to LLM-as-a-Judge

Hosted by Health AI Partnership (HAIP)
Zoom
About Event

Health AI Hub Collaboratory Rounds bring together experts and practitioners to share and explore various aspects of AI lifecycle management in healthcare. These monthly gatherings are designed to keep you informed about the latest trends, best practices, and challenges in the dynamic field of healthcare AI.

Scaling Medical Evaluation of LLM Summaries: From PDSQI-9 to LLM-as-a-Judge

Electronic Health Records (EHRs) contain vast amounts of clinical data, yet providers often struggle to distill this information into clear, actionable insights. Large Language Models (LLMs) promise automated summarization that reduces cognitive load, but ensuring the accuracy, safety, and reliability of these outputs is essential for clinical use. In collaboration with Epic, our team developed and validated the Provider Documentation Summarization Quality Instrument (PDSQI-9), a structured rubric for expert medical evaluation of LLM-generated summaries.

While human experts remain the gold standard for evaluation, expert review is resource-intensive and difficult to scale across real-world settings. To address this challenge, we introduce LLM-as-a-Judge, an automated evaluation framework benchmarked directly against PDSQI-9. Our results demonstrate that LLMs can achieve high inter-rater reliability with human evaluators while completing evaluations in seconds, enabling rapid, scalable quality assurance of AI outputs.
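
For readers unfamiliar with inter-rater reliability, the sketch below shows one common way agreement between a human rater and an LLM judge on ordinal rubric scores could be quantified: quadratic weighted Cohen's kappa. This is an illustration only. The abstract does not state which reliability statistic the speakers used, and the 1 to 5 score range, the function name, and the toy scores are all assumptions made for the example.

```python
import math
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Quadratic weighted Cohen's kappa for two raters on an ordinal scale.

    Assumption: scores are integers in [min_score, max_score]; the 1-5
    default is illustrative and not taken from the event description.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items")
    n = len(rater_a)
    span = max_score - min_score
    if span < 1:
        raise ValueError("The scale needs at least two score levels")

    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Penalty grows with the squared distance between the two scores.
            weight = ((i - j) ** 2) / (span ** 2)
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (hist_a[i] / n) * (hist_b[j] / n)

    if expected_disagreement == 0:
        # Both raters gave one identical score to every item: kappa is undefined.
        return math.nan
    return 1.0 - observed_disagreement / expected_disagreement


# Toy, made-up scores for six summaries on one rubric item (not real data).
human_scores = [5, 4, 4, 2, 3, 5]
llm_scores = [5, 4, 3, 2, 3, 4]
print(quadratic_weighted_kappa(human_scores, llm_scores))
```

A value near 1 indicates close agreement, a value near 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Quadratic weights penalize a 1-versus-5 disagreement far more than a 4-versus-5 disagreement, which suits ordinal rubric scores.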

Speakers

Brian Patterson

Brian Patterson, MD, MPH is a tenured Associate Professor in the BerbeeWalsh Department of Emergency Medicine, as well as the inaugural Medical Informatics Director for Predictive Analytics and Clinical Decision Support at UW Health. In this role, Dr. Patterson works with hospital leadership to best use information technology to support clinical, educational, and research priorities. He advises on a range of issues, including optimizing the design, implementation, dissemination, evaluation, and routine use of clinical decision support and AI to improve healthcare quality, operational efficiency, educational programs, and research. 

Dr. Patterson’s research interests are in clinical informatics and geriatric emergency medicine. His work aims to use routinely collected clinical data to generate actionable insights that improve the quality and safety of emergency care for older adults. To achieve these goals, Dr. Patterson collaborates with investigators from the business and engineering schools as well as the Department of Biostatistics and Medical Informatics.

Dr. Patterson graduated from Pennsylvania State University with a bachelor of science in Bioengineering and went on to complete a master’s degree in public health and a medical degree at the Northwestern University Feinberg School of Medicine. He continued at Northwestern for his residency training in emergency medicine, where he served as chief resident.

Majid Afshar

Majid Afshar, MD, MS is a tenured Associate Professor in the Departments of Medicine and Biostatistics & Medical Informatics at the University of Wisconsin–Madison, where he co-leads the Critical Care Data Science Lab. He also serves as Director of the Learning Health System in the School of Medicine and Public Health, focusing on AI-driven interventions that directly improve healthcare delivery.

Dr. Afshar has helped build a joint university–health system program to evaluate AI at the bedside, integrating clinical trials with health operations. He guided one of the first real-time large language model pipelines for clinical decision support and is leading initiatives to responsibly deploy generative AI technologies in health operations.

As a physician-scientist, his research centers on early disease detection and prevention using clinical translational natural language processing and predictive analytics with electronic health record data, particularly for critically ill patients. He has organized national data challenges to advance diagnostic decision support, serves as a chartered NIH study section member for AI grants, and has co-chaired national informatics conferences.

Dr. Afshar has mentored numerous trainees to successful data science career development awards, co-authored the TRIPOD-LLM guidelines for reporting LLM-based research, and published in high-impact journals including Nature Medicine and NEJM AI.

Emma Croxford

Emma Croxford is a PhD student in Biomedical Data Science at the University of Wisconsin–Madison, where she is mentored by Dr. Majid Afshar. Her research focuses on methods for evaluating large language models in clinical settings, with particular attention to ensuring that generated documentation is accurate, clear, and useful for healthcare providers. She is interested in developing evaluation approaches that balance clinical safety with efficiency, supporting the responsible and trustworthy integration of generative AI into healthcare practice.