
Scalable Inference Algorithms for Large Language Models

Hosted by Daniel Kang
Past Event
About Event

Abstract
Inference efficiency is a key bottleneck in deploying large language models (LLMs) at scale, especially for applications that require long-context understanding or test-time scaling for improved reasoning.

In this seminar, Woomin Song (KAIST) will present two training-free inference frameworks that significantly reduce latency and memory costs while remaining fully compatible with existing models:

  1. REFORM (NeurIPS 2025): Enables efficient long-context inference by extending usable context length far beyond pretraining limits. It achieves high accuracy with reduced compute and memory overhead.

  2. STAND (EMNLP 2025): Accelerates test-time scaling methods (e.g., best-of-N sampling, tree search) through model-free speculative decoding, delivering substantial speedups without sacrificing accuracy.

Together, these approaches demonstrate how rethinking inference—rather than retraining or scaling models—can deliver practical gains in performance, cost, and deployability for real-world LLM systems.
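As background for the speculative decoding idea mentioned above, the toy sketch below shows the generic draft-then-verify loop: a cheap draft proposes several tokens, the target verifies them, and the output is guaranteed to match plain greedy decoding from the target. This is an illustrative assumption about speculative decoding in general, not STAND's actual model-free mechanism; both "models" are deterministic toy functions invented for the example.

```python
def target_next(tokens):
    # Hypothetical "target model": next token is (sum of context) % 7.
    return sum(tokens) % 7

def draft_next(tokens):
    # Hypothetical cheap "draft": only sees the last 3 tokens,
    # so it usually agrees with the target but sometimes diverges.
    return sum(tokens[-3:]) % 7

def speculative_generate(prompt, n_new, k=4):
    """Greedily generate n_new tokens, verifying k-token drafts per round."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify with the target: accept the longest matching prefix.
        #    (A real system scores all k positions in one forward pass.)
        accepted, ctx = [], list(tokens)
        for t in draft:
            correct = target_next(ctx)
            if t == correct:
                accepted.append(t)
                ctx.append(t)
            else:
                # On a mismatch, keep the target's token and stop this round,
                # so output always equals plain greedy target decoding.
                accepted.append(correct)
                ctx.append(correct)
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]
```

Because every accepted token is exactly what the target would have produced, the speedup comes purely from verifying several drafted tokens per target pass, with no change to the output.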


Speaker Bio
Woomin Song is a Ph.D. student at KAIST AI, advised by Prof. Jinwoo Shin. His research focuses on building efficient machine learning systems, specifically targeting the reduction of inference costs for Large Language Models (LLMs).

He previously worked as an Applied Scientist Intern at Amazon AGI. He holds a B.S. in Electrical Engineering and Computer Science (double major) with a minor in Mathematics from KAIST (2022). His recent work on architectural modifications for computational efficiency has been accepted at top-tier conferences including NeurIPS and EMNLP.

Location
NS Library