Presented by
Tokyo AI (TAI)

A vLLM Deep Dive: Distributed Inference, KV Cache Evolution, and Model Compression

Minato City, Tokyo
Registration
Approval Required
Your registration is subject to host approval.
Welcome! To join the event, please register below.
About Event

vLLM Community Night – Tokyo 🇯🇵

This technical gathering brings together the core contributors and lead engineers behind some of the most efficient LLM deployments in the industry. The session focuses on the evolution of vLLM, exploring its internal mechanics such as KV cache optimization and model compression. We'll discuss real-world production challenges like multi-cloud scaling and heterogeneous hardware tuning.

Designed for a technical audience of researchers and engineers, the evening aims to provide insights into the future of high-throughput, low-latency inference.

β€‹πŸ“ Roppongi-Itchome, Tokyo
πŸ“… Friday, 24 April 2026
πŸ•• 6:00 PM – 9:00 PM (light dinner + networking included!)


🌟 What's in Store

⚡ Deep Dive into vLLM

Get insider insights straight from a core vLLM contributor on how vLLM is powering the next generation of LLM performance. Whether you're already running vLLM in prod or just getting started, you'll walk away with something new.

We begin with the project update to establish the current state of vLLM. We then move into the low-level technical optimizations (KV cache and compression), which serve as the foundation for deployment. We conclude with system-level scaling and a production post-mortem, moving from core components to large-scale operational reality.

​πŸ”₯ Talk 1 - Intro to vLLM and Project Update - Tun Jian, Tan (Committer, vLLM, AI Engineer, Embedded LLM)

​Kick off the evening with insider insights from vLLM maintainer on how vLLM is powering the next generation of high-performance LLM serving. ​Whether you are just getting started or already running models in production, discover the latest project updates, new features, and the engine's future roadmap.

​πŸ”₯ Talk 2 - Evolution of the KV Cache in vLLM - Tony Valderrama (Head of Product, Momento)

​Trace the evolution of the KV Cache from a simple optimization into a distinct, fully-distributed component of inference systems. Learn about current state-of-the-art solutions, like LMCache and Mooncake, as we lay out a roadmap for incremental adoption at scale.

​πŸ”₯ Talk 3 - Practical AI Model Compression with OneComp - Yuma Ichikawa (Senior Research Manager, Fujitsu)

​This talk introduces OneComp, an open-source framework for practical post-training compression of generative AI models. I will cover how it automates model inspection, mixed-precision planning, and progressive quantization to make deployment more efficient while maintaining model quality.Β 

​πŸ”₯ Talk 4 - Distributed Inference with vLLM on AWS - Toshinobu Akazawa (Solutions Architect, AWS)

​This session explores architectures for deploying efficient distributed LLM inference using vLLM on AWS. I will first discuss ML infrastructure options, such as Amazon SageMaker HyperPod and AWS ParallelCluster, along with the role of EFA/SRD networking in achieving low-latency GPU communication. The core of the session focuses on Prefill-Decode Disaggregated Inference in vLLM.

​πŸ”₯ Talk 5 - vLLM in Production: From Quants to QPS - Leonard Lin (CTO, Shisa.AI)

​Hear from the CTO of Shisa.AI, running all-Japan production inference across diverse model types, heterogeneous hardware, and multicloud infrastructure. This session goes deep β€” benchmarks, evals, quality vs. performance tradeoffs, hardware-specific tricks, and how they tune their serving architecture for real-world QPS demands. No fluff, all signal.

​πŸ₯‚ Networking, Food & Drinks

​Wind down with fellow AI builders and researchers over good food and even better conversations. Whether you're looking for collaborators, hiring, or just want to geek out β€” this is your time.


​πŸ‘₯ Who Should Come?

​AI/ML engineers, LLM researchers, infrastructure & platform builders, AI product folks, open-source contributors and anyone curious about where AI is heading next.


Agenda

18:00: Doors Open & Registration

18:30: vLLM Core Update (Tun Jian, Tan)

18:45: KV Cache Evolution (Tony Valderrama)

19:05: Model Compression with OneComp (Yuma Ichikawa)

19:25: Distributed Inference on AWS (Toshinobu Akazawa)

19:45: vLLM in Production Post-Mortem (Leonard Lin)

20:05: Networking, Food & Drinks

Organizers

Ilya Kulyatin is an entrepreneur with work and academic experience in the US, Netherlands, Singapore, UK, and Japan. He holds a BA in Economics, an MA in Finance, and an MSc in Machine Learning. He's a 3x founder, now helping Japan grow the local AI ecosystem through a not-for-profit community, Tokyo AI (TAI), while building an AI-native system integrator and solutions provider, Foundry Labs株式会社.

Jiaqi Lim is the Marketing Lead at Embedded LLM, blending a strong technical foundation in Computer Science with a deep passion for community building. Thriving on the end-to-end process of bringing an event to life, she excels at turning detailed planning into seamless, successful experiences. The ultimate driving force behind Jiaqi's work is creating collaborative spaces where individuals from diverse backgrounds can gather to connect, exchange perspectives, and share valuable knowledge and experiences.

Supporters

Tokyo AI (TAI) is the biggest AI community in Japan, with 4,000+ members mainly based in Tokyo (engineers, researchers, investors, product managers, and corporate innovation managers).

Embedded LLM is an AI infrastructure company with teams in Singapore, Taiwan, and Vilnius, Lithuania. The company is a leading contributor to vLLM, the world's most widely deployed open-source LLM inference engine, and builds TokenVisor, the commercial platform that turns GPU infrastructure into a metered, governed AI service for enterprises and governments.

Privacy Policy

We will process your email address for the purposes of event-related communications and ongoing newsletter communications. You may unsubscribe from the newsletter at any time. Further details on how we process personal data are available in our Privacy Policy.
