

A vLLM Deep Dive: Distributed Inference, KV Cache Evolution, and Model Compression
vLLM Community Night – Tokyo 🇯🇵
This technical gathering brings together the core contributors and lead engineers behind some of the most efficient LLM deployments in the industry. The session focuses on the evolution of vLLM, exploring its internal mechanics such as KV cache optimization and model compression. We'll discuss real-world production challenges like multi-cloud scaling and heterogeneous hardware tuning.
Designed for a technical audience of researchers and engineers, the evening aims to provide insights into the future of high-throughput, low-latency inference.
📍 Roppongi-Itchome, Tokyo
📅 Friday, 24 April 2026
🕕 6:00 PM – 9:00 PM (light dinner + networking included!)
🚀 What's in Store
⚡ Deep Dive into vLLM
Get insider insights straight from a core vLLM contributor on how vLLM is powering the next generation of LLM performance. Whether you're already running vLLM in prod or just getting started, you'll walk away with something new.
We begin with the project update to establish the current state of vLLM, then move into the low-level technical optimizations (KV cache and compression) that serve as the foundation for deployment, and conclude with system-level scaling and a comprehensive production post-mortem, moving from core components to large-scale operational reality.
🎥 Talk 1 - Intro to vLLM and Project Update - Tun Jian, Tan (vLLM Committer; AI Engineer, Embedded LLM)
βKick off the evening with insider insights from vLLM maintainer on how vLLM is powering the next generation of high-performance LLM serving. βWhether you are just getting started or already running models in production, discover the latest project updates, new features, and the engine's future roadmap.
🎥 Talk 2 - Evolution of the KV Cache in vLLM - Tony Valderrama (Head of Product, Momento)
Trace the evolution of the KV cache from a simple optimization into a distinct, fully distributed component of inference systems. Learn about current state-of-the-art solutions, like LMCache and Mooncake, as we lay out a roadmap for incremental adoption at scale.
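As background, vLLM already exposes several KV-cache knobs through its engine arguments. A minimal sketch, assuming a recent vLLM release (exact flags vary by version, and LMCache/Mooncake integrate through separate KV-connector configuration not shown here):

```python
from vllm import LLM

# KV-cache-related engine arguments in vLLM (availability varies by release).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    enable_prefix_caching=True,    # reuse cached KV blocks across requests sharing a prefix
    kv_cache_dtype="fp8",          # store the cache in 8-bit floats to fit more tokens per GPU
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim for weights + cache
)
```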
🎥 Talk 3 - Practical AI Model Compression with OneComp - Yuma Ichikawa (Senior Research Manager, Fujitsu)
This talk introduces OneComp, an open-source framework for practical post-training compression of generative AI models. I will cover how it automates model inspection, mixed-precision planning, and progressive quantization to make deployment more efficient while maintaining model quality.
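OneComp's own API is not reproduced here. As background for the talk, the sketch below shows the per-output-channel int8 weight quantization that post-training compression schemes commonly build on, in plain PyTorch with arbitrary tensor shapes:

```python
import torch

def quantize_per_channel_int8(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization: w ~ q * scale."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 11008)             # e.g. one MLP projection weight matrix
q, scale = quantize_per_channel_int8(w)
w_hat = q.float() * scale                # dequantized reconstruction
print(f"mean abs reconstruction error: {(w - w_hat).abs().mean():.5f}")
```

Mixed-precision planning then reduces to picking a bit width per layer, guided by reconstruction errors like the one printed above.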
🎥 Talk 4 - Distributed Inference with vLLM on AWS - Toshinobu Akazawa (Solutions Architect, AWS)
This session explores architectures for deploying efficient distributed LLM inference using vLLM on AWS. I will first discuss ML infrastructure options, such as Amazon SageMaker HyperPod and AWS ParallelCluster, along with the role of EFA/SRD networking in achieving low-latency GPU communication. The core of the session focuses on Prefill-Decode Disaggregated Inference in vLLM.
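Before disaggregation enters the picture, single-node distributed inference in vLLM is a one-argument change. A minimal sketch, assuming an 8-GPU node (model id and GPU count are placeholders; prefill-decode disaggregation is configured separately and not shown):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism shards every layer across 8 GPUs on one node.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model id
    tensor_parallel_size=8,                      # number of GPUs to shard across
)
out = llm.generate(["Hello from Tokyo!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```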
🎥 Talk 5 - vLLM in Production: From Quants to QPS - Leonard Lin (CTO, Shisa.AI)
Hear from the CTO of Shisa.AI, running all-Japan production inference across diverse model types, heterogeneous hardware, and multicloud infrastructure. This session goes deep: benchmarks, evals, quality vs. performance tradeoffs, hardware-specific tricks, and how they tune their serving architecture for real-world QPS demands. No fluff, all signal.
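To make the QPS framing concrete, here is a rough throughput micro-benchmark sketch; it is illustrative only, not Shisa.AI's harness, and the model id, batch size, and token counts are arbitrary choices:

```python
import time
from vllm import LLM, SamplingParams

# Rough throughput micro-benchmark (illustrative only).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")    # placeholder model id
prompts = ["Summarize vLLM in one sentence."] * 64
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts) / elapsed:.1f} req/s, {gen_tokens / elapsed:.0f} output tok/s")
```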
🔥 Networking, Food & Drinks
Wind down with fellow AI builders and researchers over good food and even better conversations. Whether you're looking for collaborators, hiring, or just here to geek out, this is your time.
🔥 Who Should Come?
AI/ML engineers, LLM researchers, infrastructure & platform builders, AI product folks, open-source contributors, and anyone curious about where AI is heading next.
Agenda
18:00: Doors Open & Registration
18:30: vLLM Core Update (Tun Jian, Tan)
18:45: KV Cache Evolution (Tony Valderrama)
19:05: Model Compression with OneComp (Yuma Ichikawa)
19:25: Distributed Inference on AWS (Toshinobu Akazawa)
19:45: vLLM in Production Post-Mortem (Leonard Lin)
20:05: Networking, Food & Drinks
Organizers
Ilya Kulyatin is an entrepreneur with work and academic experience in the US, Netherlands, Singapore, UK, and Japan. He holds a BA in Economics, an MA in Finance, and an MSc in Machine Learning. He's a 3x founder, now helping Japan grow the local AI ecosystem through a not-for-profit community, Tokyo AI (TAI), while building an AI-native system integrator and solutions provider, Foundry Labs株式会社.
Jiaqi Lim is the Marketing Lead at Embedded LLM, blending a strong technical foundation in Computer Science with a deep passion for community building. Thriving on the end-to-end process of bringing an event to life, she excels at turning detailed planning into seamless, successful experiences. The ultimate driving force behind Jiaqi's work is creating collaborative spaces where individuals from diverse backgrounds can gather to connect, exchange perspectives, and share valuable knowledge and experiences.
Supporters
Tokyo AI (TAI) is the biggest AI community in Japan, with 4,000+ members mainly based in Tokyo (engineers, researchers, investors, product managers, and corporate innovation managers).
Embedded LLM is an AI infrastructure company with teams in Singapore, Taiwan, and Vilnius, Lithuania. The company is a leading contributor to vLLM, the world's most widely deployed open-source LLM inference engine, and builds TokenVisor, the commercial platform that turns GPU infrastructure into a metered, governed AI service for enterprises and governments.
Privacy Policy
We will process your email address for event-related communications and our ongoing newsletter. You may unsubscribe from the newsletter at any time. Further details on how we process personal data are available in our Privacy Policy.