In person
A vLLM Deep Dive: Distributed Inference, KV Cache Evolution, and Model Compression
Tokyo vLLM Community Night: five deep-dive talks on distributed inference, KV cache, model compression, and production serving.
- When: Fri, April 24, 2026 · 18:00–21:00 JST
- Where: Minato City, Tokyo
- Region: Kanto (Tokyo)
- Organizer: Tokyo AI
- Language: EN
- Source: Luma
Summary
vLLM Community Night — Tokyo brings together core vLLM contributors and production engineers for a technical evening focused on high-throughput, low-latency LLM inference. Hosted in Roppongi-Itchome, the event features five talks covering the current state of vLLM, the evolution of the KV cache as a distributed component, practical model compression via Fujitsu's OneComp framework, distributed inference on AWS (SageMaker HyperPod, ParallelCluster, EFA/SRD networking, prefill-decode disaggregation), and an in-the-trenches production post-mortem from Shisa.AI.
Speakers include Tun Jian Tan (vLLM committer, Embedded LLM), Yuma Ichikawa (Fujitsu), Toshinobu Akazawa (AWS), Tony Valderrama (Momento), and Leonard Lin (Shisa.AI). The agenda pairs low-level optimization topics with system-level scaling, rounded out by real-world benchmarks, evals, and hardware-specific tuning tricks.
Light dinner and networking are included. The event is aimed at AI/ML engineers, LLM researchers, infrastructure and platform builders, AI product folks, and open-source contributors interested in the future of LLM serving.
About the community
The event is organized by Ilya Kulyatin (Tokyo AI / Foundry Labs) and Jiaqi Lim (Embedded LLM) and hosted by Tokyo AI (TAI), Japan's largest AI community with 4,000+ members, together with Embedded LLM, an AI infrastructure company and leading vLLM contributor that also builds TokenVisor, a platform for metered, governed GPU services.
#vllm #llm-inference #kv-cache #model-compression #distributed-inference #ai-infrastructure #tokyo-ai #open-source #aws #mlops