In person
A vLLM Deep Dive: Distributed Inference, KV Cache Evolution, and Model Compression
Tokyo vLLM Community Night: five deep-dive talks on distributed inference, KV cache, model compression, and production serving.
- When: Fri, April 24, 2026 · 18:00–21:00 JST
- Where: Minato City, Tokyo
- Region: Kanto (Tokyo)
- Organizer: Tokyo AI
- Language: EN
- Source: Luma
Summary
vLLM Community Night — Tokyo brings together core vLLM contributors and production engineers for a technical evening focused on high-throughput, low-latency LLM inference. Hosted in Roppongi-Itchome, the event features five talks covering the current state of vLLM, the evolution of the KV cache as a distributed component, practical model compression via Fujitsu's OneComp framework, distributed inference on AWS (SageMaker HyperPod, ParallelCluster, EFA/SRD networking, prefill-decode disaggregation), and an in-the-trenches production post-mortem from Shisa.AI.
Speakers include Tun Jian Tan (vLLM committer, Embedded LLM), Yuma Ichikawa (Fujitsu), Toshinobu Akazawa (AWS), Tony Valderrama (Momento), and Leonard Lin (Shisa.AI). The agenda pairs low-level optimization topics with system-level scaling, rounded out by real-world benchmarks, evals, and hardware-specific tuning tricks.
Light dinner and networking are included. The event is aimed at AI/ML engineers, LLM researchers, infrastructure and platform builders, AI product folks, and open-source contributors interested in the future of LLM serving.
About the community
The event is organized by Ilya Kulyatin (Tokyo AI / Foundry Labs) and Jiaqi Lim (Embedded LLM) and hosted by Tokyo AI (TAI), Japan's largest AI community with 4,000+ members, together with Embedded LLM, an AI infrastructure company and leading vLLM contributor that also builds TokenVisor, a platform for metered, governed GPU services.
#vllm #llm-inference #kv-cache #model-compression #distributed-inference #ai-infrastructure #tokyo-ai #open-source #aws #mlops