LMCache — High-performance KV Cache Layer for LLM Serving
Overview
LMCache is an open-source caching layer specifically built to speed up large language model (LLM) serving. It focuses on reusing computed KV caches (key/value activations) across different storage tiers (GPU, CPU DRAM, local disk) and across serving instances to reduce redundant computation, lower time-to-first-token (TTFT), and increase overall throughput. LMCache is designed to be engine-agnostic but offers deep integration with vLLM for optimized performance.
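As a mental model only (not LMCache's actual API), the core idea can be sketched as a lookup that walks the cache tiers from fastest to slowest; all class and method names below are hypothetical:

```python
# Toy illustration of tiered KV-cache lookup; names are illustrative, not LMCache's API.
class ToyTieredKVCache:
    def __init__(self):
        # Tiers ordered fastest to slowest: GPU memory, CPU DRAM, local disk.
        self.tiers = {"gpu": {}, "cpu": {}, "disk": {}}

    def get(self, chunk_hash: str):
        """Return (tier_name, kv_chunk) on a hit, or (None, None) on a miss."""
        for name, store in self.tiers.items():
            if chunk_hash in store:
                return name, store[chunk_hash]  # hit: the engine can skip prefill for this chunk
        return None, None                       # miss: the engine recomputes and stores the chunk

    def put(self, chunk_hash: str, kv_chunk, tier: str = "cpu"):
        self.tiers[tier][chunk_hash] = kv_chunk
```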
Key features
- High-performance KV cache management across multiple tiers: GPU, CPU, and disk.
- Cross-instance and P2P KV cache sharing to reuse computation across different serving processes or machines.
- Support for non-prefix KV caches (not limited to prefix-based reuse).
- Integrations with vLLM (native features like CPU KV offloading, disaggregated prefill, and P2P sharing).
- Multiple storage backends, including CPU, local disk, and NIXL (a configuration sketch follows this list).
- Simple installation via pip, with compatibility notes for common GPU/Linux environments.
- Production-oriented tooling and ecosystem support (works with vLLM production stack, llm-d, KServe).
- Open-source license (Apache-2.0) and active community resources: docs, examples, blog, Slack, and bi-weekly community meetings.
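To make the tiering concrete, here is a hedged sketch of selecting storage backends through environment variables before the serving engine starts. The variable names and units follow the configuration pattern in the LMCache docs, but treat the exact names as assumptions and verify them against the documentation for your version:

```python
import os

# Hedged example: configure LMCache's storage tiers via environment variables
# before launching the serving engine. Verify exact names/units in the LMCache docs.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"                        # tokens per KV-cache chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"                        # enable CPU DRAM offloading
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"                # CPU cache budget (GB)
os.environ["LMCACHE_LOCAL_DISK"] = "file:///tmp/lmcache_disk/"  # optional local-disk tier (assumed format)
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "20.0"              # disk cache budget (GB)
```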
Why it matters
Serving large models at scale—especially for long-context and multi-turn interactions—can be extremely costly, because repeated token sequences and reused content force the engine to recompute the same expensive key/value caches on every request. LMCache stores and reuses these caches across storage tiers and instances so that reused text does not trigger full recomputation on the GPU. Combined with an optimized serving engine such as vLLM, LMCache can deliver substantial latency and GPU-cycle savings; the authors report improvements in the 3–10x range for many RAG and multi-round QA workloads.
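A rough back-of-envelope illustration (assumed numbers, not measurements) of why reuse helps: prefill compute grows roughly linearly with the number of prompt tokens that must actually be recomputed, so serving cached KV for most of a long prompt removes most of the prefill work.

```python
# Back-of-envelope sketch with assumed numbers; not a benchmark of LMCache.
params = 8e9            # model size in parameters (assumed)
prompt_tokens = 16000   # long RAG-style prompt (assumed)
cached_tokens = 12000   # tokens whose KV cache is reused instead of recomputed (assumed)

prefill_flops = lambda t: 2 * params * t  # ~2*N FLOPs per token for a dense decoder (rule of thumb)
saved = 1 - prefill_flops(prompt_tokens - cached_tokens) / prefill_flops(prompt_tokens)
print(f"~{saved:.0%} of prefill compute avoided for this request")  # ~75% in this example
```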
Integrations & Ecosystem
- Deep integration with vLLM to provide CPU KV offloading, disaggregated prefill, and P2P sharing.
- Supported in the vLLM production stack and in frameworks such as llm-d and KServe, easing enterprise deployment.
- Examples, quickstarts, and documentation available to help developers integrate LMCache into their serving pipelines.
Typical use cases
- Multi-round conversational agents where earlier turns and shared context are repeatedly re-sent and reused (see the client-side sketch after this list).
- Retrieval-augmented generation (RAG) where retrieved passages are reused across requests.
- High-throughput LLM inference deployments trying to minimize GPU compute and TTFT.
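The multi-turn case can be illustrated purely from the client side: each request re-sends the growing conversation history to an OpenAI-compatible vLLM endpoint, and with LMCache enabled the KV cache for earlier turns becomes a reuse target instead of being recomputed. The endpoint URL and model name below are placeholders.

```python
from openai import OpenAI

# Placeholder endpoint/model: a vLLM server (with LMCache enabled) exposing the OpenAI API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
model = "meta-llama/Llama-3.1-8B-Instruct"

history = [{"role": "system", "content": "You are a concise assistant."}]
for question in ["Summarize the design doc.", "What are the open risks?"]:
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model=model, messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    # Every turn re-sends the full history; the KV cache for earlier turns can be served from cache.
```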
Installation & Getting Started
Installable via pip (pip install lmcache). The project provides detailed docs and quickstart examples covering different serving engines, compatibility notes (e.g., torch version mismatches), and integrations with vLLM.
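As a starting point, here is a minimal offline-inference sketch based on the vLLM integration examples in the LMCache docs; the connector and config names can vary across vLLM and LMCache versions, so treat them as assumptions and check the quickstart for your setup.

```python
# Hedged quickstart sketch: route vLLM's KV caches through LMCache.
# Connector/class names are taken from documented examples and may differ by version.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

kv_config = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=kv_config,              # hand KV caches to LMCache
    gpu_memory_utilization=0.8,
)

shared_context = "A long document that many requests will reuse..."
prompts = [shared_context + "\n\nQuestion: what is the main conclusion?"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```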
Community, Citation & License
LMCache is actively developed and documented (docs and blog). The repository cites related academic work (CacheGen, CacheBlend, and the LMCache paper), is licensed under Apache-2.0, and maintains CI, tests, and community channels (Slack, YouTube recordings, and regular meetings).
