LMCache

LMCache is an open-source, high-performance KV (key-value) cache layer designed to accelerate LLM serving and inference, especially for long-context scenarios. By storing and reusing KV caches across GPU, CPU DRAM, and local disk, and by enabling cross-instance sharing, LMCache reduces time-to-first-token (TTFT) and GPU usage. It integrates tightly with vLLM; supports P2P cache sharing, non-prefix caches, and multiple storage backends (CPU, disk, NIXL); and is distributed under the Apache-2.0 license.

Introduction

LMCache — High-performance KV Cache Layer for LLM Serving

Overview

LMCache is an open-source caching layer specifically built to speed up large language model (LLM) serving. It focuses on reusing computed KV caches (key/value activations) across different storage tiers (GPU, CPU DRAM, local disk) and across serving instances to reduce redundant computation, lower time-to-first-token (TTFT), and increase overall throughput. LMCache is designed to be engine-agnostic but offers deep integration with vLLM for optimized performance.

Key features
  • High-performance KV cache management across multiple tiers: GPU, CPU, and disk (a configuration sketch follows this list).
  • Cross-instance and P2P KV cache sharing to reuse computation across different serving processes or machines.
  • Support for non-prefix KV caches (not limited to prefix-based reuse).
  • Integrations with vLLM (native features like CPU KV offloading, disaggregated prefill, and P2P sharing).
  • Multiple storage backends, including CPU, local disk, and NIXL.
  • Installation via pip and compatibility notes for common GPU/Linux environments.
  • Production-oriented tooling and ecosystem support (works with vLLM production stack, llm-d, KServe).
  • Open-source license (Apache-2.0) and active community resources: docs, examples, blog, Slack, and bi-weekly community meetings.
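
As a rough illustration of the multi-tier feature, the sketch below configures CPU DRAM offloading plus a local-disk spill tier through LMCache's environment-variable configuration. It is a minimal sketch, not a definitive setup: the LMCACHE_* variable names and the example sizes and paths are assumptions based on the project docs and may differ across versions.

    import os

    # Minimal sketch: multi-tier KV cache configuration via environment variables.
    # The LMCACHE_* key names follow the LMCache docs but may vary by release;
    # treat them as illustrative and check the documentation for your version.
    os.environ["LMCACHE_CHUNK_SIZE"] = "256"             # tokens per KV-cache chunk
    os.environ["LMCACHE_LOCAL_CPU"] = "True"             # enable CPU DRAM offloading
    os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"     # CPU cache budget (GB)
    os.environ["LMCACHE_LOCAL_DISK"] = "file:///tmp/lmcache_disk/"  # example local-disk tier
    os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "20.0"   # disk cache budget (GB)

    # These variables are read by the LMCache engine when the serving process
    # (for example, vLLM with the LMCache connector) starts in the same environment.

The settings only take effect if they are present in the environment of the serving process; the docs also describe an equivalent YAML file referenced via LMCACHE_CONFIG_FILE.
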
Why it matters

Serving large models at scale, especially for long-context and multi-turn interactions, can be extremely costly because repeated token sequences and reused content would otherwise force the expensive key/value caches to be recomputed on the GPU for every request. LMCache stores and reuses these caches across storage tiers and instances so that reused text does not trigger full recomputation. Combined with optimized serving engines (e.g., vLLM), LMCache can deliver substantial latency and GPU-cycle savings; the authors report improvements commonly in the 3–10x range for RAG and multi-round QA use cases.

Integrations & Ecosystem
  • Deep integration with vLLM to provide CPU KV offloading, disaggregated prefill, and P2P sharing (see the sketch after this list).
  • Supported in production stacks such as the vLLM production stack and frameworks like llm-d and KServe, enabling easier enterprise deployment.
  • Examples, quickstarts, and documentation available to help developers integrate LMCache into their serving pipelines.
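
As a concrete illustration, the sketch below shows how LMCache is typically wired into vLLM's offline API through a KV-transfer connector. It is a hedged sketch rather than the definitive integration: the connector name ("LMCacheConnectorV1"), the KVTransferConfig import path, and the example model are taken from the project's quickstart examples and may differ across vLLM and LMCache versions.

    # Minimal sketch: routing vLLM's KV cache traffic through LMCache.
    # Assumes a recent vLLM with KVTransferConfig and the LMCacheConnectorV1
    # connector shipped by the lmcache package; names may vary by version.
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    # Let this instance both store KV caches into LMCache and retrieve them.
    ktc = KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    )

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # example model; use your own
        kv_transfer_config=ktc,
        gpu_memory_utilization=0.8,
    )

    prompts = ["Summarize the following document: ..."]
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))
    for out in outputs:
        print(out.outputs[0].text)

When the same long context appears again, in this process or (with a shared backend) in another instance, its KV cache can be loaded from LMCache instead of being recomputed, which is where the TTFT savings come from.
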
Typical use cases
  • Multi-round conversational agents where contexts or earlier turns are repeatedly accessed.
  • Retrieval-augmented generation (RAG) where retrieved passages are reused across requests.
  • High-throughput LLM inference deployments trying to minimize GPU compute and TTFT.
Installation & Getting Started

Installable via pip (pip install lmcache). The project provides detailed docs and quickstart examples that cover different serving engines, compatibility notes (e.g., torch version mismatches), and example integrations with vLLM; a quick post-install check is sketched below.
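
For a quick sanity check after installation, the snippet below uses only the Python standard library; it assumes the PyPI distribution name lmcache from the pip command above and does not call any LMCache API.

    # Post-install check using only the standard library.
    from importlib.metadata import version, PackageNotFoundError

    try:
        print("lmcache version:", version("lmcache"))
    except PackageNotFoundError:
        print("lmcache is not installed; run `pip install lmcache` first.")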

Community, Citation & License

LMCache is actively developed and documented (docs and blog). The repo includes citations to related academic work (CacheGen, CacheBlend, and the LMCache paper), is licensed under Apache-2.0, and maintains CI, tests, and community channels (Slack, YouTube recordings, and bi-weekly meetings).

Information

  • Website: github.com
  • Authors: LMCache
  • Published date: 2024/05/28