LMCache — High-performance KV Cache Layer for LLM Serving
Overview
LMCache is an open-source caching layer specifically built to speed up large language model (LLM) serving. It focuses on reusing computed KV caches (key/value activations) across different storage tiers (GPU, CPU DRAM, local disk) and across serving instances to reduce redundant computation, lower time-to-first-token (TTFT), and increase overall throughput. LMCache is designed to be engine-agnostic but offers deep integration with vLLM for optimized performance.
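As a mental model only (not LMCache's actual API), the core idea can be sketched as a lookup that walks the cache tiers from fastest to slowest; all class and method names below are hypothetical:

```python
# Toy illustration of tiered KV-cache lookup; names are illustrative, not LMCache's API.
class ToyTieredKVCache:
    def __init__(self):
        # Tiers ordered fastest to slowest: GPU memory, CPU DRAM, local disk.
        self.tiers = {"gpu": {}, "cpu": {}, "disk": {}}

    def get(self, chunk_hash: str):
        """Return (tier_name, kv_chunk) on a hit, or (None, None) on a miss."""
        for name, store in self.tiers.items():
            if chunk_hash in store:
                return name, store[chunk_hash]  # hit: the engine can skip prefill for this chunk
        return None, None                       # miss: the engine recomputes and stores the chunk

    def put(self, chunk_hash: str, kv_chunk, tier: str = "cpu"):
        self.tiers[tier][chunk_hash] = kv_chunk
```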
Key features
- High-performance KV cache management across multiple tiers: GPU, CPU, and disk.
- Cross-instance and P2P KV cache sharing to reuse computation across different serving processes or machines.
- Support for non-prefix KV caches (not limited to prefix-based reuse).
- Integrations with vLLM (native features like CPU KV offloading, disaggregated prefill, and P2P sharing).
- Multiple storage backends, including CPU, local disk, and NIXL (a configuration sketch follows this list).
- Simple installation via pip, with compatibility notes for common GPU/Linux environments.
- Production-oriented tooling and ecosystem support (works with vLLM production stack, llm-d, KServe).
- Open-source license (Apache-2.0) and active community resources: docs, examples, blog, Slack, and bi-weekly community meetings.
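To make the tiering concrete, here is a hedged sketch of selecting storage backends through environment variables before the serving engine starts. The variable names and units follow the configuration pattern in the LMCache docs, but treat the exact names as assumptions and verify them against the documentation for your version:

```python
import os

# Hedged example: configure LMCache's storage tiers via environment variables
# before launching the serving engine. Verify exact names/units in the LMCache docs.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"                        # tokens per KV-cache chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"                        # enable CPU DRAM offloading
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"                # CPU cache budget (GB)
os.environ["LMCACHE_LOCAL_DISK"] = "file:///tmp/lmcache_disk/"  # optional local-disk tier (assumed format)
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "20.0"              # disk cache budget (GB)
```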
Why it matters
Serving large models at scale—especially for long-context and multi-turn interactions—can be extremely costly, because repeated token sequences and reused content force the engine to recompute the same expensive key/value caches on every request. LMCache stores and reuses these caches across storage tiers and instances so that reused text does not trigger full recomputation on the GPU. Combined with an optimized serving engine such as vLLM, LMCache can deliver substantial latency and GPU-cycle savings; the authors report improvements in the 3–10x range for many RAG and multi-round QA workloads.
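A rough back-of-envelope illustration (assumed numbers, not measurements) of why reuse helps: prefill compute grows roughly linearly with the number of prompt tokens that must actually be recomputed, so serving cached KV for most of a long prompt removes most of the prefill work.

```python
# Back-of-envelope sketch with assumed numbers; not a benchmark of LMCache.
params = 8e9            # model size in parameters (assumed)
prompt_tokens = 16000   # long RAG-style prompt (assumed)
cached_tokens = 12000   # tokens whose KV cache is reused instead of recomputed (assumed)

prefill_flops = lambda t: 2 * params * t  # ~2*N FLOPs per token for a dense decoder (rule of thumb)
saved = 1 - prefill_flops(prompt_tokens - cached_tokens) / prefill_flops(prompt_tokens)
print(f"~{saved:.0%} of prefill compute avoided for this request")  # ~75% in this example
```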
Integrations & Ecosystem
- Deep integration with vLLM to provide CPU KV offloading, disaggregated prefill, and P2P sharing.
- Supported in the vLLM production stack and in frameworks such as llm-d and KServe, easing enterprise deployment.
- Examples, quickstarts, and documentation available to help developers integrate LMCache into their serving pipelines.
Typical use cases
- Multi-round conversational agents where earlier turns and shared context are repeatedly re-sent and reused (see the client-side sketch after this list).
- Retrieval-augmented generation (RAG) where retrieved passages are reused across requests.
- High-throughput LLM inference deployments trying to minimize GPU compute and TTFT.
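The multi-turn case can be illustrated purely from the client side: each request re-sends the growing conversation history to an OpenAI-compatible vLLM endpoint, and with LMCache enabled the KV cache for earlier turns becomes a reuse target instead of being recomputed. The endpoint URL and model name below are placeholders.

```python
from openai import OpenAI

# Placeholder endpoint/model: a vLLM server (with LMCache enabled) exposing the OpenAI API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
model = "meta-llama/Llama-3.1-8B-Instruct"

history = [{"role": "system", "content": "You are a concise assistant."}]
for question in ["Summarize the design doc.", "What are the open risks?"]:
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model=model, messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    # Every turn re-sends the full history; the KV cache for earlier turns can be served from cache.
```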
Installation & Getting Started
Installable via pip (pip install lmcache). The project provides detailed docs and quickstart examples covering different serving engines, compatibility notes (e.g., torch version mismatches), and integrations with vLLM.
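As a starting point, here is a minimal offline-inference sketch based on the vLLM integration examples in the LMCache docs; the connector and config names can vary across vLLM and LMCache versions, so treat them as assumptions and check the quickstart for your setup.

```python
# Hedged quickstart sketch: route vLLM's KV caches through LMCache.
# Connector/class names are taken from documented examples and may differ by version.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

kv_config = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=kv_config,              # hand KV caches to LMCache
    gpu_memory_utilization=0.8,
)

shared_context = "A long document that many requests will reuse..."
prompts = [shared_context + "\n\nQuestion: what is the main conclusion?"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```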
Community, Citation & License
LMCache is actively developed and documented (docs and blog). The repository cites related academic work (CacheGen, CacheBlend, and the LMCache paper), is licensed under Apache-2.0, and maintains CI, tests, and community channels (Slack, YouTube recordings, and regular meetings).
