Mooncake

Distributed KV-cache store & transfer engine that decouples prefilling from decoding to scale vLLM serving clusters.


Introduction

Overview

Mooncake transfers KV-cache tensors across GPUs and nodes so that multiple inference servers can share prefill work and reduce latency.
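The idea of sharing prefill work through a common KV store can be illustrated with a toy sketch. It is not Mooncake's API: `ToyKVStore`, `prefix_key`, `prefill`, and `decode` are hypothetical names, the store is a plain in-process dictionary rather than a distributed service, and the "KV tensor" is a placeholder list of floats.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class ToyKVStore:
    """Hypothetical in-memory stand-in for a shared KV-cache store."""

    def __init__(self) -> None:
        self._blocks: Dict[str, List[float]] = {}

    def put(self, key: str, kv: List[float]) -> None:
        self._blocks[key] = kv

    def get(self, key: str) -> Optional[List[float]]:
        return self._blocks.get(key)


def prefix_key(token_ids: List[int]) -> str:
    """Derive a store key from the prompt tokens (assumed keying scheme)."""
    raw = ",".join(str(t) for t in token_ids).encode()
    return hashlib.sha256(raw).hexdigest()


def fake_kv(token_ids: List[int]) -> List[float]:
    """Placeholder for the expensive prefill computation."""
    return [t * 0.5 for t in token_ids]


def prefill(store: ToyKVStore, token_ids: List[int]) -> str:
    """Prefill worker: compute the KV cache once and publish it."""
    key = prefix_key(token_ids)
    store.put(key, fake_kv(token_ids))
    return key


def decode(store: ToyKVStore, token_ids: List[int]) -> Tuple[List[float], bool]:
    """Decode worker: reuse published KV if present, else recompute locally."""
    cached = store.get(prefix_key(token_ids))
    if cached is not None:
        return cached, True
    return fake_kv(token_ids), False


if __name__ == "__main__":
    store = ToyKVStore()
    prefill(store, [1, 2, 3])
    kv, reused = decode(store, [1, 2, 3])
    print(kv, reused)
```

In a real deployment the store and the workers would live on different machines and the lookup would involve a network or RDMA transfer; the sketch only shows the control flow of "prefill once, reuse on decode".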

Key Capabilities

Trace-based prefill disaggregation
P2P store & vLLM integration
Transfer-engine plug-in architecture


Information

Website: kvcache-ai.github.io
Authors: KVCache-AI Team
Published date: 2024/11/28

More Items

Foundry Local

2024

Microsoft

Foundry Local is an open-source tool from Microsoft that enables running generative AI models on local devices without an Azure subscription. It supports on-device processing for privacy and security, integrates models via an OpenAI-compatible API, and optimizes performance using ONNX Runtime and hardware acceleration.

microsoft ai-inference ai-serving llm ai-client +2

ONNX

2017

ONNX Project Contributors, Meta (Facebook) +1

ONNX (Open Neural Network Exchange) is an open ecosystem that provides an open source format for AI models, including deep learning and traditional ML. It defines an extensible computation graph model, built-in operators, and standard data types, focusing on inferencing capabilities. Widely supported across frameworks and hardware, it enables interoperability and accelerates AI innovation.

ai-framework mlops ai-inference ai-serving pytorch +2

LightX2V

2025

LightX2V Contributors, ModelTC

LightX2V is an advanced lightweight video generation inference framework engineered to deliver efficient, high-performance video synthesis solutions. This unified platform integrates multiple state-of-the-art video generation techniques, supporting diverse generation tasks including text-to-video (T2V) and image-to-video (I2V). X2V represents the transformation of different input modalities (X, such as text or images) into video output (V).

github ai-video ai-tools ai-inference huggingface +2