Overview
Mooncake moves KV tensors across GPUs or nodes so that multiple inference servers can share prefill work and reduce latency.
Key Capabilities
- Trace-based prefill disaggregation
- P2P store & vLLM integration
- Transfer-engine plug-in architecture
Distributed KV-cache store & transfer engine that decouples prefilling from decoding to scale vLLM serving clusters.
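
The idea of decoupling prefilling from decoding can be illustrated with a toy, single-process sketch. This is not Mooncake's API: `ToyKVStore`, `PrefillWorker`, `DecodeWorker`, and `prefix_key` are hypothetical names, the store is an in-memory dictionary rather than a distributed one, and the "KV tensors" are placeholder float pairs. It only shows the data flow in which one worker produces a KV block, publishes it under a key, and another worker consumes it.

```python
import hashlib
from dataclasses import dataclass, field


def prefix_key(token_ids):
    """Hash a token prefix into a store key (hypothetical keying scheme)."""
    raw = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


@dataclass
class ToyKVStore:
    """In-memory stand-in for a distributed KV-cache store (assumption)."""
    _blocks: dict = field(default_factory=dict)

    def put(self, key, kv_block):
        self._blocks[key] = kv_block

    def get(self, key):
        return self._blocks.get(key)


class PrefillWorker:
    """Runs the prompt once and publishes the resulting KV block."""

    def __init__(self, store):
        self.store = store
        self.prefills_run = 0

    def prefill(self, token_ids):
        key = prefix_key(token_ids)
        if self.store.get(key) is None:
            # Placeholder for real attention KV tensors: one (k, v) pair per token.
            kv_block = [(float(t), float(t) * 2.0) for t in token_ids]
            self.store.put(key, kv_block)
            self.prefills_run += 1
        return key


class DecodeWorker:
    """Fetches a published KV block instead of recomputing the prompt."""

    def __init__(self, store):
        self.store = store

    def decode_step(self, key):
        kv_block = self.store.get(key)
        if kv_block is None:
            raise KeyError("KV block not found; prefill must run first")
        # A real decoder would attend over kv_block; here we only report its length.
        return len(kv_block)


store = ToyKVStore()
prefill = PrefillWorker(store)
decode = DecodeWorker(store)

prompt = [101, 7592, 2088, 102]
key = prefill.prefill(prompt)
key_again = prefill.prefill(prompt)  # same prefix: the cached block is reused
context_len = decode.decode_step(key)
```

In this sketch the second `prefill` call does no work because the block for that prefix is already in the store, which is the sharing effect described above. A real deployment would replace the dictionary with a networked store and a transfer engine that moves tensors between GPUs or nodes.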
ONNX (Open Neural Network Exchange) is an open ecosystem that provides an open source format for AI models, including deep learning and traditional ML. It defines an extensible computation graph model, built-in operators, and standard data types, focusing on inferencing capabilities. Widely supported across frameworks and hardware, it enables interoperability and accelerates AI innovation.