Overview
LMDeploy integrates a patched Triton backend and a web UI, and reports 10-15× speed-ups on InternLM and other models.
Key Capabilities
- Post-training quantization (PTQ) and AWQ quantization flows
- Multi-GPU tensor and pipeline parallelism
- OpenAI-compatible FastAPI server
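As a hedged illustration of the OpenAI-compatible interface, the sketch below assembles a standard /v1/chat/completions request payload as a plain dict; the model name and message content are placeholders for illustration, not values defined by LMDeploy itself.

```python
import json

def build_chat_request(model, user_message, temperature=0.7):
    """Assemble an OpenAI-style chat-completions payload (illustrative sketch)."""
    return {
        "model": model,  # placeholder model name, not an LMDeploy-mandated value
        "messages": [
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
    }

# Serialize to the JSON body an OpenAI-compatible server would accept.
payload = build_chat_request("internlm-chat-7b", "Hello!")
body = json.dumps(payload)
```

Any OpenAI-style client can then POST this body to the server's /v1/chat/completions route.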
LMDeploy is a toolkit from the InternLM team for compressing, quantizing, and serving LLMs with INT4/INT8 kernels on GPUs.
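To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python; the scheme shown (scale by the maximum absolute weight, round, clamp to [-128, 127]) is a generic textbook method, not LMDeploy's actual kernel implementation.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization (generic sketch, not LMDeploy's code)."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from INT8 values and the scale."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)        # integers in [-128, 127] plus one float scale
w_hat = dequantize_int8(q, s)  # close to the original weights
```

INT4 flows such as AWQ follow the same quantize/dequantize pattern with a 16-value range and per-group scales.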
ONNX (Open Neural Network Exchange) is an open ecosystem that provides an open-source format for AI models, covering both deep learning and traditional ML. It defines an extensible computation graph model, built-in operators, and standard data types, with a focus on inference. Widely supported across frameworks and hardware, it enables interoperability and accelerates AI innovation.
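To make the graph-of-operators idea concrete, the sketch below mimics an ONNX-style computation graph using plain Python dicts; real ONNX graphs are protobuf messages built with the onnx.helper API, so the field names here are only an illustrative mirror of that structure.

```python
# Illustrative only: a dict mirroring the shape of an ONNX graph
# (named tensors flowing between nodes that use built-in operators).
graph = {
    "inputs": ["X"],
    "initializers": ["W"],  # weights stored inside the model file
    "nodes": [
        {"op_type": "MatMul", "inputs": ["X", "W"], "outputs": ["H"]},
        {"op_type": "Relu",   "inputs": ["H"],      "outputs": ["Y"]},
    ],
    "outputs": ["Y"],
}

def op_sequence(g):
    """List operator types in graph order (a trivial traversal for illustration)."""
    return [node["op_type"] for node in g["nodes"]]
```

A runtime consuming such a graph dispatches each built-in operator (MatMul, Relu, ...) to a hardware-specific kernel, which is what makes the format portable across frameworks and devices.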