Overview
LMDeploy is a toolkit from the InternLM team for compressing, quantizing, and serving LLMs with INT4/INT8 kernels on GPUs. It couples the TurboMind inference engine with a web UI, delivering reported 10-15× speed-ups on InternLM and other models.
Key Capabilities
- PTQ & AWQ quantization flows
- Multi-GPU tensor & pipeline parallel
- OpenAI-compatible FastAPI server
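As a rough illustration of the quantization flow, the sketch below loads a model whose weights were already quantized to 4-bit AWQ (typically produced with the `lmdeploy lite auto_awq` CLI) and runs it through the offline pipeline API. The model ID is a placeholder, not a prescribed checkpoint.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Tell the engine the weights are in AWQ (4-bit) format; the model ID is a placeholder.
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('internlm/internlm2-chat-7b-4bits', backend_config=engine_config)

# Batched offline inference over a list of prompts.
responses = pipe(['Explain INT4 weight quantization in one sentence.'])
print(responses[0].text)
```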
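For multi-GPU inference, the same pipeline API accepts a tensor-parallel degree; the API server exposes the equivalent `--tp` flag. A minimal sketch, assuming two visible GPUs and a placeholder model ID (only tensor parallelism is shown here):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Shard the weights across 2 GPUs with tensor parallelism; tp should match the GPU count.
pipe = pipeline('internlm/internlm2-chat-20b',  # placeholder model ID
                backend_config=TurbomindEngineConfig(tp=2))

print(pipe(['Summarize tensor parallelism in one sentence.'])[0].text)
```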
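Because the server speaks the OpenAI REST protocol, the official `openai` Python client can talk to it directly. A hedged sketch, assuming the server was started with `lmdeploy serve api_server <model> --server-port 23333`:

```python
from openai import OpenAI

# Point the OpenAI client at the local LMDeploy server; the API key is not checked by default.
client = OpenAI(base_url='http://localhost:23333/v1', api_key='none')

# Ask the server which model it is serving.
model_name = client.models.list().data[0].id

resp = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hello! What can you do?'}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```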