AI Deploy2023

MLC LLM

Compiles one LLM into device-native binaries running on CUDA, ROCm, Metal, Vulkan, WebGPU, and CPU — same model from server to browser to phone. On Apache TVM, it ships MLCEngine with an OpenAI-compatible API across Python, JS, REST, iOS, and Android.

Visit Website

Introduction

Most LLM stacks assume a CUDA datacenter; the awkward truth is that the same weights rarely follow you to a laptop GPU, a browser tab, or a phone. MLC LLM treats deployment as a compilation problem instead of a runtime one: it lowers a model through Apache TVM into a hardware-specific binary, so the place a model runs becomes a build target rather than a rewrite.

What Sets It Apart

One source model, many native binaries — CUDA, ROCm, Metal, Vulkan, OpenCL, WebGPU, and CPU all come out of the same compilation flow, so adding a backend is a retarget, not a port.
MLCEngine unifies these targets behind one OpenAI-compatible API exposed through REST, Python, JavaScript, iOS, and Android, meaning client code written against the cloud also drives the on-device build.
Because TVM emits self-contained binaries, models run fully local — including inside a browser via WebGPU with no server round trip — which matters for privacy, offline use, and cost.

Who It's For

Great fit if you need the same model to reach edge, mobile, web, and server, or if local/offline inference and data privacy outweigh the convenience of a hosted endpoint. Look elsewhere if you only ever serve on one cloud GPU type — a runtime like vLLM will get you there with less compilation overhead — or if you want a polished chat product rather than a deployment engine you wire into your own app.

Back

Information

Websitellm.mlc.ai
OrganizationsCarnegie Mellon University (Catalyst), University of Washington (SAMPL), Shanghai Jiao Tong University, OctoML
AuthorsMLC team (mlc.ai)
Published date2023/04/29

More Items

AI Deploy2018

Triton Inference Server

NVIDIA Corporation

Serves machine learning and deep learning models for cloud, data center, edge and embedded environments. Supports multiple frameworks and backends, dynamic and sequence batching, HTTP/gRPC APIs, Docker deployment and NVIDIA-optimized runtimes.

nvidia ai-inference ai-serving tensorrt pytorch+5

AI Deploy2026

codex-lb

Soju06

Pools multiple ChatGPT/Codex accounts behind a local OpenAI-compatible proxy and dashboard — provides request load balancing, per-account usage/cost tracking, API-key management, and configurable routing strategies.

codex chatgpt ai-api ai-api-management mLOps+5

AI Video2025

Y2A-Auto

Automatically transfers YouTube videos to AcFun and bilibili with an end-to-end pipeline: downloading, ASR, subtitle translation and QC, AI-generated metadata, content moderation, and automated uploads; includes a web dashboard and monitoring.

video ai-tools docker ASR translation+5