vllm-omni

vLLM-Omni is an open-source framework from the vLLM community for efficient inference and serving of omni-modality models. It extends vLLM's fast autoregressive serving to multi-modal data (text, image, video, audio), non-autoregressive architectures, and heterogeneous outputs, and it integrates with Hugging Face models while offering pipeline parallelism, KV-cache optimizations, and an OpenAI-compatible API.

Introduction

Overview

vLLM-Omni is an extension of the vLLM ecosystem focused on serving and inference for omni-modality (multi-modal) models. While vLLM originally targeted large autoregressive text models, vLLM-Omni broadens that capability to support images, video, and audio, plus non-autoregressive generation architectures such as diffusion transformers.

Key features
  • Omni-modality support: processes text, images, video and audio within the same serving framework.
  • Multi-architecture support: handles autoregressive and non-autoregressive models (e.g., DiT, diffusion-like models) and heterogeneous outputs (text, images, multimodal responses).
  • Performance optimizations: inherits vLLM's efficient KV-cache management for autoregressive models, implements pipelined stage execution to increase throughput, and supports disaggregated execution with dynamic resource allocation.
  • Flexible pipeline abstraction: provides heterogeneous pipeline primitives to compose complex model workflows and to integrate multiple stages (pre/post-processing, model stages, decoders).
  • Integration with Hugging Face: seamless support for many open-source models available on Hugging Face, including omni models such as Qwen-Omni and Qwen-Image.
  • Scalability & parallelism: supports tensor/pipeline/data/expert parallelism for distributed inference.
  • Developer ergonomics & APIs: streaming outputs, an OpenAI-compatible API server (a client sketch follows this list), and documentation/quickstart guides for easy adoption.
  • Open license: distributed under the Apache License 2.0.
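
Because the server speaks the OpenAI API, clients can reuse the standard openai Python SDK. The sketch below is a minimal, hypothetical example: the base URL, port, placeholder API key, model name (Qwen/Qwen2.5-Omni-7B), and image URL are assumptions about one particular deployment, not values documented by the project.

    # Hypothetical client request against a vLLM-Omni OpenAI-compatible server.
    # Assumes a server is already running on localhost:8000 and serving an omni
    # model; the model name and image URL below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="Qwen/Qwen2.5-Omni-7B",  # placeholder model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }],
        stream=True,  # streaming outputs, as listed above
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)

Non-streaming requests work the same way with stream=False, in which case the full reply is available on response.choices[0].message.content.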

Typical uses
  • Deploying multi-modal foundation models for production inference (e.g., vision+language assistants); an offline-inference sketch follows this list.
  • Serving diffusion/parallel-generation models with high throughput and lower latency.
  • Building pipelines that combine multiple model types or modalities and need coordinated resource allocation.
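
For the first use case, offline batch inference can also be scripted directly in Python. Since vLLM-Omni builds on vLLM, the sketch below follows vLLM's existing multimodal offline-inference pattern (LLM.generate with a multi_modal_data field); whether vLLM-Omni exposes exactly this interface, and the model name and prompt template used here, are assumptions made for illustration.

    # Illustrative offline-inference sketch in the style of vLLM's Python API.
    # The model name and chat-style prompt template are placeholders; consult the
    # project's supported-models docs for the exact format each model expects.
    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-Omni-7B")  # placeholder omni model

    image = Image.open("cat.jpg")
    prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {"image": image},  # vLLM's multimodal input field
        },
        SamplingParams(temperature=0.2, max_tokens=128),
    )

    print(outputs[0].outputs[0].text)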

Who it's for
  • MLOps and infrastructure engineers who need a performant, production-ready inference stack for multi-modal models.
  • Researchers and developers who want to prototype or serve multi-modal models with Hugging Face compatibility and OpenAI-style APIs.
Documentation & community
  • Documentation / Quickstart: the project provides hosted docs and guides for installation, supported models, and contribution.
  • Community: a vLLM user forum and developer Slack are available for support and discussion.
License

vLLM-Omni is released under the Apache License 2.0.
