Toolkit from InternLM for compressing, quantizing and serving LLMs with INT4/INT8 kernels on GPUs.
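A minimal sketch of the offline-inference flow, assuming this entry refers to LMDeploy and its `pipeline` API; the model ID and prompt are placeholders.

```python
from lmdeploy import pipeline

# Loads the model and selects an inference backend; the model ID is a placeholder.
pipe = pipeline("internlm/internlm2_5-7b-chat")

responses = pipe(["Explain INT4 weight-only quantization in one sentence."])
print(responses[0].text)

# For a pre-quantized AWQ INT4 checkpoint, pass
# backend_config=TurbomindEngineConfig(model_format="awq") (per the LMDeploy docs).
```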
Open-source platform for building and operating AI-native apps with agentic workflows, RAG pipelines, model management and observability.
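Apps built on the platform are typically consumed over a small REST API; a hedged sketch, assuming this entry refers to Dify and its documented `chat-messages` endpoint (the URL and app key are placeholders for your deployment).

```python
import requests

API_KEY = "app-..."  # app-level key from the Dify console (placeholder)

resp = requests.post(
    "https://api.dify.ai/v1/chat-messages",  # or your self-hosted instance's base URL
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "inputs": {},                  # variables defined in the app's workflow
        "query": "What does our refund policy say?",
        "response_mode": "blocking",   # "streaming" returns server-sent events instead
        "user": "user-123",            # opaque end-user ID for analytics
    },
    timeout=60,
)
print(resp.json()["answer"])
```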
TypeScript toolkit from Vercel for building streaming, multi-provider AI applications across React, Next.js, Vue, Svelte, and more.
One API is a self-hosted key-management and distribution gateway that unifies OpenAI-style access to dozens of LLM providers, enabling centralized quota, billing and user management through a single binary or Docker image.
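Because the gateway speaks the OpenAI wire protocol, any OpenAI SDK can point at it unchanged; a sketch assuming a local deployment on One API's default port 3000, with a placeholder token and model.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",  # your One API gateway (3000 is the default port)
    api_key="sk-...",                     # a token issued by One API, not a provider key
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # One API routes this to whichever upstream channel serves it
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```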
Xorbits’ universal inference layer (published as `xinference`) that deploys and serves LLMs and multimodal models from laptop to cluster.
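Xinference also exposes an OpenAI-compatible endpoint, so standard client code works against a local supervisor; a sketch assuming the default port 9997 and a model already started with `xinference launch` (model name is a placeholder).

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9997/v1",  # Xinference's default endpoint
    api_key="not-needed",                 # local deployments without auth ignore the key
)

resp = client.chat.completions.create(
    model="qwen2.5-instruct",  # placeholder: whichever model you launched
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```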
LiteLLM is an open-source LLM gateway and Python SDK that lets developers call more than 100 commercial and open-source models through a single OpenAI-compatible interface, complete with cost tracking, rate-limiting, load-balancing and guardrails.
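The SDK's core call mirrors `openai.chat.completions.create` but takes a provider-prefixed model string; a minimal sketch (model names are examples only, with API keys supplied via environment variables).

```python
from litellm import completion

messages = [{"role": "user", "content": "Summarize LiteLLM in one line."}]

# The call shape is identical for every provider; only the model string changes.
resp = completion(model="gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)

# e.g. switch to Anthropic by swapping the model string:
# resp = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)
```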
NVIDIA’s open-source library that compiles Transformer blocks into highly optimized TensorRT engines for low-latency, high-throughput LLM inference on NVIDIA GPUs.
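Recent releases ship a high-level `LLM` API on top of the engine-building pipeline; a hedged sketch assuming that API (the model ID is a placeholder, and the first run compiles an engine before generating).

```python
from tensorrt_llm import LLM, SamplingParams

# First instantiation builds a TensorRT engine for the model (placeholder ID).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

outputs = llm.generate(
    ["The fastest way to serve a transformer is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```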
CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack.
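Assuming this entry refers to FlashInfer, its Python bindings expose single-request kernels directly; a sketch of one decode-attention step over a KV cache, with shapes chosen purely for illustration.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 2048

# One decode step: a single query token attends over the whole KV cache (GQA layout).
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

out = flashinfer.single_decode_with_kv_cache(q, k, v)  # -> [num_qo_heads, head_dim]
print(out.shape)
```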
High-performance Python framework and platform for orchestrating collaborative agent “crews”.
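A minimal two-agent crew showing the shape of the API; the roles, goals, and tasks are illustrative, and an LLM API key is expected via environment variables.

```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Collect key facts about a topic",
    backstory="A meticulous analyst.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a crisp summary",
    backstory="A concise technical writer.",
)

research = Task(
    description="List three facts about retrieval-augmented generation.",
    expected_output="Three bullet points.",
    agent=researcher,
)
summarize = Task(
    description="Write a two-sentence summary from the notes.",
    expected_output="A short paragraph.",
    agent=writer,
)

# Tasks run in order by default, each handled by its assigned agent.
crew = Crew(agents=[researcher, writer], tasks=[research, summarize])
print(crew.kickoff())
```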
FastGPT is an open-source AI knowledge-base platform that combines RAG retrieval, visual workflows and multi-model support to build domain-specific chatbots quickly.
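Published FastGPT apps can reportedly be called through an OpenAI-compatible endpoint keyed by a per-app credential; a heavily hedged sketch against a self-hosted instance (the URL, key, and nominal model field are assumptions drawn from its docs).

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/api/v1",  # placeholder self-hosted FastGPT URL
    api_key="fastgpt-...",                    # per-app key; it selects which app runs
)

resp = client.chat.completions.create(
    model="fastgpt",  # assumption: the app is chosen by the key, so this is nominal
    messages=[{"role": "user", "content": "What does the knowledge base say about pricing?"}],
)
print(resp.choices[0].message.content)
```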
Zero-code CLI & WebUI to fine-tune 100+ LLMs/VLMs with LoRA, QLoRA, PPO, DPO and more.
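Assuming this entry is LLaMA-Factory, training runs are driven by a YAML config handed to `llamafactory-cli`; a sketch that writes a minimal LoRA SFT config and launches it (model, dataset, and hyperparameters are placeholders modeled on the repo's example configs).

```python
import subprocess
import yaml

# Keys follow the example configs shipped in the LLaMA-Factory repo.
config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "lora_target": "all",
    "dataset": "alpaca_en_demo",   # demo dataset bundled with the repo
    "template": "llama3",
    "output_dir": "saves/llama3-8b-lora-sft",
    "per_device_train_batch_size": 1,
    "num_train_epochs": 1.0,
}
with open("lora_sft.yaml", "w") as f:
    yaml.safe_dump(config, f)

subprocess.run(["llamafactory-cli", "train", "lora_sft.yaml"], check=True)
```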
Microsoft Research approach that enriches RAG with knowledge-graph structure and community summaries.
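The reference implementation works from a workspace directory: an indexing pass extracts entities into a graph and writes community summaries, and queries then run in "global" (community-level) or "local" (entity-level) mode. A hedged sketch of that flow via the CLI; command names follow recent `graphrag` releases, so check your installed version.

```python
import subprocess

root = "./ragtest"  # workspace with settings.yaml and an input/ folder of documents
# (a prior `graphrag init --root ./ragtest` scaffolds settings.yaml)

# Build the knowledge graph and community summaries from the input corpus.
subprocess.run(["graphrag", "index", "--root", root], check=True)

# "Global" search reasons over community summaries; "local" drills into entities.
subprocess.run(
    ["graphrag", "query", "--root", root, "--method", "global",
     "--query", "What are the main themes across this corpus?"],
    check=True,
)
```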