llama.cpp

GGML-based C/C++ implementation that runs LLaMA-family models locally with no dependencies.

Introduction

Overview

llama.cpp enables CPU-only inference via quantized GGUF weights and ships an OpenAI-compatible HTTP server (llama-server).
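As a sketch of the server API: the request below assumes a llama-server instance already listening on its default port 8080 (e.g. started with llama-server -m model.gguf); the endpoint and JSON shape follow the OpenAI chat-completions convention, and the prompt text is purely illustrative.

    /* Query llama-server's OpenAI-compatible /v1/chat/completions
     * endpoint with libcurl. The raw JSON response is printed to
     * stdout by libcurl's default write handler. */
    #include <curl/curl.h>
    #include <stdio.h>

    int main(void) {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        const char *body =
            "{\"messages\":[{\"role\":\"user\","
            "\"content\":\"Say hello in one sentence.\"}]}";

        struct curl_slist *hdrs =
            curl_slist_append(NULL, "Content-Type: application/json");

        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://localhost:8080/v1/chat/completions");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
    }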

Key Capabilities
  • int8/int4 quantized matrix-multiplication kernels for AVX2, AVX-VNNI, and NEON
  • GPU offload via CUDA, Metal, or OpenCL (see the sketch after this list)
  • LoRA/QLoRA fine-tuning utilities
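A minimal sketch of partial GPU offload through the C API (llama.h), assuming a recent release: symbol names have drifted across versions (older releases use llama_load_model_from_file and llama_new_context_with_model), and the model path is a placeholder.

    /* Load a GGUF model with some layers offloaded to the GPU,
     * then create an inference context. Error handling trimmed. */
    #include "llama.h"
    #include <stdio.h>

    int main(void) {
        llama_backend_init();

        struct llama_model_params mp = llama_model_default_params();
        mp.n_gpu_layers = 32;  /* offload 32 transformer layers (CUDA/Metal/...) */

        struct llama_model *model =
            llama_model_load_from_file("model-q4_k_m.gguf", mp);  /* placeholder path */
        if (!model) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        struct llama_context_params cp = llama_context_default_params();
        cp.n_ctx = 4096;  /* context window in tokens */
        struct llama_context *ctx = llama_init_from_model(model, cp);

        /* ... tokenize the prompt, llama_decode(), sample tokens ... */

        llama_free(ctx);
        llama_model_free(model);
        llama_backend_free();
        return 0;
    }

Setting n_gpu_layers to 0 keeps inference entirely on the CPU, which is the default; quantized GGUF weights, e.g. produced with the bundled llama-quantize tool, keep the memory footprint modest either way.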

Information

  • Website: github.com
  • Authors: ggml-org
  • Published date: 2023/03/10
