NexaSDK

NexaSDK is a cross‑platform developer toolkit and low‑level inference engine (NexaML) for running AI models locally on NPUs, GPUs and CPUs. It supports the GGUF, MLX and .nexa model formats, offers Day‑0 support for new architectures, and provides multimodal capabilities (text, vision, audio), mobile SDKs (Android/iOS), OpenAI‑compatible APIs and dedicated NPU optimizations.

Introduction

NexaSDK (NexaML)

NexaSDK is a developer-oriented SDK and inference engine designed to run a wide range of AI models locally and efficiently across heterogeneous hardware — NPUs, GPUs and CPUs — for desktop, mobile and embedded scenarios.

Core capabilities
  • Low-level inference engine (NexaML): built from scratch at the kernel level to provide highly optimized inference and to enable Day‑0 support for new model architectures and formats.
  • Multi-format support: runs models in GGUF, MLX and Nexa's own .nexa format, enabling compatibility with many community and vendor model builds.
  • Cross-platform: desktop (Linux, macOS, Windows), mobile (Android/iOS), and embedded/automotive targets.
  • NPU-first optimizations: explicit support and optimizations for NPUs (e.g., Qualcomm Hexagon, Apple ANE, Intel/AMD NPUs) to deliver lower latency and higher throughput on supported devices.
  • Multimodal support: LLMs and VLMs, plus image, audio, text, embedding, ASR and TTS pipelines, with demo and CLI integration for multimodal inputs.
  • OpenAI-compatible API: can serve models via an OpenAI-compatible REST interface and supports function calling semantics.
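
Because the server speaks the standard OpenAI REST dialect, any off-the-shelf OpenAI client can talk to a locally served model. The sketch below uses the official openai Python package; the local port, the placeholder API key and the model ID are illustrative assumptions rather than values documented here.

    # Minimal sketch: chat completion against a locally served model through
    # the OpenAI-compatible interface described above.
    # Assumed (not from the text above): the server is already running on
    # localhost:8080 and the model ID matches a repo pulled earlier.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",   # assumed local endpoint
        api_key="not-needed-for-local",        # local servers typically ignore the key
    )

    response = client.chat.completions.create(
        model="NexaAI/Qwen3-VL-4B-Instruct-GGUF",  # model ID reused from the CLI example below
        messages=[{"role": "user", "content": "Explain what an NPU is in one sentence."}],
    )

    print(response.choices[0].message.content)
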
Differentiators
  • Kernel-level inference engine (not just a wrapper): NexaML implements optimizations at a lower level than wrapper runtimes, allowing earlier and broader hardware support and fine-grained control.
  • Mobile and NPU focus: packaged Android/iOS SDKs with NPU/GPU/CPU backends and a set of prebuilt NPU-native model builds (Granite, Qwen3-ANE, Gemma3n, etc.).
  • Day‑0 model support and curated model hub: quick availability of new model architectures and model builds in multiple formats, plus a model wishlist and community-driven prioritization.

Typical uses
  • On-device inference for mobile apps (real-time assistants, vision+language features).
  • Edge and automotive AI (in-car assistants, low-latency multimodal inference).
  • Local development and deployment: run and test models locally using the nexa CLI, serve models via an OpenAI-compatible server, or embed the SDK in applications.
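
For the assistant and local-development scenarios above, the function-calling semantics noted under core capabilities can be exercised through the same OpenAI-compatible endpoint. A hedged sketch, assuming the same local server and a purely illustrative tool definition:

    # Hedged sketch of a function-calling round trip over the OpenAI-compatible
    # endpoint; the tool name, schema, port and model ID are illustrative
    # assumptions, not part of NexaSDK's documented surface.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-for-local")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_cabin_temperature",   # hypothetical in-car helper
            "description": "Read the current cabin temperature in degrees Celsius.",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }]

    resp = client.chat.completions.create(
        model="NexaAI/Qwen3-VL-4B-Instruct-GGUF",   # assumed model ID
        messages=[{"role": "user", "content": "Is the cabin too warm right now?"}],
        tools=tools,
    )

    # If the model chose to call the tool, print its name and parsed arguments;
    # a real application would execute the function and send the result back.
    for call in resp.choices[0].message.tool_calls or []:
        print(call.function.name, json.loads(call.function.arguments or "{}"))
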
Quickstart & tooling
  • CLI: single-line commands to run models directly from Hugging Face repo IDs (e.g., nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF); a quick verification sketch follows this list.
  • Prebuilt installers: downloadable CLI packages for Windows, macOS and Linux (including ARM builds with NPU support).
  • Documentation & examples: comprehensive docs, Android/iOS bindings, demo apps and a builder bounty program to encourage integration.
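
After installing the CLI and pulling a model, one quick way to confirm that local clients can see it is to list what the OpenAI-compatible server reports as available. Whether the local server implements the standard models-listing route, and on which port it listens, are assumptions here:

    # Sanity check: enumerate models exposed by the local OpenAI-compatible
    # server. The /v1/models route and the port are assumed, not documented above.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-for-local")

    for model in client.models.list():
        print(model.id)
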
Notable integrations and wins (selection from README)
  • Highlighted by hardware vendors and technical blogs for its Qualcomm and AMD NPU support.
  • Supports models like OmniNeural-4B, Granite-4, Qwen3‑VL, Gemma3n across NPUs and other backends.
  • Provides ANE- and Hexagon-optimized model builds and platform-specific downloads.

How it fits in an AI stack

NexaSDK sits at the inference/deployment layer: it bridges model formats and hardware runtimes, enabling developers to run modern foundation and multimodal models locally with production-ready optimizations for mobile and edge NPUs. It complements training and model‑development workflows by focusing on efficient serving and local inferencing.