NexaSDK (NexaML)
NexaSDK is a developer-oriented SDK and inference engine designed to run a wide range of AI models locally and efficiently across heterogeneous hardware — NPUs, GPUs and CPUs — for desktop, mobile and embedded scenarios.
Core capabilities
- Low-level inference engine (NexaML): built from scratch at the kernel level to provide highly optimized inferencing and to enable Day‑0 support for new model architectures and formats.
- Multi-format support: runs models in GGUF, MLX and Nexa's own .nexa format, enabling compatibility with many community and vendor model builds.
- Cross-platform: desktop (Linux, macOS, Windows), mobile (Android/iOS), and embedded/automotive targets.
- NPU-first optimizations: explicit support and optimizations for NPUs (e.g., Qualcomm Hexagon, Apple ANE, Intel/AMD NPUs) to deliver lower latency and higher throughput on supported devices.
- Multimodal support: LLMs and VLMs, plus image, audio, text, embedding, ASR and TTS pipelines, with demo apps and CLI integration for multimodal inputs.
- OpenAI-compatible API: can serve models via an OpenAI-compatible REST interface and supports function-calling semantics (see the client sketch below).
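Because the interface follows the OpenAI protocol, any standard OpenAI client can talk to a locally served model. The sketch below is a minimal example, assuming a local server is already running; the base URL, port, API key and model id are illustrative placeholders, not NexaSDK defaults.

```python
# Minimal chat-completion call against a local OpenAI-compatible server.
# Assumptions: the server is reachable at http://localhost:8080/v1 and the
# model id below matches a model you have already pulled and are serving.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # illustrative; use your server's address
    api_key="not-needed-for-local",       # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-4B-Instruct-GGUF",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize what an NPU is in one sentence."}],
)

print(response.choices[0].message.content)
```

Existing OpenAI client code can usually be pointed at the local endpoint by changing only the base_url.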
Differentiators
- Kernel-level inference engine (not just a wrapper): NexaML implements optimizations at a lower level than wrapper runtimes, allowing earlier and broader hardware support and fine-grained control.
- Mobile and NPU focus: packaged Android/iOS SDKs with NPU/GPU/CPU backends and a set of prebuilt NPU-native model builds (Granite, Qwen3-ANE, Gemma3n, etc.).
- Day‑0 model support and curated model hub: quick availability of new model architectures and model builds in multiple formats, plus a model wishlist and community-driven prioritization.
Typical uses
- On-device inference for mobile apps (real-time assistants, vision+language features).
- Edge and automotive AI (in-car assistants, low-latency multimodal inference).
- Local development and deployment: run and test models locally using the Nexa CLI, serve models via an OpenAI-compatible server (a function-calling sketch follows after this list), or embed the SDK in applications.
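The overview above mentions function-calling semantics on the OpenAI-compatible server. The following is a hedged sketch of how a tool call could be issued through that interface; the endpoint, model id and the get_weather tool are hypothetical, and whether a given model build actually emits tool calls depends on the model itself.

```python
# Hedged function-calling sketch against an OpenAI-compatible endpoint.
# The endpoint, model id, and the get_weather tool are hypothetical examples.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-for-local")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-4B-Instruct-GGUF",  # placeholder model id
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```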
Quickstart & tooling
- CLI: single-line commands to run inference directly from Hugging Face repo IDs (e.g., nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF); a multimodal client sketch follows after this list.
- Prebuilt installers: downloadable CLI packages for Windows, macOS and Linux (including ARM builds with NPU support).
- Documentation & examples: comprehensive docs, Android/iOS bindings, demo apps and a builder bounty program to encourage integration.
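The Qwen3-VL build in the CLI example is a vision-language model. If the local OpenAI-compatible server accepts the standard image_url content part (an assumption, not something this overview confirms), an image can be sent alongside a text prompt roughly as follows; the endpoint, model id and image path are placeholders.

```python
# Hedged sketch: sending an image plus a text prompt to a locally served VLM,
# assuming the OpenAI-compatible server accepts the standard image_url content
# part. The endpoint, model id, and image path are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-for-local")

with open("photo.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-4B-Instruct-GGUF",  # placeholder VLM build
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```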
Notable integrations and wins (selection from README)
- Highlighted by hardware vendors and blogs for Qualcomm and AMD NPU support.
- Supports models like OmniNeural-4B, Granite-4, Qwen3‑VL, Gemma3n across NPUs and other backends.
- Provides ANE- and Hexagon-optimized model builds and platform-specific downloads.
How it fits in an AI stack
NexaSDK sits at the inference/deployment layer: it bridges model formats and hardware runtimes, enabling developers to run modern foundation and multimodal models locally with production-ready optimizations for mobile and edge NPUs. It complements training and model‑development workflows by focusing on efficient serving and local inferencing.
