Running large agentic-code models locally is practical with this GGUF release: it packages quantized weights for CohereLabs' North-Mini-Code-1.0 and step-by-step notes to load and run them with llama.cpp (requires a cohere2_moe patch) or vLLM. The release focuses on making the model usable for tool-enabled coding workflows and long-context agent runs.
Key Capabilities
- Quantized GGUF files and sharded BF16 variants ready for local inference; some quants are single-file, BF16 is sharded. This lets you pick a quant that fits your GPU/CPU trade-offs.
- Compatibility notes and build/run commands for llama.cpp (requires PR branch with cohere2_moe) and vLLM; examples include llama-cli interactive chat, llama-server (OpenAI-compatible), and vLLM server invocation. Recommended sampling: temperature=1.0, top_p=0.95.
- Agentic coding support: the model was post-trained for tool use and interleaved reasoning (supports tool-call templates in Transformers and vLLM’s tool parsers). It supports very long contexts (up to 256K) for extended agent traces.
Who it helps and trade-offs
Great fit if you need to run an agent-capable code-generation LLM locally (offline inference, custom tooling, or research on tool use). Expect large resource needs (model is 30B total with Mixture-of-Experts internals), and some workflows require building llama.cpp from a PR branch or using vLLM main and Cohere’s melody tooling. If you need a drop-in CPU-only model with no build steps or a much smaller footprint, choose smaller or non-MoE models instead.
Where it fits
Use this release to evaluate agentic code workflows, test tool-calling integrations, or deploy locally with GPU offload. For hosted or production-grade serving, pair the quants with vLLM or validated inference stacks that support cohere2_moe architecture.
