Running large LLMs on-device is limited by model format, runtime compatibility, and long-context memory. Qwythos GGUF addresses those barriers by shipping production-ready GGUF quantizations of a post-trained 9B reasoning model so you can run Qwen3.5-derived capabilities locally with standard GGUF runtimes.
What Sets It Apart
- Quantized GGUF builds: multiple fixed-v3 quant files (Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16) and MTP-enabled variants, letting users choose size vs. fidelity for llama.cpp, Ollama, LM Studio and similar hosts. This makes it practical to run a full-parameter 9B reasoning model on consumer and server hardware.
- 1M-context YaRN rope-scaling: GGUFs include rope-scaling for a 1,048,576-token context window (usable when your runtime/GPU stack supports the memory and KV-cache requirements). Enables very long documents, codebases, or multi-file reasoning sessions.
- Native function-calling and chat template: supports Qwen3.5-style tool-call blocks for tool-use loops; default chat template emits a <think>...</think> block before final answers (affects prompt parsing and token budgeting).
- Multimodal support via mmproj: image input works out of the box when a matching mmproj-Qwythos-*.gguf is loaded; note the vision tower was inherited (not SFT-finetuned) from Qwen3.5-9B, so image behavior matches the base model.
- Practical guidance included: recommended sampling settings, recommended default quant (Q4_K_M), an mmproj F16 file for image encoding, and a v3 hotfixed chat template to avoid looping and agentic issues.
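
Because the default chat template puts a <think>...</think> block ahead of the final answer, client code usually has to separate the two before showing or storing a reply. The following is a minimal sketch of that step, assuming the raw completion arrives as a single string with at most one think block; the helper name `split_reasoning` is hypothetical and not part of any runtime's API.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    Hypothetical helper: assumes at most one think block per completion.
    If no block is found, the whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is kept as the visible answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\n2+2=4\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(answer)
```

Keeping the reasoning text separate also makes it easier to count how many tokens of the output budget went to thinking rather than to the answer.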
Who It's For and Tradeoffs
Great fit if you need a locally runnable, quantized reasoning model with very long context and tool-use support: researchers, red-teamers, developers integrating local tool loops, or teams building multimodal local deployments. The GGUF packaging simplifies deployment to llama.cpp, Ollama, LM Studio, Jan, and other GGUF runtimes.
Look elsewhere if you require a vision tower that was SFT-finetuned on image-paired data (Qwythos' vision path was frozen during SFT), if you need strict safety/sanitization out of the box (model is described as "uncensored" and requires app-level safety), or if you cannot meet the memory/GPU requirements to use the full 1M context (the full window often needs multi-GPU or aggressive KV-cache offload).
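
To get a feel for why the full window is demanding, a rough KV-cache estimate helps. The sketch below uses the textbook formula for a transformer that caches keys and values in every layer. The layer count, KV-head count, and head dimension are placeholder assumptions, not Qwythos's actual configuration; architectures that do not keep a full K/V cache in every layer, or runtimes that quantize the cache, will need less. Read the real values from the GGUF metadata before relying on any number.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size for a model caching K and V in every layer.

    The leading factor 2 accounts for one key and one value tensor per layer.
    bytes_per_value=2 corresponds to a 16-bit cache (an assumption).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Placeholder architecture values, NOT taken from the Qwythos model card.
ASSUMED_LAYERS = 32
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 128

for n_tokens in (32_768, 262_144, 1_048_576):
    size = kv_cache_bytes(n_tokens, ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM)
    print(f"{n_tokens:>9} tokens -> {size / 2**30:.1f} GiB")
```

Under these placeholder values the cache grows linearly at 128 KiB per token, which is why the last step from a few hundred thousand tokens to the full window tends to be the one that forces multi-GPU setups or KV-cache offload.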
