Why this matters
Running a capable agentic coding assistant locally usually requires heavy hardware or cloud access. This release packs a Gemma 4 12B fine-tune into compact GGUF quants so you can run a private coding + tool-using agent on modest hardware (≈4.5 GB VRAM/unified memory) while keeping an agentic read→reason→act→verify workflow intact.
What Sets It Apart
- Agentic-first fine-tune: training emphasizes multi-step terminal/tool trajectories (read → reason → act → verify), which markedly improves real-world debugging/terminal loops compared to the base Gemma 4 assistant. The author reports a tau2-bench telecom jump from ~15% (base) to ~55% (v2) under identical local Q8_0 conditions — roughly a 3.5× relative improvement on that agentic benchmark.
- Practical local deployment: ships ready GGUF quants in several sizes (Q3_K_M 5.7 GB, Q4_K_M ~6.87 GB recommended, Q6_K ~9.11 GB, Q8_0 ~11.8 GB) so users can pick a trade-off between VRAM and fidelity. Recommended runtime is llama.cpp with the gemma4_unified loader; a specific llama.cpp build (b9553) is recommended for MTP/speculative-draft support due to loader sensitivity in newer builds.
- Grounded tool behavior: the fine-tune preserves a “read-before-act” habit — it tends to grep/read/ls first and avoid fabricating file paths or values in terminal tasks, matching the base model on a fabrication probe.
- Open license and provenance: published under Apache-2.0 and built on google/gemma-4-12B-it; the release includes a full-precision safetensors master for builders and quantized GGUFs for users.
Who it's for and trade-offs
Great fit if: you need a local coding assistant that can use tools and operate in multi-step terminal workflows, or you want an on-device agentic model that runs on small GPUs or unified-memory laptops.
Look elsewhere if: you need a broad generalist for knowledge-heavy benchmarks (v2 deliberately trades a bit of general MMLU-style breadth for agentic/coding capability), require strong safety guardrails out of the box (v2 is task-focused and not safety-aligned), or depend on GUI-first integrations rather than terminal/tool pipelines.
Practical notes
- Recommended quant: Q4_K_M (sweet spot). Smallest reliable quant: Q3_K_M; Q2_K was withheld. Full-quality: Q8_0.
- Runtime tips: use llama.cpp with --jinja to pass tools via the OpenAI-style tools field; for MTP/speculative decoding, llama.cpp b9553 (commit cited by the author) is noted as verified. If you see repeating-output artifacts, adjust sampler settings (rep_pen and temperature) as recommended by the author.
- Limitations: English-centric, reduced refusals due to task-focused fine-tuning (add external guardrails for production), and some remaining failure modes include over-trying or retry loops on hard agentic tasks.
Bottom line: an opinionated, locally runnable Gemma 4 12B fine-tune that substantially ups agentic/terminal performance at the cost of a small hit to generalist benchmarks — a practical choice for developers who want a private, tool-using coding agent on constrained hardware.
