35B Mixture-of-Experts agent model for long-horizon, multi-domain agent workflows; trained with a knowledge–action infrastructure that produces ~45K-token trajectories and supports native tool calling and function integration for research and deployment.
Benchmark for evaluating procedural skill evolution in LLM agents: isolates reusable skill bodies, role-specific work surfaces, and hidden oracle assets to measure whether skill refinements transfer across tasks, roles, and model backbones. Includes 382 workplace tasks, 22 skills, and a controlled evaluation protocol.
Thinking-off fine-tune for coding-agent workflows that prioritizes fast next-step decisions, lower token usage, and stable multi-turn tool calling. Highlights: 35B MoE base, MTP speculative decoding, SWE-bench 62.4% (300 cases). Best for local agent loops and automated debug cycles; requires disciplined harnessing and schema consistency.
Most current scaling work focuses on parameter count; this project demonstrates an alternative: scale the agent "horizon" (long, structured trajectories) to achieve frontier-level agentic performance with a ~35B MoE model. Agents-A1 is trained on domain-grounded knowledge–action trajectories (average length ~45K tokens) and unified across six heterogeneous domains using a three-stage recipe (full-domain supervised fine-tuning, domain teacher models, and multi-teacher on-policy distillation). The result is a deployable agent that narrows the gap with much larger models on multi-step, tool-using, and research-oriented tasks.
Great fit if you need an open, reproducible agentic model that handles very long contexts and tool-enabled workflows (research labs, agent developers, MLOps teams using vLLM/SGLang). Look elsewhere if you need a lightweight on-device assistant or minimal-resource inference: Agents-A1 expects substantial memory and serving infrastructure to realize its 262K+ context and MoE runtime advantages. Also note the practical dependencies: best results come from reusing the provided serving stacks (vLLM, SGLang), with quantized variants for constrained hardware. Operational caveats include added complexity around tool chains (errors in external tools can cascade) and standard model risks such as hallucination when verifier or tool signals are noisy.
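One way to limit the cascading tool errors mentioned above is to have the harness turn every tool failure into a structured result that is fed back to the model, rather than letting an exception end the loop. The following is a minimal sketch under stated assumptions, not part of Agents-A1 or its serving stacks: `ToolResult`, `call_tool`, `REGISTRY`, and the `divide` tool are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolResult:
    """Outcome of one tool call, in a form that can be passed back to the model."""
    name: str
    ok: bool
    output: Any = None
    error: str = ""


def call_tool(
    registry: Dict[str, Callable[..., Any]],
    name: str,
    arguments: Dict[str, Any],
) -> ToolResult:
    """Run one tool and convert any failure into a structured error result."""
    tool = registry.get(name)
    if tool is None:
        # The model asked for a tool the harness does not expose.
        return ToolResult(name=name, ok=False, error=f"unknown tool: {name}")
    try:
        return ToolResult(name=name, ok=True, output=tool(**arguments))
    except Exception as exc:
        # Deliberately broad: a failing tool (or malformed arguments)
        # should be reported to the model, not crash the agent loop.
        return ToolResult(name=name, ok=False, error=f"{type(exc).__name__}: {exc}")


# Hypothetical tool, for illustration only.
def divide(a: float, b: float) -> float:
    return a / b


REGISTRY = {"divide": divide}

good = call_tool(REGISTRY, "divide", {"a": 6, "b": 3})
bad = call_tool(REGISTRY, "divide", {"a": 1, "b": 0})
missing = call_tool(REGISTRY, "search", {"query": "x"})
```

Here `good.ok` is true, while `bad` and `missing` carry an error string instead of raising, so the agent loop can decide whether to retry, pick another tool, or stop. A real harness would likely add timeouts and argument validation on top of this.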