Why this matters Large language models face a practical bottleneck: context windows are finite while real research requires chaining many decisions and external lookups over long horizons. The surprising gap is not compute but delegation intelligence — knowing when to split, what to delegate, and how to integrate concise returns so the main agent can continue without exhausting context.
Key Findings
- Harness-guided trajectories: The authors design a harness that constrains subagents to return concise, structured summaries and steers the main agent toward high-quality task decomposition. Those guided trajectories encode correct delegation decisions suitable for supervised fine-tuning.
- Supervised fine-tuning for delegation: By training on harness-generated examples, the resulting model internalizes when and how to delegate rather than relying on brittle at-inference orchestration.
- Empirical gains: Their 30B model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, reported as the best among comparable-scale models in their evaluation.
- Practical workflow savings: Delegation reduces context load by having subagents perform focused searches and return summarized results, enabling longer multi-step research workflows without blowing the main agent’s context budget.
Who it fits and tradeoffs
Great fit if you need an LLM agent to run long-horizon, research-style tasks where (1) iterative web/search calls are required, (2) structured summaries from subagents suffice, and (3) you can fine-tune models with synthetic supervised trajectories. Look elsewhere if you require rich, unabridged subagent outputs (not summaries), guaranteed real-time interactivity with complex external tools, or if you cannot afford scale-30B models and their inference costs.
Method snapshot
The paper’s pipeline: (1) define a harness that enforces decomposition quality and summary-return constraints for subagents; (2) generate end-to-end trajectories where the main agent delegates, subagents execute constrained subtasks, and return structured summaries; (3) use those trajectories as supervised fine-tuning data so the main model learns delegation policies embedded in its weights. The authors plan to release the harness, training data, and model weights to help reproducibility and follow-up research.
