Most attempts to automate experiments treat each run as a one-off. The core insight here is that autonomous research becomes substantially more productive when experiments, evidence, and distilled lessons are stored and used to guide future strategy, turning many local trials into a cumulative, long-horizon search.
Key Findings
- Persistent Hypothesis Tree (HTR): Maintains linked hypotheses, artifacts, evidence, and distilled insights so that successful lessons propagate to future experiments. This reduces repeated dead ends and focuses the search. (So what: saves compute and accelerates progress by reusing verified improvements.)
- Two-tier architecture: A long-lived coordinator manages strategy over the tree, while short-lived executors implement and test individual hypotheses in isolated worktrees. (So what: separates meta-level strategy from noisy experiment execution, improving fault isolation and reproducibility.)
- Strong empirical gains under Autonomous Optimization: Across six real research tasks (training, harness engineering, data synthesis), the framework outperforms baseline agents and achieves substantially higher held-out gains. (So what: demonstrates practical benefit on end-to-end research-style workflows.)
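
To make the persistent tree in the first finding more concrete, here is a minimal sketch of what one node might look like. It is not the framework's actual data model: the `Hypothesis` class, its field names, the `inherited_insights` helper, and the example claims and lessons are all hypothetical, chosen only to show how a lesson recorded on one branch can reach later experiments beneath it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    """One node in an illustrative hypothesis tree (all names are assumptions)."""
    claim: str
    # Excluded from repr/compare to keep parent<->child cycles out of both.
    parent: Optional["Hypothesis"] = field(default=None, repr=False, compare=False)
    children: list["Hypothesis"] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)   # e.g. paths to logs or metrics
    insights: list[str] = field(default_factory=list)   # lessons distilled at this node
    status: str = "open"                                 # "open", "verified", or "rejected"

    def add_child(self, claim: str) -> "Hypothesis":
        """Attach a more specific hypothesis beneath this one."""
        child = Hypothesis(claim=claim, parent=self)
        self.children.append(child)
        return child

    def inherited_insights(self) -> list[str]:
        """Collect insights from the root down to this node, so a new
        experiment sees every lesson recorded along its ancestry."""
        chain = []
        node: Optional["Hypothesis"] = self
        while node is not None:
            chain.append(node)
            node = node.parent
        collected: list[str] = []
        for ancestor in reversed(chain):  # root first, this node last
            collected.extend(ancestor.insights)
        return collected


# Made-up example content, for illustration only.
root = Hypothesis("Improve validation accuracy of the baseline model")
root.insights.append("Very high learning rates diverged")
aug = root.add_child("Add data augmentation")
aug.insights.append("Random crop helped; color jitter did not")
mix = aug.add_child("Combine random crop with mixup")

print(mix.inherited_insights())
```

The design choice worth noting is that lessons live on the node where they were learned and are gathered on demand by walking up the tree, so a new branch inherits context without anything being copied.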
Who it's for and tradeoffs
Great fit if you want an agentic system that can run iterative ML experiments with minimal step-level supervision and you care about accumulating reusable lessons over many trials. Look elsewhere if you need lightweight single-shot automation, have strict reproducibility constraints across external environments, or lack the compute or other resources for long-horizon agent runs. The approach depends on careful orchestration (coordinator plus executors), reliable experiment isolation, and the quality of the underlying model-based evaluators.
How it works (brief)
Coordinator components manage global policy and update the Hypothesis Tree as results arrive; executors create isolated worktrees to implement and test individual hypotheses. HTR updates the tree by admitting verified improvements, propagating distilled heuristics, and refining the search frontier, turning episodic attempts into cumulative progress. Evaluations use an Autonomous Optimization protocol and benchmarks (e.g., MLE-Bench Lite) to measure held-out gains.
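
The coordinator and executor split can be sketched roughly as follows. This is a toy under stated assumptions, not the framework's implementation: `run_in_isolation` and `coordinate` are hypothetical names, a temporary directory stands in for an isolated worktree, the scores are made up, and "admit only if the score beats the current best" is one simple reading of "admitting verified improvements".

```python
import tempfile
from pathlib import Path
from typing import Callable

Experiment = Callable[[Path], float]


def run_in_isolation(name: str, experiment: Experiment) -> float:
    """Executor stand-in: run one experiment in a throwaway directory.
    The temporary directory substitutes for an isolated worktree."""
    with tempfile.TemporaryDirectory(prefix=f"{name}-") as workdir:
        return experiment(Path(workdir))


def coordinate(candidates: dict[str, Experiment], baseline: float):
    """Coordinator stand-in: try each candidate and admit only improvements."""
    best = baseline
    admitted: list[str] = []
    lessons: list[str] = []
    for name, experiment in candidates.items():
        try:
            score = run_in_isolation(name, experiment)
        except Exception as exc:
            # A failing executor is recorded as a lesson and does not stop the loop.
            lessons.append(f"{name}: failed ({exc})")
            continue
        if score > best:
            admitted.append(name)
            lessons.append(f"{name}: improved {best:.2f} -> {score:.2f}")
            best = score
        else:
            lessons.append(f"{name}: no gain ({score:.2f} <= {best:.2f})")
    return best, admitted, lessons


def write_and_score(score: float) -> Experiment:
    """Build a toy experiment that writes a result file and reads it back."""
    def experiment(workdir: Path) -> float:
        result_file = workdir / "result.txt"
        result_file.write_text(str(score))
        return float(result_file.read_text())
    return experiment


def crashing(workdir: Path) -> float:
    raise RuntimeError("out of memory")


# Made-up candidates and scores, for illustration only.
candidates = {
    "warmup": write_and_score(0.72),
    "bigger_batch": write_and_score(0.70),
    "broken_patch": crashing,
}
best, admitted, lessons = coordinate(candidates, baseline=0.71)
for line in lessons:
    print(line)
```

Keeping the try/except in the coordinator rather than in each experiment reflects the fault-isolation idea: an executor can crash freely, and the long-lived loop still records the outcome and moves on.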
