AI Agent Papers (2026)

Orchestra-o1: Omnimodal Agent Orchestration

Orchestrates teams of sub-agents across text, image, audio, and video through modality-aware task decomposition, online sub-agent specialization, and parallel execution. Introduces DA-GRPO to train Orchestra-o1-8B and reports a ~10.3% absolute accuracy gain on the OmniGAIA benchmark.


Introduction

Agent swarms have made orchestration, rather than individual reasoning, the core engineering bottleneck for scaling LLM-based systems. Orchestra-o1 treats orchestration as a first-class design problem: instead of relying on a monolithic multimodal model, it composes many lightweight, modality-aware sub-agents and coordinates them through a unified orchestration mechanism that supports online specialization and parallel subtask execution.
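
The coordination pattern described above can be pictured with a minimal Python sketch. It is an illustration only, not the paper's implementation: the names Subtask, decompose, run_sub_agent, and orchestrate are hypothetical, the sub-agent is a stub standing in for a real modality specialist, and the decomposition simply assumes one subtask per input modality.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Subtask:
    modality: str  # e.g. "text", "image", "audio", "video"
    payload: str   # hypothetical: a prompt, file path, or URL


def decompose(inputs: dict[str, str]) -> list[Subtask]:
    # Assumed decomposition rule: one subtask per modality present in the input.
    return [Subtask(modality, payload) for modality, payload in inputs.items()]


async def run_sub_agent(task: Subtask) -> str:
    # Stub for a modality specialist; a real system would call a model here.
    await asyncio.sleep(0)
    return f"[{task.modality}] processed {task.payload}"


async def orchestrate(inputs: dict[str, str]) -> list[str]:
    subtasks = decompose(inputs)
    # Run all sub-agents concurrently; gather returns results in input order.
    results = await asyncio.gather(*(run_sub_agent(t) for t in subtasks))
    return list(results)


def run(inputs: dict[str, str]) -> list[str]:
    # Synchronous entry point for scripts and tests.
    return asyncio.run(orchestrate(inputs))


if __name__ == "__main__":
    print(run({"text": "What is shown?", "image": "photo.png"}))
```

In this sketch, concurrency comes from asyncio.gather, so total latency is bounded by the slowest sub-agent rather than the sum of all of them. Runtime specialization would replace the single stub with per-subtask agent construction.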

Key Findings

Modality-aware decomposition: tasks are split by modality (text, image, audio, video) so that each sub-agent focuses on the signal it handles best. This reduces cross-modal confusion and makes parallel execution effective.
Online sub-agent specialization: the orchestrator spawns and adapts sub-agents at runtime for subtask-specific behavior, which improves flexibility when task demands change mid-dialog.
Parallel execution and scalability: designed to run sub-tasks concurrently, improving throughput on multi-step, multi-source tasks compared with sequential-agent baselines.
DA-GRPO training for agentic RL: a decision-aligned group relative policy optimization method is used to train Orchestra-o1-8B; the trained system outperforms the previous best open-source omnimodal agents, showing a ~10.3% absolute accuracy gain on the OmniGAIA benchmark.
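
For orientation on the training finding, the sketch below shows the group-relative advantage used in standard GRPO, where each sampled rollout is scored against the mean and standard deviation of its own group. This is not DA-GRPO: the summary does not describe how the decision-aligned variant changes the objective, so nothing here should be read as the authors' method. The function name, the eps constant, and the use of the population standard deviation are assumptions made for the illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standard GRPO-style advantages for one group of rollouts.

    Each reward is normalized by the group's mean and standard deviation,
    so no learned value function (critic) is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for the same task: two succeed (1.0), two fail (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Successful rollouts receive positive advantages and failed ones negative, and a group with identical rewards contributes no learning signal.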

Who it's for and tradeoffs

Great fit if you are building research or production systems that must coordinate multiple modality specialists (e.g., vision, speech, and language) on complex, multi-turn tasks, and you want an architecture that supports parallelism and runtime specialization.

Look elsewhere if you need a single, end-to-end multimodal model with fewer moving parts, or if you cannot absorb the engineering and compute costs of running and training multiple sub-agents with agentic RL. The orchestration layer adds system complexity, and the DA-GRPO training step adds RL training cost.


Information

Website: arxiv.org
Authors: Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang …
Published date: 2026/06/10

More Items

Computer Vision Papers (2026)

HumanCLAW: Can Vision-Language Models Act Through a Body?

Siyao Li, Jiawei Gu +16

Evaluates whether vision-language models can make actionable decisions for a physical body by decoupling decision-making from low-level motor execution. Introduces HumanCLAW-Bench with 1,218 long-horizon egocentric episodes across 41 indoor scenes and diagnoses a lack of embodied self-awareness in current VLMs.

vision robotics evaluation benchmarks multimodal+2

Natural Language Processing Papers (2026)

Keep It InMind: Benchmarking the Implicit-Association Blind Spot in Agent Memory

Ruizhe Li, Mingxuan Du +2

Measures how agent memory systems miss implicitly associated facts by introducing InMind, a 125-task benchmark with paired controls that separate storage gaps, retrieval gaps, and knowledge gaps. Quantifies a large retrieval-interface blind spot and points to routing as the core open problem.

benchmark evaluation paper LLM NLP+3

Natural Language Processing Papers (2026)

A New Role for Relevance: Guiding Corpus Interaction in Agentic Search

Jiangnan Li, Yuqing Li +3

Turns document relevance into an execution prior for agentic corpus interaction: orders documents for sequential ripgrep traversal, seeds promising entry points with query-relevant paragraphs, and reranks grep matches to surface informative excerpts. Improves the accuracy–efficiency frontier on browse QA and reasoning-intensive retrieval.

retrieval RAG reasoning LLM NLP+3