Most evaluation work for audio editing remains fragmented and narrowly focused, so progress on general-purpose, instruction-driven audio editing is hard to measure. This benchmark addresses that gap by assembling a broad, taxonomy-driven testbed that stresses real-world complexity — from simple edits to multi-hop, multi-round, and mixed-modality scenarios — while giving evaluators concrete, verifiable checks rather than subjective pass/fail judgments.
Key Findings
- Breadth and granularity: the benchmark spans seven audio modalities (sound, speech, music, and their mixtures), six complexity levels, two levels of granularity, and eight operation types. That design forces models to handle modality shifts and compositional editing, not just single-operation fixes.
- Rubric-based evaluation: tasks are decomposed into 17,741 verifiable criteria, enabling precise measurement of instruction following, content consistency, and structural correctness rather than relying on coarse heuristics or single metrics.
- Diagnostic power: with 2,000 curated high-fidelity samples and human-agent collaboration in data curation, the benchmark surfaces systemic weaknesses — leading models show an Exact Match Rate (EMR) below 5% overall and 0% on complex mixed-modality tasks, highlighting gaps in fidelity and multi-step reasoning.
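To make the gap between per-criterion accuracy and EMR concrete, here is a minimal Python sketch. It rests on an assumption the text above does not spell out: a sample counts as an exact match only if every one of its rubric criteria passes. The names `SampleResult`, `criterion_pass_rate`, and `exact_match_rate` are hypothetical and do not come from the benchmark's tooling, and the verdicts are made-up toy data.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Rubric verdicts for one edited sample (hypothetical structure)."""
    sample_id: str
    criteria_passed: list[bool]  # one entry per rubric criterion


def criterion_pass_rate(results: list[SampleResult]) -> float:
    """Fraction of individual criteria that passed, pooled over all samples."""
    total = sum(len(r.criteria_passed) for r in results)
    passed = sum(sum(r.criteria_passed) for r in results)
    return passed / total if total else 0.0


def exact_match_rate(results: list[SampleResult]) -> float:
    """Fraction of samples whose criteria ALL passed.

    Assumption: this all-criteria-pass definition of EMR is illustrative,
    not confirmed by the source.
    """
    if not results:
        return 0.0
    exact = sum(1 for r in results if r.criteria_passed and all(r.criteria_passed))
    return exact / len(results)


# Toy data: three samples, twelve criteria in total, two single failures.
results = [
    SampleResult("a", [True, True, True]),
    SampleResult("b", [True, False, True, True]),
    SampleResult("c", [True, True, False, True, True]),
]

print(f"criterion pass rate: {criterion_pass_rate(results):.3f}")  # 0.833
print(f"exact match rate:    {exact_match_rate(results):.3f}")     # 0.333
```

Under this reading, one missed criterion is enough to fail a whole sample, so a model can satisfy most criteria and still post a very low EMR. If the 17,741 criteria are spread over the 2,000 samples, that is roughly nine criteria per sample on average, which makes an all-pass outcome demanding.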
Who It's For and Tradeoffs
Great fit if you need a standardized, diagnostic benchmark to compare audio-editing systems across diverse real-world scenarios, for example researchers building editing models, teams evaluating instruction-following fidelity, or teams selecting models based on benchmark results. The rubric helps pinpoint specific failure modes (e.g., modality confusion, missed constraints, or context inconsistency).
Look elsewhere if you only need small-scale or domain-specific tests (single-modality, single-operation) or if you require production-ready metrics tied directly to perceptual quality: this benchmark (MMAE) emphasizes verifiable, instruction-level correctness over purely subjective MOS-style scores. Curation and rubric granularity improve diagnostic clarity but increase annotation cost and evaluation complexity.
Where It Fits
This benchmark sits between narrowly scoped audio-editing testbeds and broader multimodal benchmarks: it is more comprehensive than operation-specific evaluations but remains focused on edit-instruction fidelity (not on downstream creative quality assessments or real-time editing latency). Use it to stress-test model reasoning about edits and to track fine-grained regressions across model versions.
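One way to track fine-grained regressions is to compare criterion-level verdicts between two model versions. The sketch below assumes each run is stored as a mapping from a criterion identifier to a pass/fail verdict; the `"sample/criterion"` id format and the function name `diff_verdicts` are hypothetical, chosen only for illustration.

```python
def diff_verdicts(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two runs criterion by criterion.

    Only criteria present in both runs are compared; ids missing from
    either run are ignored rather than guessed at.
    """
    shared = old.keys() & new.keys()
    return {
        # Passed before, fails now.
        "regressed": sorted(c for c in shared if old[c] and not new[c]),
        # Failed before, passes now.
        "fixed": sorted(c for c in shared if not old[c] and new[c]),
    }


# Toy verdicts for two versions of the same model (made-up data).
old_run = {"s1/c1": True, "s1/c2": True, "s2/c1": False}
new_run = {"s1/c1": True, "s1/c2": False, "s2/c1": True}

report = diff_verdicts(old_run, new_run)
print(report)  # {'regressed': ['s1/c2'], 'fixed': ['s2/c1']}
```

In this toy case the aggregate pass count is unchanged (two of three criteria in both runs), yet the diff shows one regression and one fix, which is the kind of movement a single summary score would hide.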
Methodology (brief)
Samples were collected and refined via human-agent collaboration and organized into a taxonomy of complexity and operation types. The rubric translates free-form editing instructions into explicit, checkable criteria so that multi-dimensional evaluation (instruction following, content preservation, consistency) can be automated or inspected by raters with clear guidance.
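As a rough illustration of how a rubric might turn a free-form instruction into checkable criteria, the following Python sketch groups criteria by the three dimensions named above and scores each dimension separately. Everything here is an assumption made for illustration: the `Criterion` and `Rubric` classes, the dimension labels, the example instruction, and the criterion wording are hypothetical and are not taken from the benchmark's data.

```python
from collections import defaultdict
from dataclasses import dataclass

# Dimension labels mirror the prose above; the exact strings are assumptions.
DIMENSIONS = ("instruction_following", "content_preservation", "consistency")


@dataclass(frozen=True)
class Criterion:
    dimension: str
    description: str  # a single yes/no statement a checker or rater can verify

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


@dataclass
class Rubric:
    instruction: str
    criteria: list[Criterion]


def score_by_dimension(rubric: Rubric, verdicts: list[bool]) -> dict[str, float]:
    """Return the pass fraction per dimension for one edited sample."""
    if len(verdicts) != len(rubric.criteria):
        raise ValueError("need exactly one verdict per criterion")
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for criterion, ok in zip(rubric.criteria, verdicts):
        total[criterion.dimension] += 1
        passed[criterion.dimension] += int(ok)
    return {dim: passed[dim] / total[dim] for dim in total}


# Hypothetical instruction and rubric, written only for this example.
rubric = Rubric(
    instruction="Remove the dog bark and lower the music volume; keep the speech unchanged.",
    criteria=[
        Criterion("instruction_following", "The dog bark is no longer audible."),
        Criterion("instruction_following", "The music is quieter than in the source."),
        Criterion("content_preservation", "The spoken words match the source."),
        Criterion("consistency", "No audible artifact appears where the bark was removed."),
    ],
)

# Verdicts could come from an automated checker or a human rater.
scores = score_by_dimension(rubric, [True, False, True, True])
print(scores)
# {'instruction_following': 0.5, 'content_preservation': 1.0, 'consistency': 1.0}
```

Keeping each criterion as one yes/no statement is what lets the same rubric be used by an automated checker or by a human rater, and the per-dimension breakdown shows where an edit fell short rather than only whether it did.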
