MMAE: A Massive Multitask Audio Editing Benchmark

Provides a comprehensive benchmark for instruction-based audio editing across seven audio modalities and eight operation types, with 2,000 high-fidelity samples and a rubric that decomposes tasks into 17,741 verifiable criteria for multi-dimensional evaluation.

Introduction

Most evaluation work for audio editing remains fragmented and narrowly focused, so progress on general-purpose, instruction-driven audio editing is hard to measure. This benchmark addresses that gap by assembling a broad, taxonomy-driven testbed that stresses real-world complexity — from simple edits to multi-hop, multi-round, and mixed-modality scenarios — while giving evaluators concrete, verifiable checks rather than subjective pass/fail judgments.

Key Findings
  • Breadth and granularity: the benchmark spans seven audio modalities (sound, speech, music, and mixtures of these), six complexity levels, two levels of granularity, and eight operation types. That design forces models to handle modality shifts and compositional editing, not just single-operation fixes.
  • Rubric-based evaluation: tasks are decomposed into 17,741 verifiable criteria, enabling precise measurement of instruction following, content consistency, and structural correctness rather than relying on coarse heuristics or single metrics.
  • Diagnostic power: with 2,000 curated high-fidelity samples and human-agent collaboration in data curation, the benchmark surfaces systemic weaknesses — leading models show an Exact Match Rate (EMR) below 5% overall and 0% on complex mixed-modality tasks, highlighting gaps in fidelity and multi-step reasoning.
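
A rough calculation helps show why a strict all-or-nothing metric can sit far below per-criterion accuracy. The sketch below is illustrative only. It assumes that EMR counts a sample as matched only when every one of its criteria passes, and that criteria pass independently of each other; the source confirms neither assumption. The 0.7 pass rate is a hypothetical value, not a reported result, and the function names are made up for this example.

```python
# Figures taken from the benchmark description above.
TOTAL_CRITERIA = 17_741
TOTAL_SAMPLES = 2_000


def mean_criteria_per_sample(total_criteria: int, total_samples: int) -> float:
    """Average number of rubric criteria attached to one sample."""
    return total_criteria / total_samples


def all_pass_probability(per_criterion_pass_rate: float, n_criteria: int) -> float:
    """Chance that all n criteria pass, ASSUMING independent criteria."""
    return per_criterion_pass_rate ** n_criteria


avg = mean_criteria_per_sample(TOTAL_CRITERIA, TOTAL_SAMPLES)
print(f"average criteria per sample: {avg:.2f}")  # 8.87

# Hypothetical model that satisfies 70% of individual criteria.
p_all = all_pass_probability(0.7, 9)
print(f"all-pass probability with 9 criteria: {p_all:.1%}")  # 4.0%
```

Under these assumptions, a model that satisfies most individual criteria can still land in the single digits on an all-criteria metric, which is why per-criterion results are worth reading alongside EMR.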

Who It's For and Tradeoffs

Great fit if you need a standardized, diagnostic benchmark to compare audio-editing systems across diverse real-world scenarios (researchers building editing models, teams evaluating instruction-following fidelity, or benchmark-driven model selection). The rubric helps pinpoint specific failure modes (e.g., modality confusion, missed constraints, or context inconsistency).

Look elsewhere if you only need small-scale or domain-specific tests (single-modality, single-operation) or if you require production-ready metrics tied directly to perceptual quality — MMAE emphasizes verifiable, instruction-level correctness over purely subjective MOS-style scores. Curation and rubric granularity improve diagnostic clarity but increase annotation cost and evaluation complexity.

Where It Fits

This benchmark sits between narrowly scoped audio-editing testbeds and broader multimodal benchmarks: it is more comprehensive than operation-specific evaluations but remains focused on edit-instruction fidelity, not on downstream creative quality assessments or real-time editing latency. Use it to stress-test model reasoning about edits and to track fine-grained regressions across model versions.

Methodology (brief)

Samples were collected and refined via human-agent collaboration and organized into a taxonomy of complexity and operation types. The rubric translates free-form editing instructions into explicit, checkable criteria so that multi-dimensional evaluation (instruction following, content preservation, consistency) can be automated or inspected by raters with clear guidance.
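
One way to picture rubric-based scoring is a flat list of pass/fail criteria, each tagged with its sample and evaluation dimension. The sketch below is a minimal illustration, not the benchmark's actual schema or scoring code: the `Criterion` fields, the dimension labels, and both function names are assumptions made for this example. It also assumes that Exact Match Rate means the fraction of samples whose criteria all pass, which the source does not define.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One verifiable rubric check (hypothetical schema)."""
    sample_id: str
    dimension: str      # e.g. "instruction_following"
    description: str
    passed: bool


def pass_rate_by_dimension(criteria):
    """Fraction of passed criteria within each evaluation dimension."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for c in criteria:
        totals[c.dimension] += 1
        passes[c.dimension] += int(c.passed)
    return {dim: passes[dim] / totals[dim] for dim in totals}


def exact_match_rate(criteria):
    """Fraction of samples whose criteria ALL pass (assumed EMR definition)."""
    per_sample = defaultdict(list)
    for c in criteria:
        per_sample[c.sample_id].append(c.passed)
    if not per_sample:
        return 0.0
    matched = sum(all(results) for results in per_sample.values())
    return matched / len(per_sample)


# Made-up judgments for two samples, for illustration only.
judgments = [
    Criterion("s1", "instruction_following", "target sound removed", True),
    Criterion("s1", "content_preservation", "speech left intact", True),
    Criterion("s2", "instruction_following", "tempo raised as requested", False),
    Criterion("s2", "content_preservation", "melody unchanged", True),
]

print(pass_rate_by_dimension(judgments))
print(exact_match_rate(judgments))
```

Keeping criteria in a flat list makes it cheap to slice the same judgments by dimension, by sample, or by any other tag (modality, operation type) without re-running evaluation.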

Information

  • Website: arxiv.org
  • Authors: Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen
  • Published date: 2026/06/05