AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

• AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding AuthorsSanjay Chowdhuryâ , Karren D. • Yang**, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manochaâ , Chun-Liang Li, Raviteja Vemulapalli View publication Copy Bibtex Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning tracking who speaks, maintaining roles, and grounding events across time. • These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. • We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. • It evaluates MLLMs across three modes zero-shot, guided, and agentic and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. • Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation.

Article Summaries:

Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-

Sources:

https://machinelearning.apple.com/research/amuse (Latest source article published: 2026-02-24 00:00 UTC)